Job Description
What you will do:
Design, implement, and maintain scalable, reliable, and secure infrastructure using cloud technologies (currently GCP).
Develop and automate monitoring, alerting, and incident response processes to ensure the highest level of service availability.
Collaborate with development teams to enhance the reliability and performance of applications through best practices and automation.
Manage and resolve software development incidents and system failures by performing root cause analysis, implementing timely fixes and corrective measures, and conducting postmortems to prevent recurrence.
Develop and maintain comprehensive documentation for infrastructure, processes, and procedures.
Participate in on-call rotations to provide 24/7 support for critical systems and respond to incidents promptly.
Continuously improve system observability and monitoring using tools such as Prometheus, Grafana, and Datadog.
Implement and manage CI/CD pipelines to streamline the deployment process and ensure rapid, reliable software releases.
Drive initiatives to optimize the cost, performance, and security of the infrastructure.
Stay up to date with industry trends and best practices in site reliability engineering and cloud technologies.
Your experience and skills:
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Proven experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role.
Strong knowledge of cloud platforms (AWS, GCP, Azure) and cloud-native technologies.
Proficiency in scripting and automation using languages such as JavaScript (Node.js), Python, Go, or Bash.
Experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Chef, Puppet).
Solid understanding of networking concepts and protocols.
Familiarity with containerization and orchestration technologies (Docker, Kubernetes).
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK stack).
Strong problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
Excellent communication and collaboration skills.
Ability to work in a fast-paced, dynamic environment and manage multiple priorities.
Experience with microservices architecture and related technologies.
Knowledge of database administration and optimization (SQL, NoSQL).
Familiarity with security best practices and compliance standards.
Contributions to open-source projects or active participation in the SRE community.