What you will do:
- Design, implement, and maintain scalable, reliable, and secure infrastructure on cloud platforms (currently GCP).
- Develop and automate monitoring, alerting, and incident response processes to maintain high service availability.
- Collaborate with development teams to enhance the reliability and performance of applications through best practices and automation.
- Manage and resolve software incidents and system failures by performing root cause analysis, implementing timely fixes and corrective measures, and conducting post-mortems to prevent recurrence.
- Develop and maintain comprehensive documentation for infrastructure, processes, and procedures.
- Participate in on-call rotations to provide 24/7 support for critical systems and respond to incidents promptly.
- Continuously improve system observability and monitoring using tools such as Prometheus, Grafana, and Datadog.
- Implement and manage CI/CD pipelines to streamline the deployment process and ensure rapid, reliable software releases.
- Drive initiatives to optimize the cost, performance, and security of the infrastructure.
- Stay up to date with industry trends and best practices in site reliability engineering and cloud technologies.
Your experience and skills:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role.
- Strong knowledge of cloud platforms (AWS, GCP, Azure) and cloud-native technologies.
- Proficiency in scripting and automation using languages such as JavaScript (Node.js), Python, Go, or Bash.
- Experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Chef, Puppet).
- Solid understanding of networking concepts and protocols.
- Familiarity with containerization and orchestration technologies (Docker, Kubernetes).
- Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, ELK stack).
- Strong problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Excellent communication and collaboration skills.
- Ability to work in a fast-paced, dynamic environment and manage multiple priorities.
- Experience with microservices architecture and related technologies.
- Knowledge of database administration and optimization (SQL, NoSQL).
- Familiarity with security best practices and compliance standards.
- Contributions to open-source projects or active participation in the SRE community.