We are seeking a skilled and motivated Site Reliability Engineering Officer (SRE) to join our Infrastructure team. This role focuses on operating and automating our cloud and containerized platforms and on ensuring their reliability, scalability, and security.
You will play a key role in maintaining high service availability and driving operational excellence across our cloud environments.
Key Responsibilities
- Operate, monitor, and maintain cloud-native infrastructure across OCI, GCP, and private cloud environments, ensuring high availability and scalability.
- Deploy, manage, and support containerized workloads using Kubernetes and Docker.
- Implement and manage GitOps practices using GitLab CI/CD for automated and auditable deployments.
- Build, manage, and maintain Infrastructure as Code (IaC) using Terraform, ensuring compliance with best practices.
- Automate operational tasks and configuration management using Ansible.
- Implement and maintain monitoring, logging, and observability solutions using Prometheus, the ELK Stack (Elasticsearch, Logstash, Kibana), and alerting frameworks.
- Develop and maintain operational runbooks, automation scripts, and technical documentation.
- Participate in incident response, root cause analysis, and post-incident reviews.
- Collaborate closely with development, security, and platform teams to improve service reliability and efficiency.
- Apply SRE principles including SLIs, SLOs, and error budgets to continuously improve system reliability.
- Enforce cloud, Kubernetes, and security best practices in line with governance and compliance requirements.
Required Skills & Qualifications
- 2+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering roles.
- Hands-on experience with OCI, GCP, and private cloud environments.
- Strong experience with Kubernetes and Docker.
- Proficiency in GitLab CI/CD, Git, and GitOps workflows.
- Solid experience using Terraform for infrastructure provisioning.
- Strong automation and configuration management skills using Ansible.
- Experience with monitoring and observability tools such as Prometheus and ELK.
- Proficiency in scripting languages such as Bash and Python.
- Good understanding of cloud and Kubernetes security best practices.
Preferred Qualifications
- Experience with hybrid-cloud or multi-cloud architectures.
- Practical knowledge of SRE practices (SLIs, SLOs, error budgets, incident management).
- Experience with code quality and static analysis tools such as SonarQube.
- Relevant certifications such as CKA, CKAD, Terraform Associate, or OCI or GCP cloud certifications.