Site Reliability Engineer (Expert) 0630

Opensource Intelligent Solutions

Gauteng

On-site

ZAR 800,000 - 1,200,000

Full time

14 days ago

Job summary

A leading company is seeking an Expert Site Reliability Engineer (SRE) to enhance service reliability and drive operational excellence. In this role, you'll design and implement scalable infrastructure, architect monitoring and alerting systems, and lead major incident response. Ideal candidates will have extensive experience with AWS and cloud-native technologies, alongside expertise in Docker, Kubernetes, and CI/CD tools.

Qualifications

10+ years of experience in SRE, DevOps, or similar roles.
Skilled with cloud-native technologies.
Experience with Docker, Kubernetes, CI/CD, and GitOps.

Responsibilities

Design and implement scalable infrastructure solutions.
Lead major incident response and drive improvements.
Mentor team members and influence engineering decisions.

Skills

AWS

Containerization (Docker, Kubernetes)

CI/CD

GitOps (Flux / ArgoCD)

Monitoring tools (Grafana, Prometheus, Loki, Tempo)

Tools

Terraform

PostgreSQL

MongoDB

Hiring: Expert Site Reliability Engineer (SRE)

Are you passionate about scalability, automation, and reliability?

We're looking for an Expert Site Reliability Engineer (SRE) to join our engineering team, driving operational excellence across our platform.

Why Join Us?

Shape our observability strategy and implement automation at scale
Collaborate with development teams to enhance service reliability
Lead incident response and drive systematic improvements

Qualifications

10+ years in SRE, DevOps, or similar roles
Skilled with AWS and cloud-native technologies
Experience with Docker, Kubernetes, CI/CD, and GitOps (Flux / ArgoCD)
Knowledge of monitoring tools (Grafana, Prometheus, Loki, Tempo)

Bonus Skills

Experience with Terraform, PostgreSQL, MongoDB
Expertise in performance optimization & cost management
Security hardening & compliance implementation

Tech Stack You'll Work With

Observability: Grafana Stack, Prometheus
Infrastructure: Cloud-native technologies
CI/CD: Modern pipeline tools

Key Responsibilities

System Reliability: Design and implement scalable infrastructure solutions
Observability: Architect and maintain monitoring & alerting systems
Automation: Develop automated workflows to reduce manual effort
Incident Management: Lead major incident response and drive improvements
Technical Leadership: Mentor team members and influence engineering decisions
Tool Development: Build internal tools to enhance operational efficiency
Best Practices: Establish and enforce SRE methodologies

Ready to take on this challenge? Apply now with your latest, detailed CV!
