Site Reliability Engineer (Expert) 0630

Opensource Intelligent Solutions

Gauteng

On-site

ZAR 800,000 - 1,200,000

Full time

14 days ago

Job summary

A leading company is seeking an Expert Site Reliability Engineer (SRE) to enhance service reliability and drive operational excellence. In this role, you'll design and implement scalable infrastructure, architect and maintain monitoring and alerting systems, and lead major incident response. Ideal candidates will have extensive experience with AWS and cloud-native technologies, along with expertise in Docker, Kubernetes, and CI/CD tools.

Qualifications

  • 10+ years of experience in SRE, DevOps, or similar roles.
  • Skilled with cloud-native technologies.
  • Experience with Docker, Kubernetes, CI/CD, and GitOps.

Responsibilities

  • Design and implement scalable infrastructure solutions.
  • Lead major incident response and drive improvements.
  • Mentor team members and influence engineering decisions.

Skills

AWS
Containerization (Docker, Kubernetes)
CI/CD
GitOps (Flux / ArgoCD)
Monitoring tools (Grafana, Prometheus, Loki, Tempo)

Tools

Terraform
PostgreSQL
MongoDB

Job description

Hiring: Expert Site Reliability Engineer (SRE)

Are you passionate about scalability, automation, and reliability?

We're looking for an Expert Site Reliability Engineer (SRE) to join our engineering team, driving operational excellence across our platform.

Why Join Us?
  • Shape our observability strategy and implement automation at scale
  • Collaborate with development teams to enhance service reliability
  • Lead incident response and drive systematic improvements

Qualifications
  • 10+ years in SRE, DevOps, or similar roles
  • Skilled with AWS and cloud-native technologies
  • Experience with Docker, Kubernetes, CI/CD, and GitOps (Flux / ArgoCD)
  • Knowledge of monitoring tools (Grafana, Prometheus, Loki, Tempo)

Bonus Skills
  • Experience with Terraform, PostgreSQL, MongoDB
  • Expertise in performance optimization & cost management
  • Security hardening & compliance implementation

Tech Stack You'll Work With
  • Observability: Grafana Stack, Prometheus
  • Infrastructure: Cloud-native technologies
  • CI/CD: Modern pipeline tools

Key Responsibilities
  • System Reliability: Design and implement scalable infrastructure solutions
  • Observability: Architect and maintain monitoring & alerting systems
  • Automation: Develop automated workflows to reduce manual effort
  • Incident Management: Lead major incident response and drive improvements
  • Technical Leadership: Mentor team members and influence engineering decisions
  • Tool Development: Build internal tools to enhance operational efficiency
  • Best Practices: Establish and enforce SRE methodologies

Ready to take on this challenge? Apply now with your latest, detailed CV!
