Site Reliability Engineering Lead

Lulalend

Cape Town

On-site

ZAR 850 000 - 1 100 000

Full time

Today

Job summary

A leading financial technology firm is looking for an experienced Site Reliability Engineering Lead to manage their SRE team in Cape Town. This role involves overseeing Azure-based infrastructure, ensuring its reliability and security, and driving automation efforts. The ideal candidate should have over 5 years of experience in a cloud infrastructure role, with at least 2 years in a leadership capacity. A strong background in Azure services and tools is essential for success in this position.

Qualifications

5+ years of experience in a senior SRE, DevOps, or Cloud Infrastructure role.
At least 2 years of formal people management and leadership experience.
Strong experience with Azure services like Web Applications and Functions.

Responsibilities

Lead and mentor the SRE team for high performance.
Manage the team's performance with clear goals and regular reviews.
Define and manage the SRE technical roadmap with cross-functional teams.

Skills

Leadership and mentoring

Incident response management

Troubleshooting and problem-solving

Automation with PowerShell and Azure CLI

Cloud infrastructure management

Azure services knowledge

Education

Matric certificate or equivalent

Tools

Azure Monitor

Grafana

Jira

OpsGenie

ARM templates

We are seeking an experienced Site Reliability Engineering Lead to lead, mentor, and grow our SRE team. The ideal candidate will have a deep understanding of Microsoft Azure, cloud computing, and distributed systems.

As the SRE Lead, you will be responsible for the overall strategy and execution of our SRE function. You will guide your team to monitor, maintain, and improve our Azure-based infrastructure and applications, ensuring their reliability, scalability, and security.

KEY RESPONSIBILITIES:

Lead, mentor, and develop a high-performing SRE team, fostering a culture of ownership, collaboration, and continuous improvement.
Manage the team's performance, including setting clear goals, conducting regular 1:1s, and supporting career development.
Collaborate with the software engineering manager on the recruitment process to grow the SRE team, ensuring a high bar for technical skill and cultural fit.
Own and manage the 24/7 on‑call rotation and incident response process, acting as a key escalation point and driving effective root cause analysis (RCA) and remediation plans.
Define and drive the SRE technical roadmap, partnering with Engineers, DevOps, and SecOps to build and manage highly available, scalable, and resilient architectures on Azure.
Oversee the platform's monitoring and alerting strategy, guiding the team to build a holistic view of infrastructure and application performance using tools like Azure Monitor.
Champion automation by directing the team's development of scripts and tools to streamline deployment and management of Azure services.
Drive platform optimisation by analysing performance metrics and evaluating new Azure features and services to improve workflows.
Ensure the security of the Azure infrastructure by enforcing security policies and best practices in partnership with the SecOps team.
Foster a culture of delivery, continuous improvement and innovation within the SRE team, encouraging experimentation.

THE EXPERIENCE WE’RE LOOKING FOR:

Matric certificate or equivalent.
5+ years of experience in a senior SRE, DevOps, or Cloud Infrastructure role, with deep knowledge of maintaining Azure infrastructure.
At least 2 years of formal people management and leadership experience.
Demonstrable experience leading incident response and root cause analysis.
Strong understanding of Azure services such as Web Applications, Functions, and Application Gateways.
Strong experience with automation tools such as PowerShell, Azure CLI, and ARM templates.
Deep experience with monitoring and logging tools such as Azure Monitor, Log Analytics, Application Insights, Logic Apps, and Grafana or similar.
Excellent troubleshooting, problem‑solving, and strategic planning skills.
Strong familiarity with DevOps practices and tools such as Jira and OpsGenie.
Monitoring & Observability: Azure Monitor, Log Analytics and Grafana.
Operations & Incident Management: Jira, Sentinel and OpsGenie.
