Lead / Head of SRE

CENTRIC SOFTWARE

London

On-site

GBP 100,000 - 130,000

Full time

Yesterday

Job summary

A global technology company in London is looking for an experienced Lead / Head of Site Reliability Engineering (SRE). This role involves establishing and leading a global SRE organization, building practices from the ground up, and working closely with engineering and product teams to ensure scalable and resilient systems. The ideal candidate has 7+ years in SRE, proven leadership skills, and hands-on experience with AWS and infrastructure automation.

Qualifications

  • 7+ years of experience in Site Reliability Engineering or related fields.
  • Proven experience building SRE functions from scratch or scaling them.
  • Hands-on expertise with AWS services and cloud-native architectures.
  • Strong background in infrastructure-as-code and automation.

Responsibilities

  • Lead and scale a global SRE organization.
  • Define and enforce SLAs, SLOs, and SLIs.
  • Build incident management processes and champion a blameless culture.
  • Collaborate with Product, Engineering, and Security teams.

Skills

Site Reliability Engineering
AWS services
Infrastructure as Code
Leadership
Communication
Coding/Scripting

Tools

Terraform
CloudFormation
CI/CD tooling

Job description

Job Title: Lead / Head of SRE

Overview

We are seeking an experienced Lead / Head of Site Reliability Engineering (SRE) to establish, scale, and lead a global SRE organization for our greenfield platform. This is a unique opportunity to build SRE practices, culture, and tooling from the ground up while partnering closely with engineering, product, and security teams to ensure our systems are scalable, resilient, and secure.

This role is both strategic and hands-on: you will not only define and execute the SRE vision but also be deeply involved in supporting infrastructure and applications in AWS.

Responsibilities

1. Strategic Leadership & Team Building
  • Build, scale, and lead a global SRE organization across multiple time zones.
  • Hire, mentor, and develop top SRE talent, fostering a culture of operational excellence, collaboration, and continuous improvement.
  • Define and own the SRE vision, roadmap, and success metrics in alignment with company goals.

2. Operational Excellence & Process Design
  • Establish and document all SRE processes, runbooks, and playbooks from scratch for a greenfield environment.
  • Define and enforce SLAs, SLOs, and SLIs, ensuring measurable reliability and availability targets.
  • Build and implement incident management processes, including on-call rotations, escalation paths, and postmortem practices.
  • Champion a blameless culture and lead root cause analyses to drive systemic improvements.

3. Hands-On Technical Leadership
  • Lead application support efforts: monitor, troubleshoot, and resolve production issues in collaboration with engineering teams.
  • Contribute to the development of tooling, scripts, and automation to eliminate toil and streamline operations.
  • Build and maintain observability stacks (metrics, logging, tracing) and ensure that alerting is actionable.
  • Drive cost optimization, performance tuning, and capacity planning for infrastructure and applications.

4. Cross-Functional Collaboration
  • Partner with Product, Engineering, and Security teams to build resiliency into every stage of the development lifecycle.
  • Act as the primary advocate for reliability and operational efficiency within the organization.
  • Report on key reliability metrics and provide high-level insights into system health.

Qualifications

  • 7+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Operational Support, including at least 3 years in a leadership role.

  • Proven experience building SRE or Operational Support functions from scratch or scaling them.

  • Hands-on expertise with AWS services (EC2, ECS/EKS, Lambda, VPC, RDS, S3, IAM, CloudWatch, etc.) and cloud-native architectures.

  • Strong background in infrastructure-as-code (Terraform, CloudFormation), CI/CD tooling, and automation.

  • Proficiency in application support and development practices, including debugging, performance tuning, and collaborating with software engineers.

  • Deep understanding of reliability engineering principles, incident response, observability, and security best practices.

  • Strong coding/scripting skills in languages like Python, Go, or Bash.

  • Excellent leadership, communication, and stakeholder management skills.

  • Track record of defining SLAs/SLOs, improving MTTR, and driving automation initiatives.

  • Passionate about mentorship, process improvement, and building high-performing teams.

Centric Software provides equal employment opportunities to all qualified applicants without regard to race, sex, sexual orientation, gender identity, national origin, color, age, religion, protected veteran status, disability status, or genetic information.
