Site Reliability Engineer

Orgvue Limited

London

Hybrid

GBP 70,000 - 110,000

Full time

Yesterday

Job summary

An innovative company is seeking a Principal Site Reliability Engineer to lead the scaling of its AWS and Kubernetes infrastructure. This pivotal role combines technical expertise with strategic vision, fostering a culture of reliability and resilience. You will collaborate with cross-functional teams to enhance operational excellence, mentor engineers, and drive Infrastructure as Code initiatives. With a focus on observability and automation, you will play a critical role in keeping the systems robust and adaptable in a fast-paced environment. Join a forward-thinking organisation that values individualism and diversity, and make a significant impact on its engineering foundation.

Benefits

Hybrid working

Wellbeing initiatives

Subsidised gym membership

Private medical insurance

25 days holiday

Summer Fridays

Employer pension contribution

Season ticket loan

Cycle to Work Scheme

Annual discretionary bonus

Qualifications

Strong experience with AWS services and Kubernetes in production.
Hands-on expertise in Infrastructure as Code and observability practices.

Responsibilities

Define SLOs and enhance SRE practices across the organisation.
Develop cloud infrastructure strategies and implement observability metrics.

Skills

SRE transformations

Kubernetes (EKS preferred)

AWS core services

Infrastructure as Code (Terraform)

Observability practices

Automation and CI/CD

Incident management

Tools

Terraform

CloudFormation

GitOps

Orgvue is an organisational design and planning platform that empowers your business to transform its workforce by understanding the work people do and the skills they have. Our platform connects strategy to structure, providing clarity of vision, so you can build a more adaptable, better performing organisation that thrives in a constantly changing world of work.

The world’s largest and best-known enterprises and consulting firms use Orgvue to visualise and model current and future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.

Role: Principal Site Reliability Engineer

You will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will collaborate across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale.

This role combines hands-on technical skills with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth. We seek someone with technical expertise, excellent communication skills, and a collaborative spirit.

Responsibilities:

Define and enforce SLOs, SLIs, and error budgets across critical services
Develop and implement cloud infrastructure and tooling strategies
Enhance SRE practices across the organisation
Implement robust observability metrics, logs, and traces using our observability tools
Guide the team in building automated, self-healing systems
Own and evolve incident response processes, including on-call practices and post-mortem culture
Mentor engineers on reliability, operational readiness, and scalable infrastructure best practices
Drive Infrastructure as Code (IaC) initiatives using Terraform, Kubernetes, CloudFormation, and GitOps practices
Collaborate with security, DevOps, and software teams to ensure compliance and operational excellence
Evaluate and adopt tools and practices to improve platform performance and reliability

Desired Skills & Experience:

Experience leading SRE transformations
Hands-on expertise with Kubernetes (EKS preferred) in production
Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
Proficiency in Infrastructure as Code using Terraform and knowledge of GitOps workflows
Strong background in observability: metrics, visualization, logging, tracing
Understanding of automation, CI/CD pipelines, deployment automation, and release strategies
Experience with incident management, disaster recovery, root cause analysis, and post-incident reviews

Additional Benefits:

Hybrid working: at least one day a week in the London office
Wellbeing initiatives: coaching, fitness sessions, webinars, Wellbeing day
Subsidised gym membership
Private medical insurance, dental, vision, and life assurance
25 days holiday (increasing to 30)
Summer Fridays (half-days in July and August)
Employer pension contribution of 5% (if you contribute at least 3%)
Season ticket loan
Cycle to Work Scheme
Annual discretionary bonus

Here at Orgvue, we promote individualism and a diverse workforce to build our future success.
