Principal Site Reliability Engineer

TN United Kingdom

London

Hybrid

GBP 80,000 - 110,000

Full time

Yesterday

Job summary

A leading company in organizational design seeks a Principal Site Reliability Engineer to enhance their AWS and Kubernetes infrastructure. This role combines technical leadership with strategic vision, focusing on reliability and collaboration across teams. You'll implement best practices in observability and incident management while mentoring engineers and driving automation efforts.

Benefits

Subsidised Gym Membership

Private Medical Insurance

25 days holiday

Summer Fridays

Employer pension contribution

Season ticket Loan

Cycle to Work Scheme

Annual Discretionary Bonus

Qualifications

Experience leading SRE transformations.
Hands-on expertise with Kubernetes in production environments.
Strong background in observability and CI/CD pipelines.

Responsibilities

Define and enforce SLOs, SLIs, and error budgets.
Guide the team in building automated, self-healing systems.
Drive Infrastructure as Code using Terraform and Kubernetes.

Skills

Kubernetes

AWS core services

Infrastructure as Code

observability

automation

Tools

Terraform

GitOps

Principal Site Reliability Engineer, London

Client:

Orgvue

Location:

London, United Kingdom

EU work permit required:

Yes

Job Reference:

465704a68d8a

Posted:

14.05.2025

Expiry Date:

28.06.2025

Job Description:

Orgvue is an organisational design and planning platform that empowers your business to transform its workforce by understanding the work people do and the skills they have. Our platform connects strategy to structure, providing clarity of vision, so you can build a more adaptable, better performing organisation that thrives in a constantly changing world of work.

The world’s largest and best-known enterprises and consulting firms use Orgvue to visualise and model current and future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.

As a Principal Site Reliability Engineer, you will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will work across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient — even at scale.

This role combines hands-on technical capability with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth. We're looking for someone who has technical expertise, is a great communicator and enjoys collaborating across multiple teams.

As a Principal Site Reliability Engineer, you will:

Define and enforce SLOs, SLIs, and error budgets across critical services
Craft and implement a cloud infrastructure and tooling strategy
Work across the organisation to level up SRE practices
Help implement robust observability (metrics, logs, and traces) using our observability tool
Guide the team in building automated, self-healing systems
Own and evolve our incident response processes, including on-call practices and post-mortem culture
Mentor engineers across the org on best practices in reliability, operational readiness, and scalable infrastructure
Drive Infrastructure as Code (IaC) using Terraform, Kubernetes, CloudFormation and GitOps practices
Collaborate closely with security, DevOps, and software teams to ensure compliance, scalability, and operational excellence
Evaluate and introduce tools, patterns, and practices that improve the performance and reliability of our SaaS platform

Requirements

Desired Skills & Experience:

Demonstrable experience leading SRE transformations
Deep hands-on expertise with Kubernetes (EKS preferred) in production environments
Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
Expert in Infrastructure as Code using tools such as Terraform, with knowledge of GitOps workflows
Strong background in observability: metrics, visualization, logging, and tracing
Understanding of automation, SDLC, CI/CD pipelines, deployment automation, and blue/green or canary releases
Proven experience with incident management, disaster recovery planning, root cause analysis, and post-incident reviews

Benefits:

Hybrid working - 1+ days a week in the London office
Subsidised Gym Membership
Private Medical Insurance (including Dental and Vision) and Life Assurance
25 days holiday (increasing to 30 days at a rate of 1 extra day per year)
Summer Fridays (half-day Fridays for the months of July and August)
Employer pension contribution of 5% of your gross salary, if you contribute a minimum of 3%
Season Ticket Loan
Cycle to Work Scheme
Annual Discretionary Bonus

'Here at Orgvue we promote individualism and a diverse workforce to build on our future success'
