Senior Software Reliability Engineer

Once For All UK

England

Remote

GBP 80,000 - 85,000

Full time

Today

Job summary

A leading cloud-based SaaS company in the UK is seeking a Site Reliability Engineer to ensure production reliability of their microservices platform on Azure. The ideal candidate will have extensive experience with SRE practices, Azure, and Kubernetes. This fully remote role includes responsibilities such as defining SLOs and incident response, mentoring engineering teams, and improving system performance. The company offers competitive benefits and a supportive work environment.

Benefits

Private Medical Insurance

Pension

25 days holiday

Home office budget

Qualifications

10 years in SRE or production engineering roles running large-scale systems.
7 years hands-on with Microsoft Azure components and services.
5 years experience in infrastructure as code with Terraform or Bicep.

Responsibilities

Define SLOs and manage incident response.
Architect resilient workloads on Azure and AKS.
Secure the infrastructure with Managed Identity and Key Vault.

Skills

Microsoft Azure

Kubernetes

Infrastructure as Code

Python

Terraform

Tools

AKS

Terraform

Azure DevOps

Docker

Overview

Once For All is a high-growth, cloud-based SaaS subscription business. Our technology helps our customers manage their supply chain governance, risk management, and compliance. We work across the public and private sectors and have over 250k customers across the UK in 20 different sectors, including construction, transport, retail, hospitality, education, facility and property management, manufacturing, and local and central government.

Role Summary

Join our Reliability and Platform group, partnering with 10 Agile Scrum teams to scale and harden a suite of microservices on Microsoft Azure. You will own production reliability for tier-1 services, set and track SLOs, automate operations, and lead incident response to keep our next-generation Supplier Risk Assessment platform fast, secure, and available. This role is fully remote.

Job Responsibilities

Define SLOs, SLIs, and error budgets for critical services.
Architect resilient multi-region and zone-aware workloads on Azure and AKS.
Build infrastructure as code with Terraform or Bicep. Enforce policy as code.
Design safe releases with progressive delivery, automated rollbacks, and feature flags.
Lead on-call rotations, incident response, postmortems, and corrective actions.
Implement end-to-end observability: metrics, logs, traces, dashboards, and alerts.
Plan capacity, tune performance, and optimize cost without impacting reliability.
Secure the stack with Managed Identity, Key Vault, workload identity, and network segmentation.
Establish backup, disaster recovery, and tested restore procedures with clear RPO and RTO.
Mentor engineers and raise reliability standards across product teams.

Candidate Requirements

10 years in SRE, platform, or production-facing engineering roles running large-scale systems.
7 years hands-on with Microsoft Azure: AKS, Front Door or Application Gateway, VNets, Private Link, Key Vault, Monitor, Log Analytics, Application Insights, Service Bus, Storage, and SQL or Cosmos DB.
6 years operating Kubernetes in production, including at least 3 years on AKS (network policies, PodDisruptionBudgets, HPA/VPA, node pools, upgrade playbooks).
5 years infrastructure as code with Terraform or Bicep and Git-based workflows.
5 years designing observability and SLO-based alerting using OpenTelemetry and Kusto queries.
4 years running canary or blue-green deployments in Azure DevOps or GitHub Actions.
Proven incident command experience with measurable MTTR and MTTD improvements.
Strong automation skills in Python or Go, plus Bash and PowerShell.
Solid understanding of security hardening, container image scanning, SBOM, and least privilege.
Experience with performance testing, p95 and p99 tuning, caching, and connection pool strategies.

Nice To Have

Multi-tenant SaaS and data sovereignty patterns.
Service mesh, eBPF, or advanced traffic shaping.
Compliance and audit trail design.
FinOps practice with cost-per-request or cost-per-tenant KPIs.

What We Offer

Health and Wellbeing: Private Medical Insurance or wellness fund, 24/7 Employee Assistance Programme.
Financial Benefits: Pension, Life Assurance (3x salary).
Time Off: 25 days holiday, 8 bank holidays, holiday purchase scheme (5 days), paid and unpaid volunteering days.
Growth and Development: Ongoing CPD, team offsites, and company events.
Everyday Perks: Home office budget, high-spec laptop and peripherals.
Work Setup: Fully remote within UK time zones, optional access to our Basingstoke office.

Tech Stack You Will Use

Azure, AKS, Terraform or Bicep, Azure DevOps or GitHub Actions, Docker, Helm, Service Bus, Storage, SQL Server, Cosmos DB, Key Vault, Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, OpenTelemetry, and feature flagging tools.

Interview Process

Intro and role overview with Talent.
Technical deep dive on Azure and AKS architecture.
Practical exercise: propose SLOs and an alert plan for a sample service, plus a release safety plan.
Culture and collaboration interview with Engineering.

Key Skills

Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting

Employment Type : Full Time

Experience : years

Vacancy : 1

Yearly Salary : GBP 80,000 - 85,000
