Senior Site Reliability Engineer

Once For All Limited

Basingstoke

Remote

GBP 80,000 - 100,000

Full time

Yesterday

Job summary

A cloud-based SaaS company is seeking a Senior Site Reliability Engineer in Basingstoke. You will own production reliability for tier-1 services, automate operations, and lead incident response. The role requires extensive experience in Azure and Kubernetes, with the opportunity to work fully remotely within the UK. This position offers numerous benefits including health insurance and a home office budget.

Benefits

Private Medical Insurance
25 days holiday + 8 bank holidays
Home office budget

Qualifications

  • 10+ years in SRE, platform, or production-facing engineering roles running large-scale systems.
  • 7+ years hands-on with Microsoft Azure, including AKS.
  • 6+ years operating Kubernetes in production.
  • 5+ years infrastructure as code with Terraform or Bicep.
  • Strong automation skills in Python or Go.

Responsibilities

  • Define SLOs, SLIs, and error budgets for critical services.
  • Architect resilient multi-region workloads on Azure.
  • Build infrastructure as code with Terraform or Bicep.
  • Implement end-to-end observability: metrics, logs, traces.

Skills

10+ years in SRE or platform engineering
Hands-on with Microsoft Azure
Operating Kubernetes in production
Infrastructure as code with Terraform
Designing observability and SLO-based alerting
Strong automation skills in Python or Go
Security hardening knowledge

Tools

Microsoft Azure
Terraform
Kubernetes (AKS)
Azure DevOps

Job description

Once For All is a high-growth, cloud-based SaaS subscription business. Our technology helps customers manage supply chain governance, risk management, and compliance. We work across the public and private sectors and have over 250k customers in the UK, spanning 20 sectors including construction, transport, retail, hospitality, education, facility and property management, manufacturing, and local and central government.

Role Summary

Join our Reliability and Platform group, partnering with 10 Agile Scrum teams to scale and harden a suite of microservices on Microsoft Azure. You will own production reliability for tier-1 services, set and track SLOs, automate operations, and lead incident response to keep our next-generation Supplier Risk Assessment platform fast, secure, and available. This role is fully remote.

Job Responsibilities
  • Define SLOs, SLIs, and error budgets for critical services.
  • Architect resilient multi-region and zone-aware workloads on Azure and AKS.
  • Build infrastructure as code with Terraform or Bicep. Enforce policy as code.
  • Design safe releases with progressive delivery, automated rollbacks, and feature flags.
  • Lead on-call rotations, incident response, postmortems, and corrective actions.
  • Implement end-to-end observability: metrics, logs, traces, dashboards, alerts.
  • Plan capacity, tune performance, and optimize cost without impacting reliability.
  • Secure the stack with Managed Identity, Key Vault, workload identity, and network segmentation.
  • Establish backup, disaster recovery, and tested restore procedures with clear RPO and RTO.
  • Mentor engineers and raise reliability standards across product teams.

Candidate Requirements
  • 10+ years in SRE, platform, or production-facing engineering roles running large-scale systems.
  • 7+ years hands-on with Microsoft Azure: AKS, Front Door or Application Gateway, VNets, Private Link, Key Vault, Monitor, Log Analytics, Application Insights, Service Bus, Storage, SQL or Cosmos DB.
  • 6+ years operating Kubernetes in production, including at least 3 years on AKS (network policies, PodDisruptionBudgets, HPA/VPA, node pools, upgrade playbooks).
  • 5+ years infrastructure as code with Terraform or Bicep and Git-based workflows.
  • 5+ years designing observability and SLO-based alerting using OpenTelemetry and Kusto queries.
  • 4+ years running canary or blue-green deployments in Azure DevOps or GitHub Actions.
  • Proven incident command experience with measurable MTTR and MTTD improvements.
  • Strong automation skills in Python or Go, plus Bash and PowerShell.
  • Solid understanding of security hardening, container image scanning, SBOM, and least privilege.
  • Experience with performance testing, p95 and p99 tuning, caching and connection pool strategies.

Nice To Have
  • Multi-tenant SaaS and data sovereignty patterns.
  • Service mesh, eBPF, or advanced traffic shaping.
  • Compliance and audit trail design.
  • FinOps practice with cost per request or per tenant KPIs.

What We Offer
  • Health and Wellbeing: Private Medical Insurance or wellness fund, 24/7 Employee Assistance Programme.
  • Financial Benefits: Pension, Life Assurance (3x salary).
  • Time Off: 25 days holiday + 8 bank holidays, holiday purchase scheme (+5 days), paid and unpaid volunteering days.
  • Growth and Development: Ongoing CPD, team offsites, and company events.
  • Everyday Perks: Home office budget, high-spec laptop and peripherals.
  • Work Setup: Fully remote within UK time zones, optional access to our Basingstoke office.

Tech Stack You Will Use

Azure, AKS, Terraform or Bicep, Azure DevOps or GitHub Actions, Docker, Helm, Service Bus, Storage, SQL Server, Cosmos DB, Key Vault, Azure Monitor, Log Analytics, Application Insights, Prometheus, Grafana, OpenTelemetry, and feature flagging tools.

Interview Process
  • Intro and role overview with Talent.
  • Technical deep dive on Azure and AKS architecture.
  • Practical exercise: propose SLOs and an alert plan for a sample service, plus a release safety plan.
  • Culture and collaboration interview with Engineering.