Site Reliability Engineer

ELLIOTT MOSS CONSULTING PTE. LTD.

Penarth

Hybrid

GBP 60,000 - 90,000

Full time

Yesterday

Job summary

A technology consulting firm in the United Kingdom is seeking a highly skilled Site Reliability Engineer (SRE) to enhance its enterprise observability and reliability platforms. This role involves ensuring the performance and scalability of cloud-native applications and leading the integration of reliability practices. The ideal candidate should have strong hands-on experience with tools such as Prometheus and Kubernetes. This position supports a proactive reliability engineering culture across teams, and while the workweek is Monday to Friday, some weekend support may be necessary.

Qualifications

Strong hands-on experience with monitoring and observability tools.
Solid understanding of SRE principles including SLIs and SLOs.
Experience supporting production distributed systems.

Responsibilities

Define and implement SLIs, SLOs, and error budgets for applications.
Drive reliability decision-making using performance metrics.
Lead incident response and post-incident reviews.
Own and operate open-source observability platforms.
Enhance observability solutions for scalability and resilience.
Deploy and manage workloads on Kubernetes platforms.
Operate logging and alerting platforms.
Create dashboards for service health and reliability.

Skills

Prometheus

Grafana

Elasticsearch

Kibana

OpenTelemetry

Jaeger

Zipkin

Kubernetes

OpenShift

Linux OS troubleshooting

CI/CD pipelines

Automation

Job Description

We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.

This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities

Reliability Engineering & SRE Practices:
Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
Proactively identify reliability risks and performance bottlenecks and drive remediation.
Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.

Observability Platform Ownership:
Own and operate open-source-based observability platforms covering metrics, logging, and distributed tracing.
Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservices traces.

Kubernetes & OpenShift Reliability:
Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
Improve platform reliability through automation, self-healing, and standardized deployment patterns.
Partner with developers to implement application instrumentation and reliability best practices.

Logging, Alerting & Incident Response:
Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
Design and maintain actionable alerting aligned to SLOs and business impact.
Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.

Dashboards & Service Visibility:
Deploy and administer visualization tools such as Grafana and Kibana.
Create standardized, reusable dashboards for service health, reliability, and capacity planning.
Implement and manage RBAC across observability platforms.

Infrastructure, Security & Automation:
Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
Build and maintain CI/CD pipelines for observability and reliability tooling.
Extend pipelines to support multiple environments and regions with consistency and repeatability.

Reliability Culture & Enablement:
Champion an SRE and observability-first culture across engineering teams.
Coach teams on golden signals, service health modeling, and reliability trade-offs.
Enable teams to move from reactive monitoring to proactive reliability engineering.

Required Skills & Experience

Core Technical Skills:
Strong hands-on experience with Prometheus and Grafana; Elasticsearch and Kibana (cluster operations, ILM, tuning); OpenTelemetry, Jaeger, and Zipkin; Kubernetes and OpenShift; Linux OS troubleshooting; and CI/CD pipelines and automation.
Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
Experience supporting production, highly available, distributed systems.
Working Hours: Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.
