Site Reliability Engineer

ELLIOTT MOSS CONSULTING PTE. LTD.

Penarth

Hybrid

GBP 60,000 - 90,000

Full time

Yesterday

Job summary

A technology consulting firm in the United Kingdom is seeking a highly skilled Site Reliability Engineer (SRE) to enhance its enterprise observability and reliability platforms. The role involves ensuring the performance and scalability of cloud-native applications and leading the integration of reliability practices. The ideal candidate has strong hands-on experience with tools such as Prometheus and Kubernetes. The position supports a proactive reliability engineering culture across teams. The workweek is Monday to Friday, though some weekend support may be necessary.

Qualifications

  • Strong hands-on experience with monitoring and observability tools.
  • Solid understanding of SRE principles including SLIs and SLOs.
  • Experience supporting production distributed systems.

Responsibilities

  • Define and implement SLIs, SLOs, and error budgets for applications.
  • Drive reliability decision-making using performance metrics.
  • Lead incident response and post-incident reviews.
  • Own and operate open-source observability platforms.
  • Enhance observability solutions for scalability and resilience.
  • Deploy and manage workloads on Kubernetes platforms.
  • Operate logging and alerting platforms.
  • Create dashboards for service health and reliability.

Skills

Prometheus
Grafana
Elasticsearch
Kibana
OpenTelemetry
Jaeger
Zipkin
Kubernetes
OpenShift
Linux OS troubleshooting
CI/CD pipelines
Automation
Job Description

We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.

This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.

The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.

Key Responsibilities
Reliability Engineering & SRE Practices:
  • Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
  • Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
  • Proactively identify reliability risks and performance bottlenecks and drive remediation.
  • Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
Observability Platform Ownership:
  • Own and operate open-source-based observability platforms covering metrics, logging, and distributed tracing.
  • Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
  • Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
  • Operate distributed tracing platforms such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting microservice traces.
Kubernetes & OpenShift Reliability:
  • Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
  • Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
  • Improve platform reliability through automation, self-healing, and standardized deployment patterns.
  • Partner with developers to implement application instrumentation and reliability best practices.
Logging, Alerting & Incident Response:
  • Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
  • Design and maintain actionable alerting aligned to SLOs and business impact.
  • Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
  • Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
Dashboards & Service Visibility:
  • Deploy and administer visualization tools such as Grafana and Kibana.
  • Create standardized, reusable dashboards for service health, reliability, and capacity planning.
  • Implement and manage RBAC across observability platforms.
Infrastructure, Security & Automation:
  • Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
  • Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
  • Build and maintain CI/CD pipelines for observability and reliability tooling.
  • Extend pipelines to support multiple environments and regions with consistency and repeatability.
Reliability Culture & Enablement:
  • Champion an SRE and observability-first culture across engineering teams.
  • Coach teams on golden signals, service health modeling, and reliability trade-offs.
  • Enable teams to move from reactive monitoring to proactive reliability engineering.
Required Skills & Experience
  • Core Technical Skills: strong hands-on experience with Prometheus and Grafana; Elasticsearch and Kibana (cluster operations, ILM, tuning); OpenTelemetry, Jaeger, and Zipkin; Kubernetes and OpenShift; Linux OS troubleshooting; and CI/CD pipelines and automation.
  • Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
  • Experience supporting production, highly available, distributed systems.
  • Working Hours: Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.