Senior Site Reliability Engineer

Avance Consulting

London

On-site

GBP 60,000 - 90,000

Full time

Yesterday

Job summary

A leading company is seeking a Senior Site Reliability Engineer in London to manage cloud infrastructure and improve deployment pipelines. You will work on ensuring system reliability and automated processes, collaborating with various teams for scalable solutions. The role requires solid DevOps skills with a focus on AWS services and incident management.

Qualifications

Hands-on experience with AWS services at a DevOps Engineer level.
Strong background in enterprise observability tooling like Prometheus and Grafana.
Proficient in scripting languages such as Python, Go, or Bash.

Responsibilities

Deploy and monitor AWS services ensuring high availability and security.
Handle incidents with root cause analysis and preventive measures.
Optimize CI/CD pipelines for automated deployments.

Skills

AWS services

Incident management

Observability tooling

Python

GitHub

CI/CD

Client:

Avance Consulting

Location:

London, United Kingdom

Job Category:

Other

EU work permit required:

Yes

Job Reference:

7cd2e309a4a5

Posted:

25.06.2025

Expiry Date:

09.08.2025

Job Description:

The Role

As a DevOps Engineer, you will play a critical role in managing cloud infrastructure, ensuring the reliability of production systems, and improving end-to-end deployment pipelines. This role combines deep operational responsibilities with a strong focus on automation, observability, and continuous improvement. You will be responsible for maintaining high system availability, enabling rapid delivery through CI/CD, and supporting development teams with robust infrastructure and tooling. A key part of the role includes proactive monitoring using Prometheus, Grafana, and Splunk, as well as participating in on-call rotations to respond to live incidents. Collaboration across engineering, security, and product teams is essential to build scalable and resilient systems.

Your responsibilities:

1. Deploy, configure, and monitor AWS services ensuring high availability, scalability, and security.

2. Respond to and resolve infrastructure and service incidents with root cause analysis and preventive measures.

3. Handle change requests, track recurring issues, and work on long-term fixes to improve system stability.

4. Implement and maintain observability solutions using Prometheus, Grafana, and Splunk.

5. Write PromQL queries for custom monitoring dashboards, alerting, and diagnostics.

6. Manage and optimize CI/CD pipelines for automated testing, deployment, and rollback strategies.

7. Develop and maintain automation scripts in Python, Bash, Go, or SQL for routine infrastructure tasks.

8. Utilize Git-based workflows for infrastructure changes, version control, and automated deployments.

9. Operate, troubleshoot, and optimize Kubernetes clusters and containerized workloads.

10. Participate in a rotating on-call schedule to ensure 24/7 availability of production systems.

Your Profile

Essential skills/knowledge/experience:

1. Working knowledge of, and prior hands-on experience with, AWS services at the DevOps Engineer level

2. Incident, change, and problem management experience. This role is heavily operations-oriented and includes on-call requirements

3. Strong background in setting up and operating enterprise observability tooling, specifically Prometheus, Grafana, and Splunk, including use of PromQL

4. Proficient in one or more of Python, Go, Bash, or SQL

5. Familiar with GitHub / GitOps / container orchestration / Kubernetes operations

6. Working experience of configuration and deployment management with CI/CD

Desirable skills/knowledge/experience: (As applicable)

1. Hands-on experience with Terraform or CloudFormation for infrastructure provisioning and automation.

2. Strong knowledge of Splunk for log analysis and troubleshooting.

3. Strong problem-solving skills and analytical thinking.
