Site Reliability Engineer (AWS)

Insight

Birmingham

On-site

GBP 60,000 - 80,000

Full time

Today

Job summary

A leading technology firm in Birmingham seeks a Senior Site Reliability Engineer (SRE) to ensure platform reliability and scalability. The role involves incident management, automation of operations, and collaboration with cross-functional teams to improve performance. Candidates should have strong technical skills, a proactive approach, and experience with tools like Terraform and Dynatrace. The position also offers the opportunity to mentor junior engineers and influence platform improvements through technical excellence.

Qualifications

Strong experience in managing platform performance and reliability.
Experience with incident management and post-incident reviews.
Familiarity with observability tools and automation frameworks.

Responsibilities

Ensure performance and reliability SLAs are met.
Act as the primary responder for critical incidents.
Build and maintain automation tools for operational tasks.

Skills

Technical acumen

Proactive mindset

Ability to influence improvements

Automation experience

Infrastructure as Code (IaC)

CI/CD practices

Tools

Pulumi

Terraform

Dynatrace

Prometheus

Grafana

As a Senior Site Reliability Engineer, you will be a hands‑on technical expert driving the reliability, scalability, and availability of the engineering platform. Working collaboratively across teams, you will develop and implement automated solutions, address operational challenges, and ensure the platform’s robust performance. This role demands strong technical acumen, a proactive mindset, and the ability to influence platform improvements through technical excellence.

Job Responsibilities

Platform Stability and Reliability

Ensure the platform meets performance, availability, and reliability SLAs.
Proactively identify and resolve performance bottlenecks and risks in production environments.
Maintain and improve monitoring, logging, and alerting frameworks to detect and prevent incidents.

Incident Management

Act as the primary responder for critical incidents, ensuring rapid mitigation and resolution.
Conduct post‑incident reviews and implement corrective actions to prevent recurrence.
Develop and maintain detailed runbooks and playbooks for operational excellence.

Automation and Efficiency

Build and maintain tools to automate routine tasks, such as deployments, scaling, and failover.
Contribute to CI/CD pipeline improvements for faster and more reliable software delivery.
Write and maintain Infrastructure as Code (IaC) using tools like Pulumi or Terraform to provision and manage resources.

Collaboration and Mentorship

Collaborate with SRE, CI/CD, Developer Experience, and Templates teams to improve the platform’s reliability and usability.
Mentor junior engineers by sharing knowledge and best practices in SRE and operational excellence.
Partner with developers to integrate observability and reliability into their applications.

Observability and Metrics

Implement and optimize observability tools like Dynatrace, Prometheus, and Grafana.
Define ...
