Site Reliability Engineer

ZILO

London

Hybrid

GBP 60,000 - 90,000

Full time

3 days ago

Job summary

ZILO is seeking a Senior Site Reliability Engineer for a hybrid role that combines platform engineering with application-level troubleshooting. The ideal candidate will ensure the stability and performance of cloud-native infrastructure, thrive in a DevOps culture, and bring extensive AWS experience along with the ability to read and troubleshoot application code in Java and GoLang. A competitive salary and comprehensive benefits, including enhanced leave and flexible working options, are offered.

Benefits

Enhanced leave - 38 days including 8 UK Public Holidays

Private Health Care including family cover

Life Assurance – 5x salary

Flexible working - work from home and/or in office

Employee Assistance Program

Company Pension (Salary Sacrifice options available)

Access to training and development

Buy and Sell holiday scheme

Global mobility opportunities

Qualifications

5+ years in an SRE, DevOps, or infrastructure role.
Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform.
Strong familiarity with modern observability tooling, and the ability to read code and trace failures in application languages.

Responsibilities

Own patching, upgrades, and maintenance of AWS infrastructure.
Respond to production incidents from user-facing errors to backend service disruptions.
Design and execute Chaos Engineering experiments to validate system behavior.

Skills

AWS

EKS/Kubernetes

Terraform

Kafka

Java

GoLang

Python

React

.NET

Datadog

About:

Step forward into the future of technology with ZILO.

We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.

At ZILO, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.

We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.

About the Role

We’re looking for a Senior Site Reliability Engineer to join our SRE team. This is a hybrid role that blends deep platform engineering with application-level troubleshooting. You’ll be responsible for the stability, performance, and resilience of our cloud-native infrastructure while also being on the front line when issues affect our users and services.

This is a high-impact role ideal for someone who thrives in a modern DevOps culture, cares about both systems uptime and customer experience, and is comfortable working across infrastructure and application layers.

Key Responsibilities

Infrastructure Reliability & Operations

Own patching, upgrades, and maintenance of AWS and EKS infrastructure
Define and implement resilience and failover strategies for microservices and core platforms
Continuously monitor and improve system performance, cost-efficiency, and observability (LGTM stack / Datadog)
Partner with security teams on compliance and vulnerability remediation

Chaos Engineering & Resilience

Design and execute Chaos Engineering experiments
Develop and track SLOs, SLIs, and error budgets for critical systems
Conduct resilience reviews and game days to validate system behavior under failure

Kafka & Eventing

Ensure Kafka clusters are optimally configured for performance and durability
Support producers/consumers and troubleshoot event delivery and retention issues
Monitor and tune partitioning, replication, throughput, and latency

Application-Level Incident Support

Respond to production incidents — from user-facing UI errors to backend service disruptions
Investigate issues across infrastructure, Kubernetes, logs, traces, and service code
Resolve incidents and support root cause analysis (Java and GoLang services)
Contribute to postmortems and reliability engineering initiatives

Who You Are

Essential Experience

5+ years in an SRE, DevOps, or infrastructure role
Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform
Working knowledge of Kafka tuning, monitoring, and operational troubleshooting
Enough familiarity to read code and trace failures in one or more of the following application languages:

Java
GoLang
React
.NET
Python

Solid understanding of modern observability tooling (e.g., Datadog, Loki, Grafana)
Comfortable working on a shared on-call rotation

Benefits

Enhanced leave - 38 days inclusive of 8 UK Public Holidays
Private Health Care including family cover
Life Assurance – 5x salary
Flexible working - work from home and/or in our London office
Employee Assistance Program
Company Pension (Salary Sacrifice options available)
Access to training and development
Buy and Sell holiday scheme
The opportunity for “work from anywhere/global mobility”
