Site Reliability Engineer

ZILO

London

Hybrid

GBP 60,000 - 90,000

Full time

3 days ago

Job summary

ZILO is seeking a Senior Site Reliability Engineer for a hybrid role that encompasses platform engineering and troubleshooting. The ideal candidate will ensure the stability and performance of cloud-native infrastructure, thrive in a DevOps culture, and possess extensive experience with AWS and application coding in Java and GoLang. A competitive salary and comprehensive benefits, including enhanced leave and flexible working options, are offered.

Benefits

Enhanced leave - 38 days including 8 UK Public Holidays
Private Health Care including family cover
Life Assurance – 5x salary
Flexible working - work from home and/or in office
Employee Assistance Program
Company Pension (Salary Sacrifice options available)
Access to training and development
Buy and Sell holiday scheme
Global mobility opportunities

Qualifications

  • 5+ years in an SRE, DevOps, or infrastructure role.
  • Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform.
  • Strong familiarity with modern observability tooling and the ability to read code and trace failures.

Responsibilities

  • Own patching, upgrades, and maintenance of AWS infrastructure.
  • Respond to production incidents from user-facing errors to backend service disruptions.
  • Design and execute Chaos Engineering experiments to validate system behavior.

Skills

AWS
EKS/Kubernetes
Terraform
Kafka
Java
GoLang
Python
React
.NET
Datadog

Job description

About:

Step forward into the future of technology with ZILO.

We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.

At ZILO, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.

We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.

About the Role

We’re looking for a Senior Site Reliability Engineer to join our SRE team. This is a hybrid role that blends deep platform engineering with application-level troubleshooting. You’ll be responsible for the stability, performance, and resilience of our cloud-native infrastructure while also being on the front line when issues affect our users and services.

This is a high-impact role ideal for someone who thrives in a modern DevOps culture, cares about both systems uptime and customer experience, and is comfortable working across infrastructure and application layers.

Key Responsibilities
Infrastructure Reliability & Operations
  • Own patching, upgrades, and maintenance of AWS and EKS infrastructure
  • Define and implement resilience and failover strategies for microservices and core platforms
  • Continuously monitor and improve system performance, cost-efficiency, and observability (LGTM stack / Datadog)
  • Partner with security teams on compliance and vulnerability remediation
Chaos Engineering & Resilience
  • Design and execute Chaos Engineering experiments
  • Develop and track SLOs, SLIs, and error budgets for critical systems
  • Conduct resilience reviews and game days to validate system behavior under failure
Kafka & Eventing
  • Ensure Kafka clusters are optimally configured for performance and durability
  • Support producers/consumers and troubleshoot event delivery and retention issues
  • Monitor and tune partitioning, replication, throughput, and latency
Application-Level Incident Support
  • Respond to production incidents — from user-facing UI errors to backend service disruptions
  • Investigate issues across infrastructure, Kubernetes, logs, traces, and service code
  • Resolve incidents and support root cause analysis (Java and GoLang services)
  • Contribute to postmortems and reliability engineering initiatives
Who You Are
Essential Experience
  • 5+ years in an SRE, DevOps, or infrastructure role
  • Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform
  • Working knowledge of Kafka tuning, monitoring, and operational troubleshooting
  • Ability to read code and trace failures in one or more of the following application languages and frameworks:
    • Java
    • GoLang
    • React
    • .NET
    • Python
  • Solid understanding of modern observability tooling (e.g., Datadog, Loki, Grafana)
  • Comfortable working on a shared on-call rotation
Benefits
  • Enhanced leave - 38 days inclusive of 8 UK Public Holidays
  • Private Health Care including family cover
  • Life Assurance – 5x salary
  • Flexible working - work from home and/or in our London Office
  • Employee Assistance Program
  • Company Pension (Salary Sacrifice options available)
  • Access to training and development
  • Buy and Sell holiday scheme
  • The opportunity for “work from anywhere/global mobility”