Site Reliability Engineer (SRE) - Application Support

ZILO

City Of London

Hybrid

GBP 100,000 - 125,000

Full time

17 days ago

Job summary

A leading technology firm in London is seeking a Site Reliability Engineer to resolve complex production issues and keep systems reliable. The ideal candidate will have strong debugging skills in at least one of Java, Golang, or Python, and a good understanding of PostgreSQL and Kubernetes. This hybrid role requires regular weekly office attendance and offers a benefits package that includes enhanced leave, private healthcare, and flexible working options.

Benefits

Enhanced leave - 38 days inclusive of 8 UK Public Holidays

Private Health Care including family cover

Life Assurance – 5x salary

Flexible working - work from home and/or in our London Office

Employee Assistance Program

Company Pension (Salary Sacrifice options available)

Access to training and development

Buy and Sell holiday scheme

Work from anywhere/global mobility opportunity

Qualifications

Solid experience with application debugging in Java, Golang, or Python.
Good grasp of PostgreSQL for data fixing and analysis.
Familiarity with Kubernetes and cloud platforms.

Responsibilities

Investigate and resolve incidents raised by clients.
Debug applications and trace issues in Java, Golang, and Python.
Perform data fixes using PostgreSQL.
Patch and maintain Kubernetes clusters and production systems.
Contribute to observability and reliability improvements.

Skills

Application debugging in Java

Application debugging in Golang

Application debugging in Python

PostgreSQL

Kubernetes

AWS

GCP

Azure

Incident management

Observability tools

About

Step forward into the future of technology with ZILO™.

We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.

At ZILO™, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.

We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.

Requirements

We’re looking for a Site Reliability Engineer to join our SRE team — someone who thrives on solving complex production issues, understands how applications behave in the real world, and takes pride in keeping systems reliable and performant.

This is not a platform engineering role. You won’t just be spinning up Kubernetes clusters or building infrastructure — you’ll be deeply involved in understanding our applications, what they do and how they operate, troubleshooting real-world issues, and working directly on improvements that impact our customers every day.

What You’ll Do

Incident Response & Troubleshooting: Investigate and resolve incidents raised by clients, diving into logs, metrics, and application code to identify root causes.
Application Debugging: Work across our core stack — Java, Golang, and Python — to trace and fix issues affecting reliability or performance.
Data Fixes: Perform data investigation and fixes using PostgreSQL.
Operational Excellence: Patch and maintain Kubernetes clusters and other production systems.
SRE Roadmap: Contribute to the continuous improvement of our observability, reliability, and automation initiatives.

This role is hybrid and will require regular weekly attendance at our London office.

Qualifications

Solid experience with application debugging in at least one of: Java, Golang, or Python.
A good grasp of PostgreSQL — enough to run queries, analyse data, and perform safe fixes.
Familiarity with Kubernetes and modern cloud platforms (AWS, GCP, or Azure).
Understanding of incident management and observability tools (Grafana, Prometheus, etc.).
A mindset focused on reliability, quality, and ownership.
