Observability Platform Engineer (SRE Focus)

YouLend

City of London

On-site

GBP 70,000 - 90,000

Full time

Job summary

A leading fintech company in Central London is seeking a dedicated Observability Engineer to improve system reliability and developer experience. Responsibilities include building monitoring systems for Kubernetes-based services, managing SLOs, and introducing chaos engineering practices. The ideal candidate will have strong experience with Datadog and Terraform. Join us for competitive benefits and an award-winning workplace culture.

Overview

The Mission: We’re building a world-class Observability function, and we’re looking for someone who lives for uptime, meaningful alerts, and elegant dashboards. If you’ve ever been on call, silenced a noisy monitor, or traced a ghost bug across microservices outside core hours, we want to hear from you!

This isn’t a generic “Platform Engineer” role. You’ll be laser-focused on observability, reliability, and developer empowerment, working closely with teams so that we know not just when things break, but why.

Responsibilities

Designing and scaling on-call systems that engineers don’t dread being part of.
Building out Datadog monitoring, alerting, dashboards, and log pipelines for our Kubernetes-based environments.
Defining and managing SLOs, SLIs, and error budgets — and helping teams stick to them.
Creating scorecards and software catalogs so engineers know what’s healthy, what’s broken, and who owns what.
Training and enabling dev teams to own their own observability, alerts, and incident response.
Introducing chaos engineering practices (yes, we want to break things… on purpose).
Driving a culture of reliability, with incident reviews, shared learnings, and transparency.

Qualifications

Have production experience with observability tools (especially Datadog) in cloud-native environments.
Have set up monitoring and alerting across Kubernetes services.
Have built or scaled on-call systems in startups or large-scale environments.
Know how to reduce alert fatigue and love a good MTTR chart.
Have experience with infrastructure as code (Terraform preferred).
Believe that great developer experience includes clear visibility and ownership.
Are curious about — or already practicing — chaos engineering.

Bonus Points

Experience with OpenTelemetry, Fluent Bit, or similar.
Familiarity with service catalog tooling (e.g., Backstage).
Comfortable running or facilitating game days or failure drills.
Prior involvement in setting up scorecards for service health.

What This Role Isn’t

This is not a traditional platform or infra role.
You won’t be spending your days tweaking CI/CD pipelines or setting up VPCs.
We’re looking for someone obsessed with how systems behave in production — not just how they’re deployed.

The Stack

Cloud: AWS (EKS, Lambda, etc.)
Observability: Datadog, OpenTelemetry
Infra as Code: Terraform
Orchestration: Kubernetes (EKS)
Logging: Fluent Bit, FireLens
Catalogs/Scorecards: Backstage (or custom)

Apply Now

If this sounds like your kind of role, we’d love to hear from you.

Drop us a message with your CV and a note about the coolest monitoring setup or incident resolution you’ve ever worked on.

Why join YouLend?

Award-Winning Workplace: YouLend has been recognised by The Sunday Times as one of the “Best Places to Work 2024 & 2025” for being a supportive, diverse, and rewarding workplace.
Award-Winning Fintech: YouLend has been recognised as a “Top 250 Fintech Worldwide” company by CNBC.

We offer a comprehensive benefits package that includes:

Stock Options
Private Medical Insurance via Vitality
EAP with Health Assured
Enhanced Maternity and Paternity Leave
Modern and sophisticated office space in Central London
Free gym in the office building in Holborn
Subsidised lunch via Feedr
Deliveroo allowance when working late in the office
Monthly in-office masseuse
Team and Company Socials
Football Power League / Squash Club

At YouLend, we champion diversity and embrace equal opportunity employment practices. Our hiring, transfer, and promotion decisions are exclusively based on qualifications, merit, and business requirements, free from any discrimination based on race, gender, age, disability, religion, nationality, or any other protected basis under applicable law.