Lead Site Reliability Engineer

loveholidays

London

On-site

GBP 50,000 - 90,000

Full time

9 days ago

Job summary

An innovative online travel agency is seeking a Site Reliability Engineer to enhance their SRE practices and ensure system reliability. In this pivotal role, you will promote best practices, improve performance testing, and develop tools to support a high-load environment. The company values observability and scalability, leveraging cutting-edge cloud technologies. With a commitment to employee development and well-being, this position offers a supportive environment for personal and professional growth. Join a forward-thinking team dedicated to transforming the travel industry through technology.

Benefits

Company pension contributions at 5%
Training budget
Discounted holidays
25 days of annual leave plus 8 public holidays
Option to buy or sell annual leave
Cycle-to-work scheme
Season ticket loans
Eye care vouchers

Qualifications

  • Experience in Site Reliability Engineering and best practices.
  • Strong understanding of observability and performance principles.

Responsibilities

  • Enhance SRE practices and improve reliability KPIs.
  • Develop tools with reliability and performance in mind.
  • Perform low-level debugging and troubleshooting.

Skills

Site Reliability Engineering
Performance Testing
Incident Management
Observability
Debugging

Education

Degree in Computer Science or related field

Tools

Prometheus
Grafana
Loki
Tempo
Java Flight Recorder
Go’s pprof
Linkerd

Job description

We are a rapidly growing online travel agency with technology at the core of our success. In 2022, we sent millions of people on their dream holidays.

Handling a million visitors daily, our platform supports over 100 services, processing 8,000 requests per second, with a p95 search latency of 150ms. Our observability infrastructure captures and processes 1TB of logs daily and 350,000 metric samples per second.

We set ourselves apart through open source contributions: open sourcing internal tools, contributing to public repositories, and sponsoring conferences.

Responsibilities

As our first Site Reliability Engineer, you will help evolve SRE practices such as incident management, blameless postmortems, SLOs, and error budgets. Your role will involve building reliable, performant, auto-scalable, and highly available systems with support from the existing Platform Infrastructure team.

  • Enhance SRE practices across teams.
  • Improve reliability KPIs of the platform.
  • Balance reliability with feature delivery using SLOs and error budgets.

Our engineering teams manage the entire lifecycle of services from initial development to high-load production operation. Your responsibility is to enable engineering teams to succeed in operations, not to run their services for them.

What you'll be working on

  • Kick-start our SRE function by promoting reliability best practices and processes.
  • Identify slow code paths in critical applications using tools like Java Flight Recorder or Go’s pprof.
  • Develop or modify tools and applications with reliability and performance in mind.
  • Ensure systems can handle ten times the current load by improving performance testing.
  • Reduce mean time to discovery and recovery through enhanced observability and alerting.

We focus heavily on observability, continuously evolving our monitoring and alerting stack centered around the Grafana ecosystem (Prometheus, Grafana, Loki, Tempo). Our service mesh (Linkerd) provides uniform observability of all production services at 10-second intervals.

Performance and scalability are fundamental to our development process, achieved by combining core computer science principles with cutting-edge cloud technologies.

  • Perform low-level debugging and troubleshooting.

What we'll give back to you

  • Company pension contributions at 5%.
  • Training budget to support your ongoing learning and development.
  • Discounted holidays for you, your family, and friends.
  • 25 days of annual leave plus 8 public holidays, increasing by 1 day every two years up to 30 days.
  • Option to buy or sell annual leave.
  • Cycle-to-work scheme, season ticket loans, and eye care vouchers.

About the company

loveholidays offers a personalized approach to searching for your next getaway, allowing you to customize your holiday with maximum flexibility. Rest assured, your holiday is ATOL protected. We offer various payment options to ensure a seamless booking experience.
