Site Reliability Engineer

bet365

Stoke-on-Trent

Hybrid

GBP 50,000 - 80,000

Full time

30+ days ago

Job summary

An established industry player is seeking a skilled Site Reliability Engineer to enhance system reliability and performance. In this role, you will leverage your software engineering skills to monitor and improve the health of critical systems, ensuring operational efficiency. Collaborating across functions, you will implement best practices in observability and reliability, driving initiatives that contribute to a culture of continuous improvement. This position offers the opportunity to work in a hybrid environment, making a significant impact on the organization's service performance and user satisfaction.

Qualifications

Excellent knowledge of Site Reliability Engineering principles.
Experience with Infrastructure as Code (IaC) and automation tools.

Responsibilities

Writing code to enhance reliability and observability of services.
Developing tools for effective management of systems.
Mentoring colleagues in new technologies and practices.

Skills

Site Reliability Engineering principles

Service Level Indicators (SLI)

Service Level Objectives (SLO)

Observability tools (Splunk, New Relic, Grafana, PagerDuty)

Infrastructure as Code (IaC)

Shell scripting

Automation and orchestration

Tools

Ansible

Terraform

Grafana

Splunk

New Relic

PagerDuty

bet365, Stoke-on-Trent, England, United Kingdom

We are seeking a Site Reliability Engineer who will enhance system reliability, observability, and performance through a strong engineering approach, and who will assist with incident resolution and best practices.

You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance, and availability of critical systems, directly impacting operational efficiency.

Using your engineering expertise, you will implement solutions that enhance reliability, including instrumenting services with tools such as OpenTelemetry, improving logging practices, and developing features for maintainability. You will also help engineer tools and automation for effective service management.

Collaboration is key: you will work across multiple functions to integrate reliability and observability best practices into the software development lifecycle. By supporting governance standards set by the central teams, you will foster a culture where these principles are integral to development. Your contributions will ensure our systems meet user demands and enhance overall service performance.

This role is eligible for inclusion in the Company’s hybrid working from home policy.

Preferred skills and experience

Excellent knowledge of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction.
Knowledge of contemporary observability tools, techniques, and best practices including Splunk, New Relic, Grafana, and PagerDuty.
Knowledge and experience of modern software development techniques and lifecycles.
Experience with Infrastructure as Code (IaC), automation, and orchestration tools such as Ansible and Terraform.
Prior experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the business.
Keen interest in industry trends, particularly Platform Engineering.
Proficiency in shell scripting for automation and system management tasks.

Main Responsibilities

Writing and contributing to code that enhances the reliability and observability of services, including telemetry, operational APIs, and tooling.
Developing and maintaining tools that facilitate effective management of our systems, ensuring they are operationally efficient and resilient.
Working with automation and orchestration platforms to automate manual activities and reduce toil.
Building sophisticated dashboards using telemetry data and dashboarding technologies like Grafana, Splunk, and New Relic.
Maintaining and administering existing monitoring and analytic toolsets.
Mentoring colleagues in the use of new technologies or practices.
Actively participating in live incident resolution and post-mortem analysis, providing effective remediation strategies to improve overall system health and prevent future issues.
Driving initiatives to enhance system reliability and observability, contributing to a culture of continuous improvement.
Collaborating with the central Site Reliability Engineering and Observability teams to establish and uphold standards for reliability and observability, assisting teams in adhering to these practices.
Working with IT Operations, providing and supporting the use of critical tooling to enable increasing levels of value to the business.

By applying to us, you agree to share your Personal Data in accordance with our Recruitment Privacy Policy - http://www.bet365careers.com/privacypolicy.pdf.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology

Industries

Gambling Facilities and Casinos
