Service Reliability Engineer

Universal Music

Greater London

On-site

GBP 60,000 - 80,000

Full time

Today

Job summary

A leading global music company in Greater London is seeking a Systems Reliability Engineer to ensure the availability and performance of critical services. The role involves designing efficient systems, automating operational tasks, and enhancing CI/CD pipelines. The ideal candidate will have a strong background in systems administration and programming, particularly with AWS. A collaborative environment awaits, and diversity of thought is valued.

Qualifications

Strong background in systems administration (Linux/Windows) in a large-scale environment.
Proficiency in at least one programming language (Python, Go, Java).
Hands-on experience with a major cloud platform, preferably AWS.
Solid understanding of networking, containers (Docker/Kubernetes), and Infrastructure as Code.

Responsibilities

Design, build, and maintain the performance of critical services.
Develop and maintain monitoring and observability systems.
Monitor infrastructure capacity and provide suggestions for improvement.
Drive automation of operational tasks and maintain CI/CD pipelines.
Participate in an on-call rotation for incident management.

Skills

Systems administration
Linux
Windows
Python
Java
AWS
Monitoring tools
Networking
Containers

Education

Bachelor's degree in an IT-related field

Tools

AWS CloudWatch
Docker
Kubernetes
Terraform
Ansible
Prometheus
Grafana
Datadog
Splunk
Dynatrace

Music is Universal

It's the passionate and dedicated team at Universal Music who help make us the world's leading music company. From A&R to finance, legal to digital, sales to marketing, Universal Music is the place to grow and develop your career within a truly commercial and innovative business that leads in everything it does.

Everyone is welcome to apply for our roles, and we are determined to ensure that no applicant or employee receives less favourable treatment because of gender, race, disability, sexual orientation, religion, belief, age, marital status, background, pregnancy or caring responsibilities. We also recognise the importance of diversity of thought within our teams and are fully committed to embracing the talents of people with autism, dyslexia, ADHD and other forms of neurocognitive variation.

We will always seek to make appropriate adjustments to recruitment, workplaces and work processes to be fully inclusive to people with different needs and working styles. If you need us to make any reasonable adjustments for you from application onwards, including alternatives to the online form, or to disclose a neurocognitive condition, please email

Job Summary

We are UMG, the Universal Music Group. We are the world's leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.

Job Functions

Key Responsibilities

System Reliability & Performance

Design, build and maintain the availability, scalability and performance of critical services.

Develop and maintain robust monitoring, alerting and observability systems (e.g. using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.

Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.

Automation & Efficiency

Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments and scaling.

Create and maintain scripts and custom code to support and enhance our operational toolset.

Support and optimize CI/CD pipelines to improve deployment speed and reliability.

Incident Management & Collaboration

Participate in an on-call rotation to troubleshoot and mitigate production incidents.

Lead post-incident reviews and root cause analyses to implement lasting solutions.

Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.

Job Requirements

Required Experience & Skills

A strong background in systems administration (Linux/Windows) in a large-scale environment.

Proficiency in at least one programming language (e.g. Python, Go, Java).

Hands-on experience with a major cloud platform (AWS, GCP or Azure), with a strong preference for AWS.

Solid understanding of networking, containers (Docker, Kubernetes) and Infrastructure as Code (e.g. Terraform, Ansible).

Experience with modern monitoring and observability tools (e.g. Prometheus, Grafana, Datadog, Splunk, Dynatrace).

Proven analytical and problem-solving abilities, with experience in a high-pressure environment.

Excellent communication skills and the ability to foster a collaborative team environment.

Preferred Experience & Skills

Bachelor's degree in an IT-related field.

Experience managing large-scale distributed systems for a global organization.

Familiarity with IT governance standards like ITIL.

Direct experience with ServiceNow for IT service management.

Knowledge of chaos engineering, resilience testing and advanced capacity planning.

Just So You Know

The company presents this job description as a guide to the major areas and duties for which the jobholder is accountable. However, the business operates in an environment that demands change, and the jobholder's specific responsibilities and activities will vary and develop. Therefore, the job description should be seen as indicative and not as a permanent, definitive and exhaustive statement.

Job Category

Universal Music Group

Key Skills

Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting

Employment Type: Full-Time

Experience: years

Vacancy: 1
