Operations Site Reliability Engineer

JR United Kingdom

Bristol

On-site

GBP 50,000 - 90,000

Full time

15 days ago

Job summary

An established industry player is seeking an Operations Site Reliability Engineer to tackle demanding technical challenges in a fast-paced environment. This role is pivotal in ensuring the reliability and performance of critical production services, directly impacting thousands of users. You will work closely with engineering teams and external partners, employing your expertise in Linux systems, cloud platforms, and automation to solve complex problems. If you thrive under pressure and are passionate about innovation, this is your chance to make a significant impact in a rapidly evolving field.

Qualifications

5+ years of experience administering Linux systems.
2+ years of operational experience with AWS or Google Cloud Platform.
Proficiency in at least one scripting language.

Responsibilities

Monitor and ensure the performance of production services.
Drive automation to reduce manual tasks and improve performance.
Create and maintain troubleshooting documentation.

Skills

Linux Administration

AWS

Google Cloud Platform

Automation Platforms

Ansible

Terraform

Scripting (Perl, Shell, Ruby, Python)

Networking Knowledge

Troubleshooting Skills

Education

Degree in Systems Engineering

Degree in Computer Science

Tools

Ansible Tower

Jenkins

Docker

Kubernetes

Operations Site Reliability Engineer, Bristol

Client: Recognition One

Location: Bristol, United Kingdom

Job Category: Other

EU work permit required: Yes

Posted: 08.05.2025

Expiry Date: 22.06.2025

Job Description:

Face a variety of demanding technical challenges across diverse disciplines, working directly with one of our largest and most influential clients to make a significant impact. This unique opportunity will unveil new possibilities in a rapidly evolving field. Are you an expert troubleshooter with a passion for innovation? This could be your chance.

The position is the last line of infrastructure support, far beyond technical customer support. It’s about solving the trickiest problems in the business that directly impact thousands of users within the largest global companies. You’ll often coordinate with product engineering and external partners like Google Cloud, as well as write automation and documentation to enable others to fix recurring problems.

Primary Responsibilities:

Be part of a critical operations team responsible for monitoring, availability, and performance of production services.
Respond to stakeholder requests within agreed timescales or SLOs.
Drive automation to reduce failures and manual tasks and to improve overall application performance and availability.
Perform systems administration activities to ensure smooth operation of applications across multiple platforms.
Coordinate and communicate with impacted stakeholders per the incident management process.
Demonstrate ownership of events and incidents until resolution.
Conduct daily shift handovers to peers and management across multiple geographies.
Support maintenance activities impacting production applications.
Support critical systems handling sensitive and proprietary data.
Create, maintain, and update troubleshooting and support documentation.
Contribute to planning of application/infrastructure releases and configuration changes.
Administer and maintain all production environments.
Patch and upgrade existing applications.
Provide feedback and coaching to upstream teams (internal and vendors) to reduce escalations and improve customer experience.

Professional Experience Required:

A degree in Systems Engineering, Computer Science, or a related field, together with relevant experience, is preferred.
5+ years of experience administering Linux systems.
Hands-on experience with various Linux distributions.
2+ years of operational experience with AWS or Google Cloud Platform.
Experience with automation platforms to automate repetitive tasks.
Familiarity with deployment tools such as Ansible Tower and Jenkins.
Experience deploying to large, global infrastructure.
Proficiency with orchestration/configuration tools like Ansible and Terraform.
Strong knowledge of networking, packet tracing, latency, and throughput issues.
Thorough understanding of HTTP(S), SMTP, TLS/SSL, DNS, LDAP, Kubernetes, and Docker.
Experience in system/application administration in high-availability, large-scale environments.
Proficiency in at least one scripting language such as Perl, shell, Ruby, or Python.
Experience tuning and optimizing monitoring systems.

Personal Skills:

A strong team player, quick to learn new technologies, adaptable, with a focus on delivery.
Excellent troubleshooting and problem-solving skills.
Ability to work calmly under pressure.
Interest in security.
Effective communicator at all organizational levels.

This role includes participation in weekend and holiday on-call support as required.
