Operations Site Reliability Engineer

ZipRecruiter

Bristol

On-site

GBP 50,000 - 70,000

Full time

11 days ago

Job summary

A leading company is seeking an expert troubleshooter to tackle demanding technical challenges within a critical operations function. This role involves supporting sensitive systems, driving automation, and ensuring high availability of applications across multiple platforms. Join a team that values innovation and problem-solving in a fast-paced environment.

Qualifications

5+ years of experience administering Linux systems.
2+ years of operational experience with AWS or Google Cloud.
Experience with automation platforms.

Responsibilities

Monitor availability and performance of production services.
Drive automation to reduce manual tasks.
Coordinate with stakeholders during incidents.

Skills

Troubleshooting

Problem Solving

Communication

Education

Degree in Systems Engineering

Degree in Computer Science

Tools

Ansible

Terraform

Jenkins

Docker

Kubernetes

Job Description

Face a variety of demanding technical challenges across diverse disciplines, working directly with one of our largest and most influential clients to make a significant impact. This unique opportunity will unveil new possibilities in a rapidly evolving field. Are you an expert troubleshooter with a passion for innovation? This could be your chance.

The position is the last line of infrastructure support, well beyond technical customer support. It’s all about solving the trickiest problems in the business, problems that directly impact thousands of users within the largest global companies. You’ll often coordinate with product engineering and external partners like Google Cloud, and write automation and documentation that allow others to fix problems that appear more than once.

The primary responsibilities include:

Form part of a critical operations function that is responsible for the monitoring, availability and performance of production services.
Respond to stakeholder requests within agreed timescales or SLOs.
Drive automation to reduce failures and manual tasks, thereby improving overall application performance and availability.
Perform systems administration activities to ensure the smooth operation of applications across multiple platforms.
Coordinate and communicate with impacted stakeholders as per the incident management process.
Demonstrate ownership of events and incidents through to restoration.
Perform daily shift handovers to peers and management across multiple geographies.
Support maintenance activities that impact production applications.
Support critical systems that handle sensitive and proprietary data.
Create, maintain and update work instructions for troubleshooting and supporting applications.
Contribute to the planning of application/infrastructure releases and configuration changes.
Provide input to administering and maintaining all production environments.
Patch and upgrade existing applications.
Provide feedback and coaching to upstream teams (both internal and vendors) to reduce escalations and continually improve the overall experience for customers.

Professional Experience Required

A degree in Systems Engineering, Computer Science or a related field, with related experience
5+ years of experience administering Linux systems
Strong hands-on experience with a variety of Linux distributions
2+ years of operational experience working with Amazon Web Services or Google Cloud Platform
Experience of working with an automation platform to automate repetitive actions and reduce manual effort
Familiarity with deployment tools such as Ansible Tower and Jenkins
Experience in carrying out large deployments to global infrastructure
Proficient with orchestration/configuration tools such as Ansible and Terraform
Strong working knowledge of networking, packet tracing, and understanding latency and throughput in order to pinpoint or resolve application issues.
Thorough knowledge of HTTP(S), SMTP, TLS/SSL, DNS, LDAP, Kubernetes and Docker containers
Experience of system/application administration in distributed, customer-facing, high-availability and large-scale environments
Experienced and confident in at least one scripting language, such as Perl, shell, Ruby or Python.
Experience of tuning and optimising monitoring systems

Personal Experience Required

A strong team player with the ability to grasp new technologies and adapt to changes in methodology, with a focus on delivery
Extensive troubleshooting and problem-solving skills with respect to application technologies
Ability to remain calm and work well under pressure
A keen interest and desire to work within the security arena
Ability to communicate effectively at all levels up to senior management

This role requires participation in on-call support on weekends and holidays as and when required.
