Site Reliability Engineer

IGT Solutions

Greater London

On-site

GBP 80,000 - 100,000

Full time

Today

Job summary

A leading company in IT services is seeking a Site Reliability Engineer to ensure high product performance and stability. The role involves managing operational incidents, implementing automation, and collaborating with development teams to enhance efficiency. Ideal candidates will have extensive experience in IT operations, with a strong focus on automation and incident management. This is an excellent opportunity for professionals looking to advance their careers in a dynamic environment.

Qualifications

8+ years of experience in IT operations or service management.
Experience with CI/CD pipelines and automation.
Strong background in managing high-availability systems.

Responsibilities

Define, build, and maintain support systems for high availability.
Implement automation for system provisioning and monitoring.
Conduct thorough problem investigations and root cause analyses.

Skills

Automation

Incident Management

Root Cause Analysis

Collaboration

Education

Bachelor's degree in Computer Science

Master’s degree (preferred)

Tools

Linux Administration

Kubernetes

AWS

Azure

Google Cloud

Responsible for the proactive support of products to maintain high, continuously improving product performance. Responsible for identifying and resolving the root causes of operational incidents, implementing solutions to improve stability and prevent recurrence. Manages the creation and maintenance of the event catalog to trigger events and develops both manual remediation approaches and automated workflows to resolve alerts. Oversees the deployment of IT services and solutions, ensuring successful integration with minimal disruption. Focuses on operational automation and integration to enhance efficiency and collaboration between development and operations within service operations.

Key Responsibilities

Site Reliability Engineer

Define, build, and maintain support systems to ensure high availability and performance.
Handle complex cases for the Operations team.
Build events to add to the event catalog for the relevant product or application.
Implement automation for system provisioning, self-healing, auto recovery, deployment, and monitoring.
Perform incident response and root cause analysis for critical system failures.
Monitor system performance and establish service-level indicators (SLIs) and objectives (SLOs).
Collaborate with development and operations to integrate reliability best practices, including moving to zero downtime architecture.
Proactively identify and remediate performance issues.
Work closely with Product, Software and Infrastructure Engineering, and Service Support architects on the productization of new products.
Ensure Operations readiness to support new products.
Coordinate with internal and external stakeholders to gather feedback for continual service improvement of in-scope products, and drive the improvement plan through to successful closure.
Accountable for the high availability and performance of in-scope products.
Conduct thorough problem investigations and root cause analyses (RCA) to diagnose recurring incidents and service disruptions.
Coordinate with incident management teams and operations experts, and collaborate with Service Operations and Engineering teams, to develop and implement permanent solutions.
Monitor the effectiveness of problem resolution activities, provide regular reports on problem management activities, and ensure continuous improvement.

Event Management

Define and maintain an event catalog, specifying active events, thresholds, and relevant remediation, and optimize it for efficiency.
Develop event response protocols, provide training to teams, and ensure quick and efficient handling of incidents.
Collaborate with stakeholders to define events, ensure coverage across the Service Operations, and drive improvements based on post-event reviews and feedback.

Deployment Management

Own the quality of new release deployment for the Service Operations, ensuring a clear process and responsibilities are assigned for smooth implementation.
Develop and maintain deployment schedules, conduct operational readiness assessments, and manage deployment risk assessments to ensure service stability.
Oversee the execution of deployment plans, coordinate resources and processes with delivery and lifecycle engineering, communicate with stakeholders, and continuously improve deployment processes based on their feedback.

DevOps/NetOps Management

Manage continuous integration and deployment (CI/CD) pipelines, ensuring smooth integration between development and operational teams.
Automate operational processes, monitor system performance, and resolve issues related to automation scripts to increase efficiency.
Implement and manage infrastructure as code, provide ongoing support for automation tools, and continuously improve DevOps practices.

Education and Professional Qualifications

Educational Background

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Advanced degree (Master’s or equivalent) is often preferred for senior positions.

Qualifications

Relevant certifications, such as Linux administration or Certified Kubernetes Administrator (CKA).
Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies (e.g., Certified DevOps Professional).

Experience

8+ years of experience in IT operations, service management, or infrastructure management, including roles such as Site Reliability Engineer or DevOps Lead.
Proven experience in managing high-availability systems and ensuring operational reliability.
Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
Hands-on experience with CI/CD pipelines, automation, system performance monitoring, and the implementation of infrastructure as code.
Strong background in collaborating with cross-functional teams (development, operations, engineering, etc.) to improve operational processes and service delivery.
Experience in managing deployments, risk assessments, and optimizing event and problem management processes.
Familiarity with cloud technologies, containerization, and scalable architecture, including experience with zero-downtime deployment strategies.

Disclaimer: IGT Solutions provides equal employment opportunities to all individuals based on job-related qualifications and ability to perform a job, without regard to age, gender, gender identity, sexual orientation, race, color, religion, creed, national origin, disability, genetic information, veteran status, citizenship or marital status, and to maintain a non-discriminatory environment free from intimidation, harassment or bias based upon these groups.

Seniority level

Mid-Senior level

Employment type

Contract

Job function

Information Technology

Industries

IT Services and IT Consulting
