Infrastructure Site Reliability Engineer

ORI

England

On-site

GBP 60,000 - 80,000

Full time

Today

Job summary

A leading AI infrastructure company in the United Kingdom is looking for an experienced Infrastructure Site Reliability Engineer. This role involves deploying and operating resilient infrastructure, optimizing Linux systems, and maintaining observability stacks. Ideal candidates will have over 5 years of experience in performance-intensive environments and strong Linux administration skills. Join a culture that values results and innovation while contributing to AI and HPC workloads.

Benefits

30 days of annual leave
Private medical insurance
Cycle to Work Scheme
Gympass subscription
Participation in company shares program

Qualifications

  • 5+ years of experience in globally scaled environments.
  • Proficiency in system tuning and disk I/O optimization.
  • Hands-on experience with orchestration platforms.

Responsibilities

  • Deploy and operate resilient, scalable infrastructure.
  • Optimize Linux system configuration.
  • Maintain ORI’s observability stack and service operations.

Skills

Expert-level Linux administration
Strong networking fundamentals
Infrastructure scripting and automation
Deep understanding of observability principles
Excellent communication and mentorship skills

Education

Bachelor's or Master's degree in Computer Science

Tools

Prometheus
Grafana
Ansible
Kubernetes
Bash
Job description
Company Overview:

Ori Industries is at the forefront of AI infrastructure, revolutionising the connection between software and hardware for the AI era. Our mission is to empower AI teams with scalable, secure, and efficient infrastructure solutions that support seamless model training, deployment, and scaling.

Job Summary:

We’re looking for an experienced Infrastructure Site Reliability Engineer to run and evolve our infrastructure stack. You’ll contribute across bare-metal, virtualization, and orchestration layers, keeping things stable and secure 24/7, 365 days a year, while mentoring teammates, improving process and automation, and helping translate deep technical concepts for a wide range of collaborators and customers.

What you’ll do:
  • Deploy and operate resilient, scalable infrastructure supporting AI/HPC workloads
  • Optimize Linux system configuration, BIOS/firmware, kernel, and disk subsystem for performance
  • Configure, monitor and manage bare-metal infrastructure using IPMI, Redfish, etc.
  • Build and maintain automation scripts and infrastructure as code to support the platform lifecycle, simplify troubleshooting during incident resolution, and provide tooling for our support organisation
  • Apply ITSM frameworks: Incident, Major Incident, Change Management, and service improvement.
  • Maintain and enhance ORI’s observability stack: Prometheus, Grafana, and custom monitoring integrations
  • Operate and support services in 24x7 production environments, including on-call rotation
  • Contribute to incident postmortems and root cause analysis, document learnings, and automate remediations
  • Mentor junior engineers and act as an operational requirements consultant to other departments
  • Communicate technical decisions clearly to non-technical stakeholders and customers
  • Uphold a culture of: do, document, automate
  • Cross-train with Platform Engineering/Platform SRE to fully support both our infrastructure and platform stacks
  • Cross-train with HPC Engineering, supported by NVIDIA, to enhance our HPC supportability offering
What you bring:
  • 5+ years of proven experience in globally scaled, performance-intensive environments operating to a 24/7 support model
  • Expert-level Linux administration, especially Ubuntu distributions
  • Proficiency in system tuning, disk I/O optimization, and hardware-level performance tweaks
  • Familiarity with out-of-band management and provisioning tools (IPMI, Redfish, PXE, etc.)
  • Strong networking fundamentals: TCP/IP, DNS, DHCP, VLANs, routing, switching
  • Strong experience with infrastructure scripting and automation (Bash, Python, Ansible)
  • Deep understanding of observability principles and tools (Prometheus, Grafana)
  • Hands‑on experience operating orchestration platforms (Kubernetes, MAAS, Tinkerbell)
  • Strong grasp of ITSM and service operation best practices
  • Excellent communication and mentorship skills
  • Comfortable interfacing with internal stakeholders and external customers
  • Bonus: Knowledge of HPC workloads and GPU‑based infrastructure
  • Bonus: Experience with InfiniBand networks and HPC performance tuning
Nice to have:
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience
  • LPIC Certifications
  • ITIL Foundation level qualification or equivalent experience
How you work:
  • You approach problems with a systems mindset - balancing practical execution with long-term scalability
  • You elevate the team, setting high standards for technical quality and engineering excellence.
  • You hold yourself and others accountable - giving direct feedback and expecting the same
  • You take initiative, owning challenges end‑to‑end and proactively driving solutions.
  • You invest in others, mentoring to build both capability and confidence.
  • You communicate clearly - translating complexity into clarity across engineering and business audiences
Why should you join us?

What sets us apart is our blend of modern technology, competitive benefits, and an open, welcoming work culture that enables our people to thrive.

Here are just some of the great things you can expect from us:

  • 30 days of annual leave: we value your peace of mind. With 30 days off (excluding public holidays) and access to mental health resources, we make sure you're as strong mentally as you are professionally.
  • A culture that emphasises results over hierarchy, process & ego: we place great emphasis on the quality, ingenuity and creativity of work.
  • Open communication, regular feedback: we value smooth collaboration, direct and actionable feedback, and believe that leading with empathy and a growth mindset makes us better together.
  • Learning Time: we all have dedicated learning time to focus on new skills, projects or interests that lie outside our day-to-day jobs.
  • Health & Wellbeing: we want everyone to feel healthy and happy, so we offer private medical insurance via Bupa.
  • Cycle to Work Scheme: we're committed to building a sustainable business, so we encourage cycling to work.
  • Gympass subscription to a variety of gyms and wellbeing apps
  • Participation in the company shares program
  • Enhanced parental pay & leave
Diversity, Equity, Inclusion and Belonging

We are an equal opportunity employer and we strive to reduce unconscious bias throughout our hiring process. All applicants will be considered for employment without regard to ethnicity, religion, sexual orientation, gender identity, family or parental status, national origin, veteran status, neurodiversity or disability status. To ensure our recruitment processes give all applicants an equal opportunity to succeed, we encourage you to let us know if there are any adjustments that we can make.