Platform Site Reliability Engineer at AI infrastructure platform startup

Jack & Jill/External ATS

Remote

GBP 70,000 - 90,000

Full time

Yesterday

Job summary

A fast-growing AI infrastructure platform startup is looking for a Platform Site Reliability Engineer to own and improve its platform. The role involves deploying and optimizing Kubernetes for AI workloads and ensuring system stability, performance, and security in a 24/7 production environment. The ideal candidate will have extensive experience in performance-critical environments and strong Linux expertise. This position offers a chance to work at the forefront of AI infrastructure in a well-funded startup.

Qualifications

  • 5+ years’ experience in performance-critical SRE environments with 24/7 operations.
  • 3+ years’ hands-on experience deploying and running orchestration platforms.
  • Expert-level Linux administration, especially Ubuntu.

Responsibilities

  • Deploy, operate, and scale Kubernetes clusters for AI-centric workloads.
  • Optimize Linux systems and build automation for platform lifecycle management.
  • Maintain observability and reliability in 24/7 production environments.

Skills

Kubernetes expertise
Linux administration
System tuning skills
Networking fundamentals

Tools

Prometheus
Grafana

Job description

This is a job that we are recruiting for on behalf of one of our customers.

To apply, speak to Jack. He's an AI agent that sends you unmissable jobs and then helps you ace the interview. He'll make sure you are considered for this role, and help you find others if you ask.

Platform Site Reliability Engineer

Company Description

A fast-growing AI infrastructure platform startup building the backbone for next-generation AI workloads, connecting software and hardware at scale in a highly technical, mission-critical environment.

Job Description

As a Platform Site Reliability Engineer, you will own and evolve a highly available AI infrastructure platform, ensuring stability, security, and performance across bare-metal, virtualization, and orchestration layers. You’ll deploy and optimize Kubernetes for AI workloads, drive automation, manage incidents, and mentor others while supporting a 24/7 production environment.

Location

Gloucestershire, UK

Why this role is remarkable

  • Work at the forefront of AI infrastructure, bridging hardware and software for cutting-edge AI workloads
  • Operate and scale complex bare-metal, virtualized, and Kubernetes-based platforms
  • Make a meaningful impact on reliability, automation, and team capability within a well-funded startup

What you will do

  • Deploy, operate, and scale Kubernetes clusters supporting AI-centric workloads
  • Optimize Linux systems and build automation for platform lifecycle management and incident response
  • Maintain observability and reliability using tools such as Prometheus and Grafana in 24/7 production environments

The ideal candidate

  • 5+ years’ experience in globally scaled, performance-critical SRE environments with 24/7 operations
  • 3+ years’ hands-on experience deploying and running orchestration platforms, with deep Kubernetes expertise
  • Expert-level Linux administration (especially Ubuntu), strong system tuning skills, and solid networking fundamentals

How to Apply

To apply for this job speak to Jack, our AI recruiter.

Step 1. Visit our website.
Step 2. Click 'Speak with Jack'.
Step 3. Log in with your LinkedIn profile.
Step 4. Talk to Jack for 20 minutes so he can understand your experience and ambitions.
Step 5. If the hiring manager would like to meet you, Jack will make the introduction.
