Senior Site Reliability Engineer (SRE) - Data Center

Hamilton Barnes Associates Limited

Berlin

On-site

EUR 200,000

Full-time

Posted 30+ days ago

Summary

A stealth-mode AI start-up in Berlin is seeking a Senior Site Reliability Engineer to own the reliability and performance of their GPU-powered infrastructure. This role involves designing large-scale GPU clusters, developing automation pipelines, and collaborating with teams to optimize resource scheduling. The ideal candidate has over 7 years of experience in SRE or DevOps, with strong skills in Kubernetes and Linux systems. This position offers an annual salary of €200,000 and equity as part of the benefits package.

Benefits

Equity

Qualifications

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
  • Strong hands-on experience with Kubernetes and Slurm.
  • Deep knowledge of Linux systems and GPU infrastructure.
  • Proficiency in automation using Python, Go, or Bash.
  • Experience with observability stacks and incident response frameworks.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters for training and inference workloads.
  • Build automation pipelines for scaling and monitoring compute resources.
  • Develop observability and auto-healing systems for GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling and data flow.
  • Implement infrastructure-as-code and CI/CD pipelines.
  • Diagnose performance bottlenecks and improve reliability.

Skills

Experience in SRE, DevOps, or Infrastructure Engineering
Hands-on with Kubernetes
Experience with Slurm
Deep knowledge of Linux systems
Proficiency in Python, Go, or Bash
Experience with observability stacks
Familiarity with HPC or AI/ML infrastructures

Job description

Join a stealth-mode hyperscale data center start-up building an AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready to go for experimentation, full-scale model training, or inference. As a Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

If you are interested in this incredible opportunity, get in touch today! You don't want to miss out!

Responsibilities
  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have
  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits
  • Equity

Salary
  • €200,000 gross per year