Member of Technical Staff - Training Cluster Engineer

Black Forest Labs Inc.

Freiburg im Breisgau

On-site

EUR 70,000 - 90,000

Full-time

30+ days ago

Summary

A leading technology startup in Freiburg im Breisgau is seeking an experienced ML Cluster Manager to design and maintain large-scale training clusters using SLURM. This role requires expertise in managing GPU clusters, Docker, and scripting skills in Python or Bash. The ideal candidate will collaborate with research teams and establish best practices for infrastructure security and performance.

Qualifications

  • Production experience managing SLURM clusters including job scheduling policies.
  • Proven track record managing GPU clusters including driver management.
  • Strong scripting skills (Python, Bash) for infrastructure management.

Responsibilities

  • Design, deploy, and maintain large-scale ML training clusters.
  • Implement monitoring systems with automated failure detection.
  • Collaborate with ML research teams to align infrastructure capabilities with their computational requirements.

Skills

Managing SLURM clusters at scale
Docker experience
Managing GPU clusters
Scripting skills (Python, Bash)

Tools

Docker
Kubernetes
Enroot/Pyxis
InfiniBand

Job Description

Overview

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.

Role & Responsibilities
  • Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
  • Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
  • Partner with cloud and colocation providers to ensure cluster availability and performance
  • Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
  • Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
  • Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
Required Experience
  • Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments (for illustration, a hypothetical SLURM/Pyxis job sketch follows this list)
  • Proven track record managing GPU clusters, including driver management and DCGM monitoring
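
For illustration only, and not part of the requirements: a minimal sketch of the kind of containerized SLURM job this role supports, assuming a cluster with the Pyxis/Enroot plugin installed; the job name, container image, mount paths, and training script are placeholders.

  #!/bin/bash
  # Hypothetical multi-node training job; all names below are placeholders.
  #SBATCH --job-name=train-example
  #SBATCH --nodes=4
  #SBATCH --gpus-per-node=8
  #SBATCH --time=48:00:00

  # Pyxis adds the --container-* options to srun; Enroot pulls and runs the image.
  srun --container-image=nvcr.io/nvidia/pytorch:24.01-py3 \
       --container-mounts=/data:/data,/checkpoints:/checkpoints \
       python train.py --config /data/config.yaml
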
Preferred Qualifications
  • Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
  • Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
  • Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training (see the environment sketch after this list)
  • Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
  • Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
  • Experience running hybrid training/inference infrastructure with appropriate resource isolation
  • Strong scripting skills (Python, Bash) and infrastructure-as-code experience
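
Also purely illustrative: a hedged sketch of NCCL environment settings sometimes used on InfiniBand clusters for multi-node training; the adapter pattern and interface name are site-specific placeholders.

  # Hypothetical NCCL tuning for an InfiniBand fabric; values depend on the actual hardware.
  export NCCL_DEBUG=INFO            # log communicator setup and chosen transports
  export NCCL_IB_HCA=mlx5           # placeholder pattern selecting the InfiniBand adapters
  export NCCL_SOCKET_IFNAME=eno1    # placeholder interface for the out-of-band bootstrap
  srun python train.py              # launch the training tasks across the allocated nodes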