Site Reliability Engineer — GPU Infrastructure

Genmo Inc.

Mississippi

On-site

USD 100,000 - 130,000

Part time

13 days ago

Job summary

A cutting-edge AI research lab is seeking an experienced Site Reliability Engineer to manage GPU clusters and lead Kubernetes operations. This contract position requires expertise in Infrastructure-as-Code and a strong background in cloud environments. Ideal candidates will have 3+ years in a production setting and a passion for advancing video generation technologies.

Qualifications

  • 3+ years in site reliability engineering or DevOps in production.
  • Expert-level experience managing Kubernetes fleets.
  • Hands-on with containerized GPU stacks.
  • Proficient in Infrastructure-as-Code tools.

Responsibilities

  • Design and operate GPU clusters for generative models.
  • Lead Kubernetes operations and manage GPU resources.
  • Implement Infrastructure-as-Code workflows.
  • Build CI/CD pipelines and develop observability stacks.
  • Run and improve the 24×7 on-call rotation.

Skills

Kubernetes
Python
Bash
Terraform
Ansible
GPU scheduling

Education

BS/MS/PhD in Computer Science, Electrical Engineering, or related field

Tools

GitOps
Prometheus
Grafana
OpenTelemetry
NVIDIA DCGM

Job description

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

This is a contract position.

What You’ll Do
  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

  • Optimize high‑performance networking (InfiniBand/RDMA) and debug perf bottlenecks.

  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications
  • BS/MS/PhD in CS, EE, or related field.

  • 3+ years of SRE/DevOps experience in production, including 2+ years managing large Kubernetes fleets.

  • Expert‑level Kubernetes experience.

  • Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

  • Experience with GPU schedulers such as Slurm or Kueue.

  • Proficient in Python and Bash and IaC tools (Terraform, Helm, Ansible).

  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have
  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
