Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Santa Clara (CA)

On-site

USD 148,000 - 287,500

Full time

2 days ago

Job summary

As a Senior AI Infrastructure Engineer in NVIDIA's DGX Cloud SRE group, you will design, build, and maintain large-scale production systems that keep GPU cloud services reliable and performant. The work spans AI training and inference platforms, cloud infrastructure, and automation, on a team that values collaboration, problem-solving, and personal growth.

Qualifications

5+ years of experience in systems engineering and cloud environments.
Proficiency in coding and designing distributed systems.

Responsibilities

Design, build, and operate internal tooling for AI training platforms.
Manage the service lifecycle from design to deployment and operation.

Skills

Python

C/C++

Java

Linux

Networking

Storage

Containers

Infrastructure Automation

Distributed Systems

Education

Bachelor's in Computer Science

Tools

Kubernetes

Terraform

Slurm

Pay: Competitive

Employment type: Full-Time

Job Description

NVIDIA is looking for an outstanding, passionate, and talented Senior AI Infrastructure Engineer to join our DGX Cloud SRE group. In this engineering role, you will design, build, and maintain large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. The role requires knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies such as Kubernetes and OpenStack.

DGX Cloud SRE at NVIDIA ensures that our GPU cloud services, both internal and external, run with maximum reliability and uptime. The team manages system changes carefully through planning, capacity management, and performance management. NVIDIA values diversity, curiosity, problem-solving, and openness. Our culture encourages collaboration, innovation, and risk-taking in a blame-free environment, and supports personal growth through mentorship and meaningful projects.

What You’ll Be Doing:

Design, build, deploy, and operate internal tooling for large-scale AI training and inference platforms built on cloud infrastructure.
Conduct performance analysis on multi-GPU, multi-node clusters.
Manage the entire service lifecycle from design to deployment, operation, and refinement.
Support services pre-launch with system design, software tools, capacity planning, and launch reviews.
Monitor and maintain system health, availability, and latency post-launch.
Scale systems sustainably via automation and improve system reliability and speed.
Practice incident response and conduct blameless postmortems.
Participate in on-call rotations to support production systems.

What We Need To See:

BS in Computer Science or related field involving coding, or equivalent experience.
5+ years of professional experience.
Proven ability to initiate projects, collaborate, and contribute effectively to team efforts.
Experience with infrastructure automation and designing distributed systems for cloud environments.
Proficiency in Python, Go, C/C++, or Java.
Deep knowledge of Linux, Networking, Storage, and Containers.
Experience with public cloud platforms, Infrastructure as Code (IaC), and Terraform.
Experience with distributed systems.

Ways to stand out from the crowd:

Interest in large-scale distributed system analysis and troubleshooting.
Strong problem-solving, communication, ownership, and initiative.
Ability to debug, optimize, and automate tasks; experience with Kubernetes or Slurm is a plus.

NVIDIA is a top employer in tech, driven by innovative and dedicated people. If you're creative, autonomous, and love challenges, we want to hear from you. We lead in AI, HPC, and Visualization, with our GPU serving as the visual cortex of modern computers. We value diversity and are committed to equal opportunity employment.

The base salary range is $148,000 - $287,500, determined by location, experience, and market pay. Benefits and equity are also provided.

NVIDIA accepts applications on an ongoing basis and is committed to a diverse workplace, promoting inclusion regardless of race, religion, gender, age, or other protected characteristics.

About the company

Nvidia Corporation, based in Santa Clara, California, is a leading multinational technology company incorporated in Delaware.
