Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Santa Clara (CA)

On-site

USD 148,000 - 287,500

Full time

2 days ago

Job summary

Join a forward-thinking company as a Senior AI Infrastructure Engineer, where you'll design and maintain large-scale production systems. This role offers the opportunity to work with cutting-edge technologies in AI and cloud infrastructure, ensuring maximum reliability and performance for GPU cloud services. Collaborate with a diverse team that values innovation, problem-solving, and personal growth. If you thrive in a dynamic environment and are passionate about engineering excellence, this is the perfect role for you. Embrace the challenge and make a significant impact in the tech industry!

Qualifications

  • 5+ years of experience in systems engineering and cloud environments.
  • Proficiency in coding and designing distributed systems.

Responsibilities

  • Design, build, and operate internal tooling for AI training platforms.
  • Manage the service lifecycle from design to deployment and operation.

Skills

Python
Go
C/C++
Java
Linux
Networking
Storage
Containers
Infrastructure Automation
Distributed Systems

Education

Bachelor's in Computer Science

Tools

Kubernetes
Terraform
Slurm

Job description

Pay: Competitive

Employment type: Full-Time

Job Description
    Req#: JR1997176

    NVIDIA is looking for an outstanding, passionate, and talented Senior AI Infrastructure Engineer to join our DGX Cloud SRE group. This engineering role will design, build, and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This role requires knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies like Kubernetes and OpenStack.

    DGX Cloud SRE at NVIDIA ensures that our GPU cloud services—both internal and external—run with maximum reliability and uptime. The team manages system changes carefully through planning, capacity, and performance management. NVIDIA values diversity, curiosity, problem-solving, and openness. Our culture encourages collaboration, innovation, risk-taking in a blame-free environment, and supports personal growth through mentorship and meaningful projects.

    What You’ll Be Doing:
    • Design, build, deploy, and operate internal tooling for large-scale AI training and inference platforms built on cloud infrastructure.
    • Conduct performance analysis on multi-GPU, multi-node clusters.
    • Manage the entire service lifecycle from design to deployment, operation, and refinement.
    • Support services pre-launch with system design, software tools, capacity planning, and launch reviews.
    • Monitor and maintain system health, availability, and latency post-launch.
    • Scale systems sustainably via automation and improve system reliability and speed.
    • Practice incident response and conduct blameless postmortems.
    • Participate in on-call rotations to support production systems.

    What We Need To See:
    • BS in Computer Science or related field involving coding, or equivalent experience.
    • 5+ years of professional experience.
    • Proven ability to initiate projects, collaborate, and contribute effectively to team efforts.
    • Experience with infrastructure automation and designing distributed systems for cloud environments.
    • Proficiency in Python, Go, C/C++, or Java.
    • Deep knowledge of Linux, Networking, Storage, and Containers.
    • Experience with public cloud platforms, Infrastructure as Code (IaC), and Terraform.
    • Experience with distributed systems.

    Ways to stand out from the crowd:
    • Interest in large-scale distributed system analysis and troubleshooting.
    • Strong problem-solving, communication, ownership, and initiative.
    • Ability to debug, optimize, and automate tasks; experience with Kubernetes or Slurm is a plus.

    NVIDIA is a top employer in tech, driven by innovative and dedicated people. If you're creative, autonomous, and love challenges, we want to hear from you. We lead in AI, HPC, and Visualization, with our GPU serving as the visual cortex of modern computers. We value diversity and are committed to equal opportunity employment.

    The base salary range is $148,000 - $287,500, determined by location, experience, and market pay. Benefits and equity are also provided.

    NVIDIA accepts applications on an ongoing basis and is committed to a diverse workplace, promoting inclusion regardless of race, religion, gender, age, or other protected characteristics.

    About the company

    NVIDIA Corporation, based in Santa Clara, California, is a leading multinational technology company incorporated in Delaware.

Similar jobs

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Remote

USD 144,000 - 271,000

12 days ago

Senior AI Infrastructure Engineer - DGX Cloud

Nvidia Corporation

Santa Clara

On-site

USD 144,000 - 271,000

10 days ago

HPC Engineer

RCH Solutions

San Francisco

Remote

USD 90,000 - 150,000

8 days ago

Platform Architect - AWS

Quantiphi

Marlborough

Remote

USD 125,000 - 228,000

Today

Technical Support Engineer, Linux and HPC Admin - DGX Cloud

NVIDIA Corporation

Santa Clara

On-site

USD 108,000 - 202,000

Yesterday

AI Infrastructure Engineer - HPC

Cisco Systems, Inc.

California

On-site

USD 120,000 - 170,000

6 days ago

AI Solutions Architect – NVIDIA

DDN

San Francisco

On-site

USD 143,000 - 177,000

Today

AI Solutions Architect – NVIDIA

DataDirect Networks, Inc.

San Francisco

Hybrid

USD 120,000 - 180,000

5 days ago

Senior Site Reliability Engineer - DGX Cloud

NVIDIA

Santa Clara

On-site

USD 144,000 - 271,000

7 days ago