Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems

NVIDIA

United States

Remote

USD 144,000 - 270,250

Full time

30+ days ago

Job summary

A leading technology company is seeking Software Engineers to build and run cloud infrastructure services. You will support AI training and inference development while ensuring operational capacity and reliability in a dynamic environment. Ideal candidates will have strong programming skills and experience in infrastructure management.

Benefits

Equity
Benefits

Qualifications

  • 5+ years of relevant experience in infrastructure and fleet management engineering.
  • Experience with infrastructure automation and distributed systems design.

Responsibilities

  • Design, build, and run cloud infrastructure services to meet business goals.
  • Participate in defining service level objectives and error budgets.

Skills

Python
Go
Linux
Kubernetes
Systems Networking

Education

BS degree in Computer Science
Physics
Mathematics

Tools

Slurm
Docker
OpenStack

Job description

We are seeking Software Engineers with experience in building and managing private and public clouds at production scale. Join the DGX Cloud team to support AI training and inference development by creating platforms, tools, and services that ensure the operational capacity of our bare-metal, accelerated compute infrastructure and promote reliability best practices within the DGX Cloud ecosystem.

What you’ll be doing:

  • Design, build, and operate cloud infrastructure services aligned with business goals, including integrations, migrations, updates, and decommissions.
  • Define internal service level objectives and error budgets as part of our observability strategy.
  • Automate repetitive tasks to improve efficiency where automation yields a positive ROI.
  • Engage in incident prevention and response, participating in an on-call rotation.
  • Consult with peer teams on systems design best practices.
  • Contribute to a culture of values-driven introspection, communication, and self-organization.

What we need to see:

  • Proficiency in Python or Go.
  • BS degree in Computer Science, related technical field, or equivalent experience.
  • 5+ years in infrastructure and fleet management engineering.
  • Experience with automation and distributed systems for large-scale cloud environments in production.
  • A demonstrated ability to initiate projects, collaborate effectively, and lead initiatives.
  • Deep knowledge of Linux, Slurm, Kubernetes, Storage, and Networking.

Ways to stand out from the crowd:

  • Systematic problem-solving skills, clear communication, ownership, and a results-oriented mindset.
  • Experience with bare metal as a service (BMaaS), multi-cloud infrastructure, and reliability engineering practices.
  • Knowledge of accelerated compute technologies such as BlueField networking, InfiniBand, NVMesh, and NCCL, plus experience collaborating with security teams.
  • Experience in ML/AI roles or related projects.

NVIDIA leads in AI, HPC, and Visualization. Our GPU technology powers innovation from AI to autonomous vehicles. We seek talented, creative, and autonomous individuals to help us accelerate AI advancements.

The base salary range is $144,000 - $270,250, determined by location, experience, and market factors. Benefits and equity are included. NVIDIA is an equal opportunity employer committed to diversity and inclusion.
