Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

United States

Remote

USD 144,000 - 270,250

Full time

11 days ago

Job summary

An established industry player is seeking systems and software engineers to join their dynamic team. This role involves designing and maintaining internal tooling on cloud infrastructure, developing data pipelines for leadership decision-making, and integrating workflows to streamline incident management. With a focus on operational excellence, you'll tackle technical challenges and promote best practices across the organization. If you're passionate about innovation and eager to make an impact in a fast-paced environment, this opportunity is perfect for you.

Qualifications

  • 5+ years of experience in systems and software engineering.
  • Strong background in infrastructure automation and distributed systems.

Responsibilities

  • Design and maintain internal tooling on cloud infrastructure.
  • Develop and maintain data pipelines for business decision-making.

Skills

Python
Go
TypeScript
C/C++
Java
Linux
Networking
Storage
Containers

Education

BSc in Computer Science

Tools

Kubernetes
Terraform
Docker
Helm
Hive
Apache Beam
Spark
Looker
Tableau

Job description

DGXC SRE at NVIDIA

At NVIDIA, our DGXC SRE team ensures that our internal and external GPU cloud services operate with maximum reliability and uptime, fulfilling our promises to users. We enable developers to modify existing systems through careful planning, while monitoring capacity, latency, and performance.

We are looking for systems and software engineers who want to build tooling, reporting, automation, and ML solutions that promote operational excellence and improve operational efficiency across teams in a dynamic organization.

What you’ll be doing:
  1. Designing, building, deploying, and maintaining internal tooling on cloud infrastructure to support operational excellence.
  2. Developing, deploying, and maintaining data pipelines used by leadership for business decision-making.
  3. Integrating tooling with internal workflows, customer processes, and cloud service providers to streamline incident management.
  4. Reducing operational toil related to incident handling, postmortems, and on-call responsibilities.
  5. Promoting blameless incident prevention and response practices.
  6. Providing operational best practices consultation to peer teams.

What we need to see:
  • BSc in Computer Science or related technical field involving coding, or equivalent experience.
  • 5+ years of relevant experience.
  • A proven track record of initiating projects and collaborating effectively on team initiatives.
  • Experience with infrastructure automation and designing distributed systems for large-scale cloud operations.
  • Proficiency in one or more of Python, Go, TypeScript, C/C++, or Java.
  • Deep knowledge of Linux, networking, storage, or containers.

Ways to stand out from the crowd:
  • Experience with incident tooling such as FireHydrant, Rootly, incident.io, or Blameless, and with plugin development in Backstage.
  • Background in infrastructure technologies such as Kubernetes, Terraform, Docker, and Helm, and familiarity with ML/data science tools such as Hive, Apache Beam, and Spark.
  • Experience with business analytics tools such as Looker and Tableau, along with a systematic approach to problem-solving, strong communication skills, and a sense of ownership.

NVIDIA is recognized as a top employer in technology, known for innovative work in AI, HPC, and Visualization. Our inventions, like the GPU, are central to modern computing. We seek creative, autonomous, and motivated individuals eager to face challenges. NVIDIA offers a competitive salary range of $144,000 to $270,250, plus equity and benefits, with salary determined by location and experience. We are committed to diversity and equal opportunity employment, welcoming applicants regardless of race, religion, gender, or other protected characteristics.
