
Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA Corporation

Santa Clara (CA)

On-site

USD 144,000 - 270,250

Full time

11 days ago


Job summary

Join NVIDIA's DGX Cloud team as a Senior AI Infrastructure Engineer, where you'll design, build, and operate cloud-based internal tooling that supports operational excellence. The role focuses on streamlining incident management, reducing operational toil across teams, and building data pipelines that executive leadership uses for decision-making. NVIDIA values creativity and autonomy, making this a strong fit for engineers passionate about AI and cloud infrastructure.

Benefits

Equity options
Health insurance
Flexible working hours
Professional development
Diversity and inclusion programs

Qualifications

  • 5+ years of experience in systems and software engineering.
  • Experience with infrastructure automation and distributed systems.

Responsibilities

  • Design and operate internal tooling based on cloud infrastructure.
  • Develop and maintain data pipelines for executive decision-making.

Skills

Python
Go
TypeScript
C/C++
Java
Linux
Networking
Storage
Containers
Infrastructure Automation

Education

BSc in Computer Science

Tools

Kubernetes
Terraform
Docker
Helm
Hive
Apache Beam
Spark
Looker
Tableau
FireHydrant

Job description

Senior AI Infrastructure Engineer - DGX Cloud (Finance)

DGXC SRE at NVIDIA ensures that our internal and external GPU cloud services run with the reliability and uptime we promise to users. We enable developers to make changes to the existing system through careful preparation and planning, while monitoring capacity, latency, and performance.

We are seeking systems and software engineers interested in building tooling, reporting, automation, and ML solutions that enable operational excellence across a dynamic organization and solve technical problems that improve efficiency for multiple teams.

What you'll be doing:
  • Design, build, deploy, and operate internal tooling based on cloud infrastructure to support operational excellence.
  • Develop, implement, and maintain data pipelines used by executive leadership for decision-making.
  • Integrate tooling with internal and customer workflows, as well as cloud service providers, to streamline incident management processes.
  • Reduce operational toil related to incident handling, postmortems, and on-call tasks.
  • Promote sustainable, blameless incident prevention and response practices.
  • Provide operational best practices consultation to peer teams.

What we need to see:
  • BSc in Computer Science or a related technical field involving coding, or equivalent experience.
  • 5+ years of relevant experience.
  • A proven track record of initiating projects, collaborating effectively, and contributing to shared team goals.
  • Experience with infrastructure automation and designing distributed systems for large-scale cloud environments.
  • Proficiency in one or more of the following: Python, Go, TypeScript, C/C++, Java.
  • Deep knowledge of Linux, Networking, Storage, or Containers.

Ways to stand out:
  • Experience with incident tooling such as FireHydrant, Rootly, incident.io, or Blameless, including plugin and schema development in Backstage.
  • Background in infrastructure technologies such as Kubernetes, Terraform, Docker, and Helm, plus familiarity with basic ML/data science tools such as Hive, Apache Beam, and Spark.
  • Experience with business analytics tools such as Looker or Tableau.
  • A systematic approach to problem-solving, along with strong communication, ownership, and initiative.

NVIDIA is widely recognized as one of the most desirable employers in the technology industry, known for innovative work in AI, HPC, and visualization. We invite creative, autonomous, and motivated individuals to join us. Our inventions, including the GPU, are central to modern computing.

The salary range is $144,000 - $270,250, determined by location, experience, and comparable roles. Additional benefits and equity are offered. Applications are accepted on an ongoing basis.

NVIDIA is committed to diversity and equal opportunity, welcoming applicants regardless of race, religion, gender, age, or other protected characteristics.


Similar jobs

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Santa Clara

On-site

USD 148,000 - 288,000

2 days ago

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Remote

USD 144,000 - 271,000

13 days ago

HPC Engineer

RCH Solutions

San Francisco

Remote

USD 90,000 - 150,000

9 days ago

Platform Architect - AWS

Quantiphi

Marlborough

Remote

USD 125,000 - 228,000

Yesterday

AI Infrastructure Engineer - HPC

Cisco Systems, Inc.

California

On-site

USD 120,000 - 170,000

6 days ago

Technical Support Engineer, Linux and HPC Admin - DGX Cloud

NVIDIA Corporation

Santa Clara

On-site

USD 108,000 - 202,000

2 days ago

AI Solutions Architect – NVIDIA

DDN

San Francisco

On-site

USD 143,000 - 177,000

Yesterday

AI Solutions Architect – NVIDIA

DataDirect Networks, Inc.

San Francisco

Hybrid

USD 120,000 - 180,000

5 days ago

Senior Site Reliability Engineer - DGX Cloud

NVIDIA

Santa Clara

On-site

USD 144,000 - 271,000

8 days ago