Senior DevOps Engineer, IPP Sanity Engineering

NVIDIA

Santa Clara (CA)

On-site

USD 168,000 - 333,500

Full time

2 days ago

Job summary

A leading tech company is seeking a Senior DevOps Engineer to enhance their cloud services infrastructure. The role involves leading GPU product bringups, optimizing resource utilization, and automating complex processes using cutting-edge tools. Ideal candidates will have extensive experience in software engineering, strong problem-solving skills, and a passion for innovative technology. Join a dynamic team dedicated to pushing the boundaries of AI and computing.

Benefits

Competitive salaries

Generous benefits package

Opportunities for growth

Equity eligibility

Qualifications

10+ years of relevant experience, including 7+ years in development and operations.
Hands-on coding and debugging experience on various platforms.

Responsibilities

Lead end-to-end infrastructure bringup execution of new NVIDIA GPU products.
Champion configuration automation using world-class configuration management tools.
Automate and tune performance of regression test frameworks.

Skills

Python

Linux

Debugging

Automation

Problem Solving

Education

Bachelor's or Master's Degree in Computer Science

Tools

Chef

Puppet

Ansible

Terraform

MySQL

NoSQL

Perforce

Git

Docker

Kubernetes

NVIDIA is looking for a Senior DevOps Engineer to join IPP (Infrastructure, Planning and Process) Sanity Engineering and execute NVIDIA product bringups. IPP is a core software infrastructure organization within NVIDIA. This group collaborates with various NVIDIA Software groups such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars to support their infrastructure needs. IPP's cloud services run almost half a million automated jobs daily across thousands of distributed datacenters, enhancing productivity for NVIDIA's software engineers worldwide.

The cloud hosts a heterogeneous mix of machines and devices with various operating systems (Windows/Linux/Android) and hardware platforms including NVIDIA GPUs and Tegra Processors. Are you passionate about distributed infrastructure and eager to build the next generation of cloud services for chip bringups? Do you enjoy designing creative solutions, analyzing data to uncover real problems, and implementing fixes? We are looking forward to onboarding a fun-loving person like you.

What you'll be doing

Lead end-to-end infrastructure bringup execution of new NVIDIA GPU products.
Develop a thorough understanding of NVIDIA GPU hardware, display driver stack, SBIOS, VBIOS, and enhance automation for farm-wide updates.
Solve complex problems on groundbreaking pre-release products, leading GPU product bringups (PCIe & Enterprise), integrating GPU test suites with infrastructure harnesses, and scaling multi-site distributed infrastructure.
Optimize farm utilization of GPU resources by identifying appropriate regression test coverage.
Champion configuration automation using world-class configuration management and infrastructure automation (IaC) tools like Chef, Puppet, Ansible, Terraform, etc.
Execute bringup of specialized products for accelerated computing and AI in fast-paced, critical environments.
Lead a service charter responsible for the development, telemetry, and automation of the bringup infrastructure.
Automate and tune performance of regression test frameworks, and create self-healing/automated recovery solutions for multi-geo regression farms.
Collaborate closely with partner teams across organizations to onboard new products into CI/CD pipelines.
Run multiple bringups in parallel, seamlessly, within NVIDIA's Product Bringup landscape.

What we need to see

Bachelor's or Master's Degree in Computer Science, Software Engineering, or equivalent experience.
10+ years of relevant experience.
Hands-on coding and debugging experience, including cross-compiling source code on various platforms, triaging, root cause analysis, and resolving issues in the bringup infrastructure.
Familiarity with Linux, Windows (x64 and ARM), VM, and container-based environments.
Programming experience in Python (preferred), Java, or similar languages.
Proficiency with Unix shell and Tcl scripting.
Experience with MySQL/NoSQL databases, capable of writing complex queries.
Experience with version control systems like Perforce and Git.
Demonstrable experience working with large-scale enterprise production systems; 7+ years of development and operations experience required.

Ways to stand out from the crowd

Experience automating bare-metal and VM provisioning.
Knowledge of VM isolation for GPUs and NVIDIA Confidential Computing is a plus.
Experience with public clouds (AWS, GCP, Azure), virtualization technologies (VMware, KVM, Hyper-V), and container orchestration (Docker, Kubernetes).
Experience with GPU performance debugging, embedded device software development, automation, driver development, and CUDA/TensorRT applications.

We are considered one of the most desirable employers in the tech industry, with forward-thinking and hardworking teams. If you're passionate, creative, and driven, we'd love to hear from you. We offer competitive salaries, a generous benefits package, and opportunities for growth. Our salary range is $168,000 - $333,500, determined by location, experience, and market rates. You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse, inclusive work environment and is an equal opportunity employer.
