Senior Site Reliability Engineer - DGX Cloud

NVIDIA

Santa Clara (CA)

On-site

USD 144,000 - 271,000

Full time

9 days ago

Job summary

An established industry player is seeking a Senior Site Reliability Engineer to enhance its GPU cloud services. This role involves designing and supporting large-scale Kubernetes clusters, ensuring maximum reliability and uptime. You will engage in the entire service lifecycle, from design to operation, while fostering a culture of collaboration and problem-solving. If you have a passion for automation and performance tuning, this is a fantastic opportunity to join a forward-thinking team committed to innovation and excellence in the tech industry.

Qualifications

5+ years of experience in SRE or related fields.
Proficiency in coding and infrastructure automation.

Responsibilities

Design and support large-scale Kubernetes clusters.
Monitor system health and maintain live services.

Skills

Kubernetes

Python

Perl

Ruby

Linux

Networking

Automation

Education

BS in Computer Science

Tools

Docker

OpenStack

Senior Site Reliability Engineer - DGX Cloud

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. It combines software and systems engineering practices and demands knowledge across systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies such as Kubernetes and OpenStack.

SRE at NVIDIA ensures that our GPU cloud services—both internal and external—operate with maximum reliability and uptime. We enable developers to make changes through careful planning while monitoring capacity, latency, and performance. SRE also embodies a mindset and engineering approach to optimize production systems, emphasizing automation, performance tuning, and efficiency.

Our culture values diversity, curiosity, problem-solving, and openness. We foster collaboration, big thinking, risk-taking, and self-direction, supported by mentorship and growth opportunities.

What you'll be doing:

Design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, focusing on performance, real-time monitoring, logging, and alerting.
Engage in and improve the entire service lifecycle—from design and deployment to operation and refinement.
Support services before launch through system design consulting, developing tools, capacity management, and review processes.
Maintain live services by monitoring availability, latency, and system health.
Scale systems sustainably via automation and drive improvements in reliability and velocity.
Practice sustainable incident response and conduct blameless postmortems.
Participate in on-call rotations to support production systems.

What we need to see:

BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
5+ years of relevant experience.
Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems.
Proficiency in one or more of the following: Python, Go, Perl, or Ruby.
In-depth knowledge of Linux, networking, and containers.

Ways to stand out from the crowd:

Interest in large-scale distributed systems analysis and troubleshooting.
Strong problem-solving and communication skills, with a sense of ownership and initiative.
Ability to debug, optimize, and automate tasks effectively.
Experience operating large private/public cloud systems based on Kubernetes, OpenStack, and Docker.

NVIDIA is considered one of the most desirable employers in the tech industry, with forward-thinking and dedicated professionals. If you're creative, autonomous, and love challenges, we want to hear from you.

The base salary range is $144,000 - $270,250, determined by location, experience, and current market rates. You may also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

We are committed to diversity and equal opportunity, welcoming applicants regardless of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or other protected characteristics.

Similar jobs

HPC Engineer

RCH Solutions

San Francisco

Remote

USD 90,000 - 150,000

9 days ago

Platform Architect - AWS

Quantiphi

Marlborough

Remote

USD 125,000 - 228,000

Yesterday

Be an early applicant

Technical Support Engineer, Linux and HPC Admin - DGX Cloud

NVIDIA Corporation

Santa Clara

On-site

USD 108,000 - 202,000

2 days ago

Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Santa Clara

On-site

USD 148,000 - 288,000

2 days ago

Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Remote

USD 144,000 - 271,000

13 days ago

AI Solutions Architect – NVIDIA

DDN

San Francisco

On-site

USD 143,000 - 177,000

Yesterday

Be an early applicant

AI Solutions Architect – NVIDIA

DataDirect Networks, Inc.

San Francisco

Hybrid

USD 120,000 - 180,000

5 days ago

Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

Nvidia Corporation

Santa Clara

On-site

USD 144,000 - 271,000

10 days ago

Salesforce Service Cloud Architect

Wizr AI

Menlo Park

On-site

USD 155,000 - 215,000

9 days ago
