Senior Site Reliability Engineer - DGX Cloud

NVIDIA

Santa Clara (CA)

On-site

USD 144,000 - 271,000

Full time

9 days ago

Job summary

An established industry player is seeking a Senior Site Reliability Engineer to enhance its GPU cloud services. This role involves designing and supporting large-scale Kubernetes clusters, ensuring maximum reliability and uptime. You will engage in the entire service lifecycle, from design to operation, while fostering a culture of collaboration and problem-solving. If you have a passion for automation and performance tuning, this is a fantastic opportunity to join a forward-thinking team committed to innovation and excellence in the tech industry.

Qualifications

  • 5+ years of experience in SRE or related fields.
  • Proficiency in coding and infrastructure automation.

Responsibilities

  • Design and support large-scale Kubernetes clusters.
  • Monitor system health and maintain live services.

Skills

Kubernetes
Python
Go
Perl
Ruby
Linux
Networking
Automation

Education

BS in Computer Science

Tools

Docker
OpenStack

Job description

Senior Site Reliability Engineer - DGX Cloud

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline focused on designing, building, and maintaining large-scale production systems with high efficiency and availability. This involves a combination of software and systems engineering practices. The discipline demands knowledge across various systems, networking, coding, databases, capacity management, continuous delivery, deployment, and open-source cloud technologies like Kubernetes and OpenStack.

SRE at NVIDIA ensures that our GPU cloud services—both internal and external—operate with maximum reliability and uptime. We enable developers to make changes through careful planning while monitoring capacity, latency, and performance. SRE also embodies a mindset and engineering approach to optimize production systems, emphasizing automation, performance tuning, and efficiency.

Our culture values diversity, curiosity, problem-solving, and openness. We foster collaboration, big thinking, risk-taking, and self-direction, supported by mentorship and growth opportunities.

What you'll be doing:
  • Design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, focusing on performance, real-time monitoring, logging, and alerting.
  • Engage in and improve the entire service lifecycle—from design and deployment to operation and refinement.
  • Support services before launch through system design consulting, developing tools, capacity management, and review processes.
  • Maintain live services by monitoring availability, latency, and system health.
  • Scale systems sustainably via automation and drive improvements in reliability and velocity.
  • Practice sustainable incident response and conduct blameless postmortems.
  • Participate in on-call rotations to support production systems.

What we need to see:
  • BS degree in Computer Science or a related technical field involving coding, or equivalent experience.
  • 5+ years of relevant experience.
  • Experience with infrastructure automation, distributed systems design, and developing tools for large-scale cloud systems.
  • Proficiency in one or more of the following: Python, Go, Perl, or Ruby.
  • In-depth knowledge of Linux, networking, and containers.

Ways to stand out from the crowd:
  • Interest in large-scale distributed systems analysis and troubleshooting.
  • Strong problem-solving and communication skills, along with ownership and initiative.
  • Ability to debug, optimize, and automate tasks effectively.
  • Experience operating large private/public cloud systems based on Kubernetes, OpenStack, and Docker.

NVIDIA is widely considered one of the most desirable employers in the tech industry, and our teams are made up of forward-thinking, dedicated professionals. If you're creative, autonomous, and love challenges, we want to hear from you.

The base salary range is $144,000 - $270,250, determined by location, experience, and current market rates. You may also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

We are committed to diversity and equal opportunity, welcoming applicants regardless of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, disability, or other protected characteristics.

Similar jobs

HPC Engineer

RCH Solutions

San Francisco

Remote

USD 90,000 - 150,000

9 days ago

Platform Architect - AWS

Quantiphi

Marlborough

Remote

USD 125,000 - 228,000

Yesterday
Be an early applicant

Technical Support Engineer, Linux and HPC Admin - DGX Cloud

NVIDIA Corporation

Santa Clara

On-site

USD 108,000 - 202,000

2 days ago
Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Santa Clara

On-site

USD 148,000 - 288,000

2 days ago
Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA

Remote

USD 144,000 - 271,000

13 days ago

AI Solutions Architect – NVIDIA

DDN

San Francisco

On-site

USD 143,000 - 177,000

Yesterday
Be an early applicant

AI Solutions Architect – NVIDIA

DataDirect Networks, Inc.

San Francisco

Hybrid

USD 120,000 - 180,000

5 days ago
Be an early applicant

Senior AI Infrastructure Engineer - DGX Cloud

NVIDIA Corporation

Santa Clara

On-site

USD 144,000 - 271,000

10 days ago

Salesforce Service Cloud Architect

Wizr AI

Menlo Park

On-site

USD 155,000 - 215,000

9 days ago