HPC Site Reliability Engineer

Trust In SODA

San Francisco (CA)

Remote

USD 200,000 - 220,000

Full time

7 days ago

Job summary

A leading tech company seeks a Senior HPC Site Reliability Engineer to ensure the reliability and performance of cutting-edge Nvidia-based HPC systems. This remote role offers high-impact work, autonomy, and collaboration in a dynamic environment. Ideal candidates will have extensive HPC and networking experience and will make significant contributions to the company's infrastructure.

Benefits

Medical insurance
Vision insurance
401(k)

Qualifications

  • 6+ years in HPC or networking-heavy roles.
  • Expertise in BGP, EVPN, VXLAN, and RDMA.
  • Experience in high-stakes environments as an SRE.

Responsibilities

  • Set up and optimize HPC clusters and networks.
  • Debug low-level networking issues.
  • Automate configurations with Ansible and Terraform.
  • Monitor systems with Grafana and other tools.

Skills

HPC
Networking
Automation
Reliability Engineering

Job description

This range is provided by Trust In SODA. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

$200,000.00/yr - $220,000.00/yr

Love solving gnarly problems in AI infrastructure?

Our client is building the AI Native GPU Cloud—and we need a senior HPC Site Reliability Engineer to keep it humming.

You’ll own the reliability and performance of our cutting-edge Nvidia-based HPC systems. Think DGX clusters, RoCE topologies, and automation pipelines built in Ansible and Terraform. If that lights you up, read on.

This role is remote (US-based) and offers the chance to shape our infrastructure from the ground up. Expect high-impact work, loads of autonomy, and collaboration with smart folks across architecture, engineering, and ops.

You’ll:

  • Set up and optimize HPC clusters and networks (think DGX, HGX, GPUDirect)
  • Debug low-level networking issues with Cisco, Juniper, and more
  • Automate configs with Ansible + Terraform
  • Monitor everything with Grafana, UFM, ELK, NetQ
  • Own 24/7 reliability, on-call, and root cause analysis

This role is perfect if you:

  • Have 6+ years in HPC or networking-heavy roles
  • Know BGP, EVPN, VXLAN, RDMA inside and out
  • Have SRE experience in high-stakes environments
  • Love solving infra puzzles at scale

Bonus points for CCIE/JNCIS, InfiniBand, or cloud/HPC interconnect experience.

Sound like your kind of challenge? Hit apply and let’s talk.

  • Seniority level: Not Applicable
  • Employment type: Full-time
  • Job function: Information Technology
  • Industries: Computer Networking Products, Software Development, and IT Services and IT Consulting
