Staff Site Reliability Engineer

Ipro Networks Pte. Ltd.

Palo Alto (CA)

Remote

USD 200,000 - 250,000

Full time

3 days ago

Job summary

Ipro Networks, a cutting-edge AI infrastructure firm, is seeking a Staff Site Reliability Engineer to scale and maintain its GPU infrastructure. The role involves designing resilient systems, managing multiple clusters, and ensuring continuous uptime. It comes with a competitive salary and equity package.

Benefits

Competitive equity package (stock options)
Comprehensive health benefits
Generous PTO and flexible work policies
Support for ongoing professional development

Qualifications

  • 7+ years of experience as a reliability, infrastructure, or production engineer.
  • Deep knowledge of GPU infrastructure and cloud networking.
  • Proficiency in one or more scripting or programming languages.

Responsibilities

  • Manage thousands of GPUs across multiple cloud providers.
  • Design scalable solutions for AI model training and data processing.
  • Set up monitoring systems to proactively detect issues.

Skills

Troubleshooting
Systems Thinking
Collaboration
Programming
Infrastructure Management

Tools

Kubernetes
Terraform
Prometheus
Grafana
DataDog
Splunk

Job description

Staff Site Reliability Engineer (Remote, US)

Compensation: $200K–$250K + Equity
Full-Time | Remote | Infrastructure Team

We’re hiring a Staff Site Reliability Engineer to help scale and maintain the massive GPU infrastructure that powers our cutting-edge AI systems. If you're passionate about building robust systems and solving deep infrastructure challenges at scale, this role is for you.

What You’ll Be Doing
  • Work closely with engineers and researchers to define and meet system performance, availability, and efficiency requirements.

  • Operate and manage thousands of GPUs distributed across multiple cloud providers and clusters.

  • Design scalable solutions to support rapid growth in compute demands for AI model training, data processing, and inference.

  • Build resilient, fault-tolerant systems to ensure continuous uptime and seamless performance.

  • Develop automation tools to eliminate toil and streamline infrastructure operations.

  • Set up and maintain monitoring systems to proactively detect issues and drive performance improvements.

  • Define and track SLOs and SLIs that uphold system reliability standards.

  • Participate in an on-call rotation to ensure 24/7 system availability.

Qualifications
  • 7+ years of proven experience as a reliability engineer, infrastructure engineer, or production engineer in fast-paced, high-growth environments.

  • Deep knowledge of GPU infrastructure, including scheduling, scaling, cloud networking, storage, and security.

  • Proficiency in one or more scripting or programming languages.

  • Strong experience with Kubernetes or similar container orchestration systems.

  • Familiarity with Infrastructure-as-Code tools like Terraform or CloudFormation.

  • Experience working with observability tools like Prometheus, Grafana, DataDog, ELK, or Splunk.

  • Excellent troubleshooting, debugging, and systems-thinking skills.

  • Strong communication skills and a collaborative mindset.

  • Bonus: Experience with AI/ML infrastructure or with managing large-scale GPU clusters.

What We're Building

We're developing highly complex infrastructure to support advanced AI research and production systems running on thousands of GPUs. This is an opportunity to work on some of the most demanding reliability and performance challenges in tech today. You’ll have a direct impact on how infrastructure supports foundation model development and deployment.

Compensation & Benefits

Base Salary: $200K–$250K/year
Competitive equity package (stock options)
Comprehensive health benefits
Generous PTO and flexible work policies
Support for ongoing professional development
