Enable job alerts via email!

HPC Site Reliability Engineer (SRE) Engineering · US, Remote Working ·

ORI

United States

Remote

USD 100,000 - 125,000

Full time

30+ days ago

Boost your interview chances

Create a job specific, tailored resume for higher success rate.

Job summary

Join a pioneering company at the forefront of AI and HPC technology as an HPC Site Reliability Engineer. In this dynamic role, you will manage and optimize high-performance computing environments, ensuring their reliability and performance. You will collaborate with cross-functional teams to drive innovations that enhance user experiences while providing 24/7 support for critical systems. Your expertise in networking technologies and automation will be key in maintaining the infrastructure, troubleshooting issues, and implementing monitoring strategies. This is a fantastic opportunity to make a significant impact in a fast-paced, cutting-edge environment that values continuous improvement and collaboration.

Qualifications

  • 6+ years in networking and data centre operations with HPC architectures.
  • Expertise in TCP/UDP, IPv4/IPv6, BGP/MP-BGP, and other networking protocols.

Responsibilities

  • Maintain and optimise HPC infrastructure for reliability and performance.
  • Lead investigations into high-priority incidents and prepare Root Cause Analysis.

Skills

Networking Technologies
HPC Architecture
Ansible
Terraform
Linux OS
Scripting
Problem Solving

Education

Bachelor’s or Master’s in Telecommunications
Bachelor’s or Master’s in Computer Science
Bachelor’s or Master’s in Electrical and Computer Engineering

Tools

Grafana Cloud
ELK
NVIDIA UFM
NetQ

Job description

Job Purpose

Ori is the AI Native GPU Cloud - this role is pivotal in shaping our HPC infrastructure. As an HPC SRE, you will play a crucial role in managing, optimising, and ensuring the reliability of our high-performance computing environments. You will be the go-to expert for all technical aspects of our HPC infrastructure, including system architecture, optimization, integrations, and networking. You will collaborate with cross-functional teams, driving innovations that align with business objectives and enhance user experiences. This role also ensures 24/7 support, maintaining high availability and performance for HPC systems.

Key Responsibilities

Infrastructure Management
  • Maintain and optimise HPC infrastructure, ensuring reliability and performance of Nvidia-based systems.
  • Set up HPC clusters with DGX or HGX platforms, GPU Direct, and establish network optimization for server-to-storage or storage-to-storage connectivity, including multi-cloud and WAN HPC interconnectivity.
  • Configure, troubleshoot, and quick-fix Networking R&S hardware from Cisco, Juniper, or relevant vendors.
Automation and Efficiency
  • Write, execute, and debug Ansible Playbooks for Cumulus Linux automation.
  • Utilise and maintain automated configuration management systems such as Ansible and Terraform.
Incident Management
  • Lead investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
  • Proactively monitor data centre health checks, licensing, and life-cycle management upgrades.
  • Provide 24/7 support through on-call rotations, ensuring continuous availability and rapid incident response.
Monitoring and Observability
  • Use observability metrics tools like Grafana Cloud, ELK, NVIDIA UFM, NetQ, and QoS metrics to monitor system health and performance.
  • Develop and implement monitoring strategies to ensure high availability and performance of HPC systems.
Collaboration and Communication
  • Collaborate with HPC solution architects and engineers to drive innovation and optimization.
  • Provide regular reports on P1/P2 incidents, RCAs, life-cycle upgrades, and change/incident management actions to senior management.
  • Maintain comprehensive documentation of infrastructure audits and policy changes.

Key Objectives and Goals

  • Reliability: Achieve and maintain high availability and uptime for HPC systems.
  • Performance: Continuously optimise the performance of Nvidia-based and other HPC systems.
  • Scalability: Develop scalable HPC solutions to support ongoing business growth.
  • Automation: Increase the level of automation to enhance efficiency and reduce manual tasks.
  • Continuous Availability: Ensure 24/7 support through effective coverage and on-call practices.
  • Collaboration: Foster a collaborative environment within the SRE teams and with other departments.
  • Continuous Improvement: Promote a culture of ongoing learning and improvement.

Key Metrics

  • MTTR (Mean Time to Recovery): Measure and minimise the time taken to recover from incidents.
  • MTBF (Mean Time Between Failures): Monitor and maximise the time between system failures.
  • System Uptime: Track and maintain high levels of system availability.
  • Service Level Objectives (SLOs): Set and meet clear SLOs for reliability and performance.
  • Service Level Indicators (SLIs): Define and monitor SLIs to ensure service quality.

Required Qualifications

  • Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field.
  • 6+ years of proven experience in networking and data centre operations, particularly with recent HPC architectures, NetDevOps workflows, NVIDIA Air, and GNS3 simulations.
  • 3+ years of experience as a Site Reliability Engineer or in a similar role.
  • Expertise in networking technologies: TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VxLAN, SHARP, Segment Routing, BGP, MPLS, IS-IS, DWDM.
  • In-depth knowledge of network protocols such as RoCE, RDMA, IBoE, and network topologies like Spine Leaf, Link/Super Spine Switching, and Fat-Free topology.
  • Background in troubleshooting or testing server hardware/firmware, Linux OS, CLIs, and scripting.
  • Excellent problem-solving and on-demand decision-making skills.

Desired Skills

  • Certifications equivalent to CCIE, JNCIS, or InfiniBand NCP-IB.
  • Experience with automated configuration management systems like Ansible and Terraform.
  • Ability to handle high-pressure situations in HPC AI data centres.
  • Strong collaboration skills with HPC solution architects and engineers.

This job description is not intended to be all-inclusive. Employees may perform other duties as assigned.

Get your free, confidential resume review.
or drag and drop a PDF, DOC, DOCX, ODT, or PAGES file up to 5MB.

Similar jobs

HPC Site Reliability Engineer (SRE) Engineering US, Remote Working

ORI

Remote

USD 120,000 - 160,000

Today
Be an early applicant