Job Purpose
Ori is the AI Native GPU Cloud, and this role is pivotal in shaping our HPC infrastructure. As an HPC SRE, you will manage, optimise, and ensure the reliability of our high-performance computing environments. You will be the go-to expert for all technical aspects of our HPC infrastructure, including system architecture, optimisation, integrations, and networking. You will collaborate with cross-functional teams, driving innovations that align with business objectives and enhance user experiences. This role also provides 24/7 support to maintain high availability and performance for HPC systems.
Key Responsibilities
Infrastructure Management
- Maintain and optimise HPC infrastructure, ensuring reliability and performance of NVIDIA-based systems.
- Set up HPC clusters on DGX or HGX platforms with GPUDirect, and optimise the network for server-to-storage and storage-to-storage connectivity, including multi-cloud and WAN HPC interconnectivity.
- Configure, troubleshoot, and rapidly repair routing and switching (R&S) network hardware from Cisco, Juniper, or other relevant vendors.
Automation and Efficiency
- Write, run, and debug Ansible playbooks for Cumulus Linux automation.
- Utilise and maintain automated configuration management systems such as Ansible and Terraform.
Incident Management
- Lead investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
- Proactively run data centre health checks and track licensing and life-cycle management upgrades.
- Provide 24/7 support through on-call rotations, ensuring continuous availability and rapid incident response.
Monitoring and Observability
- Use observability tools such as Grafana Cloud, ELK, NVIDIA UFM, and NetQ, along with QoS metrics, to monitor system health and performance.
- Develop and implement monitoring strategies to ensure high availability and performance of HPC systems.
Collaboration and Communication
- Collaborate with HPC solution architects and engineers to drive innovation and optimisation.
- Provide regular reports on P1/P2 incidents, RCAs, life-cycle upgrades, and change/incident management actions to senior management.
- Maintain comprehensive documentation of infrastructure audits and policy changes.
Key Objectives and Goals
- Reliability: Achieve and maintain high availability and uptime for HPC systems.
- Performance: Continuously optimise the performance of NVIDIA-based and other HPC systems.
- Scalability: Develop scalable HPC solutions to support ongoing business growth.
- Automation: Increase the level of automation to enhance efficiency and reduce manual tasks.
- Continuous Availability: Ensure 24/7 support through effective coverage and on-call practices.
- Collaboration: Foster a collaborative environment within the SRE teams and with other departments.
- Continuous Improvement: Promote a culture of ongoing learning and improvement.
Key Metrics
- MTTR (Mean Time to Recovery): Measure and minimise the time taken to recover from incidents.
- MTBF (Mean Time Between Failures): Monitor and maximise the time between system failures.
- System Uptime: Track and maintain high levels of system availability.
- Service Level Objectives (SLOs): Set and meet clear SLOs for reliability and performance.
- Service Level Indicators (SLIs): Define and monitor SLIs to ensure service quality.
Required Qualifications
- Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field.
- 6+ years of proven experience in networking and data centre operations, particularly with recent HPC architectures, NetDevOps workflows, NVIDIA Air, and GNS3 simulations.
- 3+ years of experience as a Site Reliability Engineer or in a similar role.
- Expertise in networking technologies: TCP/UDP, IPv4/IPv6, BGP/MP-BGP, VPN, L2 switching, EVPN, VXLAN, SHARP, Segment Routing, MPLS, IS-IS, DWDM.
- In-depth knowledge of network protocols such as RoCE, RDMA, and IBoE, and of network topologies such as Spine-Leaf, Link/Super Spine Switching, and Fat-Tree.
- Background in troubleshooting or testing server hardware/firmware, Linux OS, CLIs, and scripting.
- Excellent problem-solving and on-demand decision-making skills.
Desired Skills
- Certifications equivalent to CCIE, JNCIS, or InfiniBand NCP-IB.
- Experience with automated configuration management systems like Ansible and Terraform.
- Ability to handle high-pressure situations in HPC AI data centres.
- Strong collaboration skills with HPC solution architects and engineers.
This job description is not intended to be all-inclusive. Employees may perform other duties as assigned.