HPC Site Reliability Engineer

asobbi

Town of Texas (WI)

On-site

USD 120,000 - 160,000

Full time

7 days ago

Job summary

A leading provider of HPC and advanced technology solutions is seeking a Site Reliability Engineer. This role involves managing high-performance computing environments, ensuring system reliability, and collaborating with teams on innovations to improve user experience. Candidates should hold relevant degrees and have extensive experience in networking and HPC architectures.

Qualifications

6+ years of proven experience in networking and data centre operations.
3+ years of experience as a Site Reliability Engineer or in a similar role.
Knowledge of network protocols.

Responsibilities

Maintain and optimise HPC infrastructure ensuring reliability and performance.
Write, execute, and debug Ansible Playbooks for automation.
Monitor system health and performance with observability tools.

Skills

Networking technologies

Problem-solving

Decision-making

Education

Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering, or a related field

Tools

Ansible

Terraform

This range is provided by asobbi. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

$120,000.00/yr - $160,000.00/yr

Company Overview: Our client is a leading provider of HPC and advanced technology solutions, specialising in AI infrastructure. They offer customisable cloud solutions designed to support AI teams at every stage of their projects.

Job Purpose

As an HPC SRE, you will manage, optimise, and ensure the reliability of high-performance computing environments. You will be the technical expert for the HPC infrastructure, covering system architecture, optimisation, integrations, and networking. Collaborating with cross-functional teams, you will drive innovations that align with business goals and enhance user experiences. The role includes 24/7 support through on-call rotations to maintain high availability and performance of HPC systems.

Key Responsibilities

Infrastructure Management

Maintain and optimise HPC infrastructure, ensuring the reliability and performance of NVIDIA-based systems.
Set up HPC clusters on DGX or HGX platforms with GPUDirect, and optimise the network for server-to-storage or storage-to-storage connectivity, including multi-cloud and WAN HPC interconnectivity.
Configure, troubleshoot, and quickly resolve issues with routing and switching (R&S) hardware from vendors such as Cisco and Juniper.

Automation and Efficiency

Write, execute, and debug Ansible Playbooks for Cumulus Linux automation.
Utilise and maintain automated configuration management systems such as Ansible and Terraform.
Lead investigations into high-priority incidents, identify solutions, and prepare Root Cause Analysis (RCA).
Proactively monitor data centre health checks, licensing, and life-cycle management upgrades.
Provide 24/7 support through on-call rotations, ensuring continuous availability and rapid incident response.

Monitoring and Observability

Use observability tools such as Grafana Cloud, ELK, NVIDIA UFM, and NetQ, along with QoS metrics, to monitor system health and performance.
Develop and implement monitoring strategies to ensure high availability and performance of HPC systems.

Collaboration and Communication

Collaborate with HPC solution architects and engineers to drive innovation and optimisation.
Provide regular reports on P1/P2 incidents, RCAs, life-cycle upgrades, and change/incident management actions to senior management.
Maintain comprehensive documentation of infrastructure audits and policy changes.

Key Objectives and Goals

Reliability: Achieve and maintain high availability and uptime for HPC systems.

Performance: Continuously optimise the performance of NVIDIA-based and other HPC systems.

Scalability: Develop scalable HPC solutions to support ongoing business growth.

Automation: Increase the level of automation to enhance efficiency and reduce manual tasks.

Continuous Availability: Ensure 24/7 support through effective coverage and on-call practices.

Collaboration: Foster a collaborative environment within the SRE teams and with other departments.

Continuous Improvement: Promote a culture of ongoing learning and improvement.

Required Qualifications

Bachelor’s or Master’s degree in Telecommunications, Computer Science, Electrical and Computer Engineering (ECE), or related field.
6+ years of proven experience in networking and data centre operations, particularly with recent HPC architectures, NetDevOps workflows, NVIDIA Air, and GNS3 simulations.
3+ years of experience as a Site Reliability Engineer or in a similar role.
Expertise in networking technologies.
Knowledge of network protocols.
Background in troubleshooting or testing server hardware/firmware, Linux OS, CLIs, and scripting.
Excellent problem-solving and on-demand decision-making skills.

Desired Skills

Experience with automated configuration management systems like Ansible and Terraform.
Ability to handle high-pressure situations in HPC AI data centres.
Strong collaboration skills with HPC solution architects and engineers.

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Information Technology and Engineering

Industries

IT Services and IT Consulting
