HPC Engineer - Research Infrastructure

The Rundown AI, Inc.

Palo Alto (CA)

On-site

USD 120,000 - 180,000

Full time

7 days ago

Job summary

A leading technology company seeks a High-Performance Computing engineer to design and optimize systems for AI supercomputing clusters. This role requires deep technical understanding of hardware, software, and networking to enhance performance and scalability. You'll work alongside machine learning researchers and have a direct influence on AI advancements.

Qualifications

  • 8+ years as an infrastructure or DevOps engineer working on complex systems.
  • Deep understanding of networking, especially HPC networking.
  • Experience with high-quality software development in Python preferred.

Responsibilities

  • Manage training HPC clusters from provisioning to performance tuning.
  • Work on observability, distributed job tracing, and GPU diagnostics.
  • Directly affect the company's ability to scale and achieve results in AI.

Skills

Networking
Problem-solving
Attention to detail
Distributed Systems

Job description

Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud environment.

For this role, it’s important that you understand how to combine CPUs, GPUs, and network devices into systems that are deployed at large scale and run at peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how best to optimize the Linux kernel and user-space code. You are capable of writing code to automate the monitoring and healing of these systems, so that a few people can command a large number of servers.

Responsibilities

  • In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.

  • We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve.

  • You’ll be managing training HPC clusters at Luma from provisioning to performance tuning.

  • Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, as well as work on the actual code to enable necessary features.

  • We believe that increasing compute is a huge lever for AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and, in turn, produce unprecedented results.

Experience

  • 8+ years of experience as an infrastructure or DevOps engineer working on large and complex distributed systems.

  • Deep understanding of networking; bonus points for experience in HPC networking.

  • Experience developing high-quality software in a general-purpose programming language, preferably including Python.

  • Excellent problem-solving skills and attention to detail.

  • Experience with GPUs in large-scale clusters is strongly preferred.

  • Strong knowledge of observability and monitoring in distributed systems.

  • Tenacious at troubleshooting hardware and network topology failures in distributed systems.

  • Independently driven and able to own problems and build solutions end to end.

  • Experience with large-scale data center operations and proficiency in cloud orchestration and system tools.
