Senior HPC Performance Engineer

NVIDIA

United States

Remote

USD 120,000 - 160,000

Full time

14 days ago

Job summary

NVIDIA is seeking a Performance Engineer to enhance communication libraries crucial for Deep Learning and HPC applications. The role involves performance analysis on multi-GPU systems, troubleshooting, and collaborating across teams. Ideal candidates have a strong background in HPC, parallel programming, and performance engineering, along with a passion for innovative technology.

Benefits

Highly competitive salaries
Extensive benefits
Diversity and inclusion initiatives
Flexible work environment

Qualifications

  • 3+ years of experience with parallel programming and communication runtimes.
  • Experience conducting performance benchmarking on large-scale HPC clusters.
  • Good understanding of computer architecture and operating systems.

Responsibilities

  • Conduct performance characterization and analysis on multi-GPU and multi-node clusters.
  • Triage and diagnose performance issues reported by customers.
  • Develop tools to visualize and analyze performance data.

Skills

Parallel programming
Performance benchmarking
Debugging performance issues
Scripting (Python)
Adaptability

Education

M.S. (or equivalent experience) or PhD in Computer Science or a related field

Tools

Kubernetes
SLURM
Ansible
Docker

Job description

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

Come work for the team that brought you NCCL, NVSHMEM, and GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications! We are looking for a motivated Performance Engineer to influence the roadmap of our communication libraries. Today's DL and HPC applications have huge compute demands and run at scales of up to tens of thousands of GPUs. These GPUs are connected via high-speed interconnects (e.g., NVLink, PCIe) within a node and high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between GPUs directly impacts application performance, especially at large scales. This is an outstanding opportunity for someone with HPC and performance expertise to advance the state of the art in this space. Are you ready to contribute to innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  1. Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  2. Study the interaction of our libraries with all hardware (GPU, CPU, Networking) and software components in the stack.
  3. Evaluate proof-of-concept solutions and conduct trade-off analyses when multiple options are available.
  4. Triage and diagnose performance issues reported by customers.
  5. Collect performance data; develop tools and infrastructure to visualize and analyze this information.
  6. Collaborate with a dynamic team across multiple time zones.

What we need to see:

  1. M.S. (or equivalent experience) or PhD in Computer Science or a related field, with relevant performance engineering and HPC experience.
  2. 3+ years of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM).
  3. Experience conducting performance benchmarking and troubleshooting on large-scale HPC clusters.
  4. Good understanding of computer system architecture, hardware-software interactions, and operating systems fundamentals.
  5. Ability to implement micro-benchmarks in C/C++ and modify code bases as needed.
  6. Proficiency in debugging performance issues across the entire hardware/software stack; proficiency in scripting languages, preferably Python.
  7. Familiarity with containers, cloud provisioning, and scheduling tools (Kubernetes, SLURM, Ansible, Docker).
  8. Adaptability and passion for learning new tools and areas; ability to work effectively across teams and time zones.

Ways to stand out from the crowd:

  1. Practical experience with InfiniBand/Ethernet networks, RDMA, topologies, and congestion control.
  2. Experience debugging network issues in large-scale deployments.
  3. Familiarity with CUDA programming and/or GPUs.
  4. Experience with Deep Learning frameworks like PyTorch and TensorFlow.

NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. Our teams are composed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, extensive benefits, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.

Similar jobs

Senior Performance Engineer

Veeva Systems

Bend

Remote

USD 120,000 - 220,000

Today
Be an early applicant

Senior Performance Engineer

Veeva Systems

Portland

Remote

USD 120,000 - 220,000

Today
Be an early applicant

Sr. Performance Engineer

Dayforce US, Inc.

Minnesota

Remote

USD 100,000 - 150,000

2 days ago
Be an early applicant

Software Engineer - Performance & Resilience

TechBrains

Remote

USD 90,000 - 130,000

Today
Be an early applicant

Senior Performance Engineer

Veeva Systems

Boston

Remote

USD 120,000 - 220,000

Today
Be an early applicant

AI/ML Application Performance Engineer

Cornelis Networks

Chesterbrook

Remote

USD 127,000 - 184,000

10 days ago

Senior Web Performance Engineer - Remote ($150-$250K)

CyberCoders

San Francisco

Remote

USD 150,000 - 250,000

21 days ago

Senior Performance Engineer – Load Testing & System Optimization (Remote)

Cognizant

St. Louis

Remote

USD 83,000 - 132,000

21 days ago