Aktiviere Job-Benachrichtigungen per E-Mail!

Senior HPC Performance Engineer

NVIDIA Corporation

Deutschland

Remote

EUR 70.000 - 100.000

Vollzeit

Gestern
Sei unter den ersten Bewerbenden

Zusammenfassung

A leading tech firm in Germany is looking for a Senior HPC Performance Engineer to advance communication libraries for deep learning and HPC applications. The role involves performance analysis, benchmarking, and collaboration across various teams. Ideal candidates will have an M.S. or PhD in Computer Science and substantial experience in performance engineering and parallel programming, especially with HPC clusters. A competitive salary and diverse benefits are offered.

Leistungen

Highly competitive salaries
Extensive benefits package
Diversity and inclusion programs

Qualifikationen

  • 3+ years of experience with parallel programming and communication runtimes.
  • Experience conducting performance benchmarking on large HPC clusters.
  • Good understanding of computer system architecture and OS principles.

Aufgaben

  • Conduct in-depth performance characterization on multi-GPU clusters.
  • Evaluate and analyze performance data to identify issues.
  • Collaborate effectively across different teams and time zones.

Kenntnisse

Parallel programming
Performance engineering
Scripting (Python)
Debugging
Adaptability

Ausbildung

M.S. or PhD in Computer Science or related field

Tools

MPI
CUDA
Kubernetes
Docker
SLURM
Ansible

Jobbeschreibung

Senior HPC Performance Engineer page is loaded

Senior HPC Performance Engineer
Apply locations Germany, Remote Poland, Remote UK, Remote Switzerland, Remote time type Full time posted on Posted Yesterday job requisition id JR2001201

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars.

Come work for the team that brought to you NCCL, NVSHMEM & GPUDirect. Our GPU communication libraries are crucial for scaling Deep Learning and HPC applications!We are looking for a motivated Performance engineer to influence the roadmap of our communication libraries. The DL and HPC applications of today have a huge compute demand and run on scales which go up to tens of thousands of GPUs. The GPUs are connected with high-speed interconnects (eg. NVLink, PCIe) within a node and with high-speed networking (eg. Infiniband, Ethernet) across the nodes. Communication performance between the GPUs has a direct impact on the end-to-end application performance; and the stakes are even higher at huge scales! This is an outstanding opportunity for someone with HPC and performance background to advance the state of the art in this space. Are you ready for to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.

  • Study the interaction of our libraries with all HW (GPU, CPU, Networking) and SW components in the stack

  • Evaluate proof-of-concepts, conduct trade-off analysis when multiple solutions are available

  • Triage and root-cause performance issues reported by our customers

  • Collect a lot of performance data; build tools and infrastructure to visualize and analyze the information

  • Collaborate with a very dynamic team across multiple time zones

What we need to see:
  • M.S. (or equivalent experience) or PHD in Computer Science, or related field with relevant performance engineering and HPC experience

  • 3+ yrs of experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)

  • Experience conducting performance benchmarking and triage on large scale HPC clusters

  • Good understanding of computer system architecture, HW-SW interactions and operating systems principles (aka systems software fundamentals)

  • Implement micro-benchmarks in C/C++, read and modify the code base when required

  • Ability to debug performance issues across the entire HW/SW stack. Proficient in a scripting language, preferably Python

  • Familiar with containers, cloud provisioning and scheduling tools (Kubernetes, SLURM, Ansible, Docker)

  • Adaptability and passion to learn new areas and tools. Flexibility to work and communicate effectively across different teams and timezones

Ways to stand out from the crowd:
  • Practical experience with Infiniband/Ethernet networks in areas like RDMA, topologies, congestion control

  • Experience debugging network issues in large scale deployments

  • Familiarity with CUDA programming and/or GPUs

  • Experience with Deep Learning Frameworks such PyTorch, TensorFlow

NVIDIA is at the forefront of breakthroughs in Artificial Intelligence, High-Performance Computing, and Visualization. Our teams arecomposed of driven, innovative professionals dedicated to pushing the boundaries of technology. We offer highly competitive salaries, an extensive benefits package, and a work environment that promotes diversity, inclusion, and flexibility. As an equal opportunity employer, we are committed to fostering a supportive and empowering workplace for all.

Similar Jobs (2)
Senior HPC DevOps Engineer
locations 6 Locations time type Full time posted on Posted 30+ Days Ago
Senior HPC AI Cluster Engineer
locations 5 Locations time type Full time posted on Posted 2 Days Ago

NVIDIA is the world leader in accelerated computing.

NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society.

Hol dir deinen kostenlosen, vertraulichen Lebenslauf-Check.
eine PDF-, DOC-, DOCX-, ODT- oder PAGES-Datei bis zu 5 MB per Drag & Drop ablegen.