High-Performance Computing (HPC) Specialist - AI Training Infrastructure

Meta
London
GBP 60,000 - 100,000
Job description

We are seeking experienced and passionate High-Performance Computing (HPC) Specialists to join our AI Training Infrastructure team. In this role, you will design, optimize, and manage cutting-edge AI training environments for large-scale machine learning models. You will collaborate with a multidisciplinary team to ensure seamless integration and scalability across heterogeneous hardware platforms.

High-Performance Computing (HPC) Specialist - AI Training Infrastructure Responsibilities

  • Design and implement HPC solutions for large-scale AI/ML training workloads, ensuring high performance, scalability, and efficiency.
  • Optimize AI training pipelines and workflows to maximize utilization of GPUs and other specialized accelerators.
  • Analyze and troubleshoot hardware bottlenecks, network issues, and performance inefficiencies in large-scale AI training environments.
  • Collaborate with AI/ML researchers and data scientists to tailor HPC solutions that meet their specific model training requirements.
  • Develop monitoring and profiling systems to ensure efficient utilization of resources across heterogeneous systems.
  • Stay updated with advancements in HPC, AI/ML frameworks, and heterogeneous hardware technologies.
  • Contribute to documentation, best practices, and knowledge sharing within the team.

Minimum Qualifications

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
  • 3+ years of experience in HPC environments, particularly for AI/ML workloads.
  • Proficiency in parallel programming, distributed systems, and HPC-specific libraries (e.g., MPI, OpenMP, CUDA, ROCm).
  • Hands-on experience with at least one hardware platform (e.g., NVIDIA GPUs, AMD GPUs, TPUs, FPGAs, or custom ASICs).
  • Familiarity with PyTorch.
  • Requires understanding of networked storage solutions, interconnects (e.g., InfiniBand, NVLink), and high-speed networking.
  • Past experience in optimizing resource utilization in multi-node training environments.
  • Problem-solving, communication, and collaboration skills.
Get a free, confidential resume review.
Select file or drag and drop it
Avatar
Free online coaching
Improve your chances of getting that interview invitation!
Be the first to explore new High-Performance Computing (HPC) Specialist - AI Training Infrastructure jobs in London