High-Performance Computing (HPC) Specialist - AI Training Infrastructure
Meta
London
GBP 60,000 - 100,000
Job description
We are seeking experienced and passionate High-Performance Computing (HPC) Specialists to join our AI Training Infrastructure team. In this role, you will design, optimize, and manage cutting-edge AI training environments for large-scale machine learning models. You will collaborate with a multidisciplinary team to ensure seamless integration and scalability across heterogeneous hardware platforms.
High-Performance Computing (HPC) Specialist - AI Training Infrastructure Responsibilities
Design and implement HPC solutions for large-scale AI/ML training workloads, ensuring high performance, scalability, and efficiency.
Optimize AI training pipelines and workflows to maximize utilization of GPUs and other specialized accelerators.
Analyze and troubleshoot hardware bottlenecks, network issues, and performance inefficiencies in large-scale AI training environments.
Collaborate with AI/ML researchers and data scientists to tailor HPC solutions that meet their specific model training requirements.
Develop monitoring and profiling systems to ensure efficient utilization of resources across heterogeneous systems.
Stay up to date with advancements in HPC, AI/ML frameworks, and heterogeneous hardware technologies.
Contribute to documentation, best practices, and knowledge sharing within the team.
Minimum Qualifications
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
3+ years of experience in HPC environments, particularly for AI/ML workloads.
Proficiency in parallel programming, distributed systems, and HPC-specific libraries (e.g., MPI, OpenMP, CUDA, ROCm).
Hands-on experience with at least one hardware platform (e.g., NVIDIA GPUs, AMD GPUs, TPUs, FPGAs, or custom ASICs).
Familiarity with PyTorch.
Understanding of networked storage solutions, interconnects (e.g., InfiniBand, NVLink), and high-speed networking.
Experience optimizing resource utilization in multi-node training environments.
Problem-solving, communication, and collaboration skills.