AI/HPC Systems Production Engineer

Meta

London

On-site

GBP 50,000 - 90,000

Full time

30+ days ago

Job summary

An innovative firm is seeking a skilled engineer to enhance its AI training and inference infrastructure. This role involves building and evolving network systems that connect training accelerators, ensuring they meet performance and reliability standards. As part of a multi-disciplinary team, you'll tackle challenges related to large-scale training systems, working closely with performance engineers to implement robust solutions. If you have a passion for cutting-edge technology and a background in networking and system software, this opportunity offers a chance to contribute to groundbreaking advancements in AI and immersive experiences.

Qualifications

  • 4+ years of experience in relevant fields with a strong foundation in Linux and networking.
  • Knowledge of RDMA technologies and experience with GPU frameworks is a plus.

Responsibilities

  • Ensure overall reliability of the communication system through monitoring and troubleshooting.
  • Develop and maintain CI/CD pipelines for training stack infrastructure.

Skills

Linux
Networking Principles
CI/CD
Communication Libraries
Troubleshooting

Education

BS in Electrical Engineering
MS in Computer Science
PhD in relevant fields

Tools

NVIDIA Collective Communication Library (NCCL)
CUDA
OpenCL
MPI

Job description

Meta's AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. We need to build and evolve the network infrastructure that connects myriad training accelerators, such as GPUs. In addition, we need to ensure that the network runs smoothly and meets the stringent performance, availability and reliability requirements of RDMA workloads, which expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries and scheduling infrastructure.

Responsibilities:

  1. Responsible for the overall reliability of the communication system, including monitoring, troubleshooting and proactive identification of production issues.
  2. Develop, extend and maintain CI/CD and testing pipelines for host components of the training stack infrastructure, e.g. collective communication libraries (NCCL, RCCL) and RDMA host stack dependencies.
  3. Act as an active member of a multi-disciplinary team developing solutions for large-scale training systems. Work with performance engineers to ensure the safe and robust rollout of new features.

Minimum Qualifications:

  1. BS/MS/PhD in a relevant field (EE, CS), with 4+ years of work experience.
  2. Knowledge of Linux and foundational networking principles.

Preferred Qualifications:

  1. Experience working with up-to-date AI training workload packaging, CI/CD and distribution processes, and containerization principles.
  2. Understanding of RDMA network stack principles and pain points on InfiniBand and RoCE networks. Experience developing systems and applications that use RDMA technologies.
  3. Experience using communication libraries such as MPI and the NVIDIA Collective Communication Library (NCCL).
  4. Experience with GPU accelerator development frameworks, for example CUDA, OpenCL.
  5. Experience in developing and troubleshooting system level software.

About Meta:

Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today—beyond the constraints of screens, the limits of distance, and even the rules of physics.
