Software Engineer

Meta

City Of London

On-site

GBP 70,000 - 90,000

Full time

29 days ago

Job summary

A leading internet company based in London is seeking an experienced Software Engineer to optimize AI models on custom accelerators. This role involves enhancing training throughput and improving inference efficiency through innovative optimization methods. Candidates should have strong experience in C/C++ and AI frameworks. If you have a passion for developing cutting-edge AI solutions, apply now.

Qualifications

  • 6+ years of experience developing and optimizing performance in modern C/C++.
  • Experience programming AI accelerators using frameworks such as PyTorch.
  • Experience developing custom kernels to improve performance.

Responsibilities

  • Co-design models to maximize efficiency in pre-training and inference.
  • Drive state-of-the-art optimization techniques on Meta’s AI workloads.
  • Optimize large-scale workloads on training superclusters.

Skills

Programming AI accelerators
C/C++ optimization
Debugging large-scale workloads
Managing model performance

Education

Bachelor’s degree in computer science or related STEM field

Tools

CUDA
PyTorch

Job description

Summary

Meta is seeking a Software Engineer to join our team. The ideal candidate is someone with experience working on maximizing performance of AI models on GPUs or custom silicon. This role involves applying these skills to solve some of the most crucial and exciting problems that exist on the web. The AI Applications Engineering team is dedicated to maximizing training and inference performance of Generative AI (GenAI) and Recommendation models on Meta's Training and Inference Accelerator (MTIA). We employ innovative optimization and parallelization strategies to maximize training throughput for the next generations of GenAI and recommendation models. Additionally, we work cross-functionally with many partner teams to ensure end-to-end performance of large-scale pre-training and inference, enabling us to deliver the next generation of AI experiences more quickly to our users.

Required Skills

Software Engineer Responsibilities:

  1. Work cross-functionally to co-design models to maximize pre-training and inference efficiency

  2. Apply and drive state-of-the-art optimization techniques on our latest large-scale AI workloads running on Meta’s fleet of accelerators, including functional development and maintenance

  3. Profile, analyze, debug, and optimize large-scale workloads on our next-generation training superclusters

  4. Optimize the underlying processes across the whole vertical stack, from kernels, framework, communication, and firmware to layers and hyperparameters

  5. Set direction and goals for the team related to project impact, capacity, and developer efficiency

  6. Lead large and complex technical efforts across many engineers and teams from zero to one

Minimum Qualifications
  1. Bachelor’s degree in computer science or a related STEM field

  2. Experience programming AI accelerators (e.g., GPUs or custom silicon) using AI frameworks such as PyTorch

  3. Experience developing custom kernels and compiler infrastructure to improve performance using low-level programming models such as CUDA or OpenCL

  4. 6+ years of experience developing and optimizing performance in modern C/C++

  5. Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment

Preferred Qualifications
  1. Experience with training and validating large-scale AI models, including parallelising models across several accelerators

  2. Understanding of multiprocessing, including race conditions and inter-process communication

  3. Experience of evaluating model performance, e.g., with profilers and tuning hyperparameters

  4. Thorough understanding of model and data parallelism strategies, such as FSDP, tensor parallelism, model parallelism, and expert parallelism

  5. Demonstrated experience with the full model life cycle, from pre-training and post-training to inference, including dataset splits and shuffling and evaluation metrics, especially for large language models

  6. Experience of developing, optimizing and validating kernels on GPUs or other accelerators

Industry: Internet
