[PhD] ML Infrastructure Engineer - Distributed Training, AWS Neuron, Annapurna Labs

Amazon Web Services (AWS)

Seattle (WA)

On-site

USD 120,000 - 180,000

Full time

24 days ago

Job summary

Amazon Web Services (AWS) is seeking a PhD ML Infrastructure Engineer for the Distributed Training team within Annapurna Labs. The role involves developing and optimizing machine learning models on AWS's specialized AI hardware. The ideal candidate holds a PhD in a relevant field, is proficient in C++ and Python, and has experience with ML frameworks such as PyTorch and JAX. Join us to work on cutting-edge technologies that help shape the future of machine learning.

Qualifications

  • PhD earned or expected between December 2023 and September 2025.
  • Proficiency in C++ and Python.
  • Experience with ML frameworks, especially PyTorch and/or JAX.

Responsibilities

  • Develop and improve distributed training capabilities in frameworks like PyTorch and JAX.
  • Work with compiler and runtime teams to optimize ML models for AWS's custom chips.
  • Bridge the gap between ML frameworks and hardware acceleration.

Skills

C++
Python
Parallel Computing
CUDA Programming
Distributed Systems

Education

PhD in relevant field

Tools

PyTorch
JAX
vLLM
TensorRT

Job description


By applying to this position, your application will be considered for all locations we hire for in the United States.

About Annapurna Labs

Annapurna Labs designs silicon and software that accelerate innovation. Customers choose us to build cloud solutions to problems that were unimaginable a short time ago. Our custom chips, accelerators, and software stacks let us take on technical challenges never seen before, delivering results that help our customers change the world.

About AWS Neuron

AWS Neuron is the complete software stack for the AWS Trainium and Inferentia cloud-scale Machine Learning accelerators. This role is for a Senior Machine Learning Engineer in the Distributed Training team for AWS Neuron, responsible for development, enablement, and performance tuning of various ML model families, including large-scale LLMs like GPT and Llama, as well as Stable Diffusion, Vision Transformers, and more.

Responsibilities

  • Develop and improve distributed training capabilities in frameworks like PyTorch and JAX on AWS's specialized AI hardware.
  • Work with compiler and runtime teams to optimize ML models for AWS's custom chips (Trainium and Inferentia).
  • Bridge the gap between ML frameworks and hardware acceleration, building strong foundations in distributed systems.
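The distributed-training work described above centers on data parallelism: each worker computes gradients on its own shard of a batch, and the gradients are averaged (an all-reduce) before every replica applies the same optimizer step. A minimal, framework-agnostic sketch of that idea in plain NumPy (this is illustrative only, not AWS Neuron's actual API; all names here are hypothetical):

```python
import numpy as np

def local_gradient(w, x_shard, y_shard):
    # Gradient of mean-squared error for a scalar linear model y = w * x,
    # computed on this worker's shard of the batch.
    pred = x_shard * w
    return np.mean(2.0 * (pred - y_shard) * x_shard)

def all_reduce_mean(grads):
    # Stand-in for a collective all-reduce: average gradients across
    # workers so every replica applies an identical update.
    return sum(grads) / len(grads)

def data_parallel_step(w, x, y, num_workers=4, lr=0.01):
    # Shard the batch across workers, compute per-worker gradients,
    # average them, and apply one synchronized SGD update.
    x_shards = np.array_split(x, num_workers)
    y_shards = np.array_split(y, num_workers)
    grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]
    return w - lr * all_reduce_mean(grads)
```

With equal-size shards, the averaged gradient equals the full-batch gradient, which is why synchronous data-parallel training converges like single-device training; the engineering work is making the all-reduce and sharding fast on the target hardware.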

Qualifications

Basic:

  • PhD earned or expected between December 2023 and September 2025.
  • Proficiency in C++ and Python.
  • Experience with ML frameworks, especially PyTorch and/or JAX.
  • Understanding of parallel computing and CUDA programming.

Preferred:

  • Open source contributions or research publications in ML frameworks, tools, compilers, or distributed computing.
  • Experience optimizing ML workloads for performance.
  • Experience with PyTorch internals or CUDA optimization.
  • Hands-on experience with LLM infrastructure tools like vLLM, TensorRT.

Amazon is an equal opportunity employer and values diversity. We welcome applicants from all backgrounds and experiences.
