Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)

TrueFoundry

San Mateo (CA)

On-site

USD 167,000 - 251,000

Full time

8 days ago

Job summary

Join a leading company to work as a Staff ML Platform Engineer, focusing on scaling deep learning workloads and optimizing multi-GPU training. Collaborate with ex-Facebook engineers and contribute to an experimental culture. Enjoy flexible hours and learning opportunities.

Benefits

Flexible hours

Learning credits

Qualifications

5+ years of hands-on experience building ML systems at scale.
Deep experience with multi-GPU/multi-node training.

Responsibilities

Write clean, modular, and scalable Python code.
Build a platform for training and fine-tuning large-scale ML models.

Skills

Python

Kubernetes

PyTorch

ML Systems Engineering

Tools

Kubeflow

TensorRT

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we're redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes—with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You'll Work On

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We're Looking For

5+ years of hands-on experience building and deploying ML systems at scale.
5+ years of writing production-quality, high-performance code.
Deep experience with multi-GPU/multi-node training, ideally with PyTorch as your primary framework.
Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).
Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.
A pragmatic mindset—you know when to optimize and when to ship.
Bonus: Familiarity with open-source LLM training/fine-tuning.

Why Join TrueFoundry?

Work directly with ex-Facebook engineers, founders from IIT Kharagpur and UC Berkeley, and Y Combinator alumni.
First-hand exposure to building and scaling a deep-tech startup—insights you'll carry if you want to start your own one day.
Be part of a fearlessly experimental culture focused on customer success and long-term impact.
Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
