Software Engineer, ML Infrastructure - Training Platform

Scale AI, Inc.

San Francisco, CA

On-site

USD 160,000 - 225,600

Full time

30+ days ago

Job summary

As an AI/ML Infrastructure Engineer on Scale's Training Platform, you will work with machine learning researchers to increase experimentation throughput. The role involves designing and building cost-effective APIs for model training, managing projects end to end, and taking part in on-call to keep services available. It calls for strong machine learning fundamentals and backend system design skills.

Benefits

Health Insurance
Dental Insurance
Vision Insurance
Retirement Plan
Learning Stipends
Generous PTO
Commuter Stipends

Qualifications

  • 4+ years of experience with ML training pipelines or inference services.
  • Experience with distributed training techniques and microservice architectures.

Responsibilities

  • Build APIs for model training that are highly available and performant.
  • Manage projects from requirements to implementation in a collaborative environment.

Skills

Machine Learning Fundamentals
Backend System Design
ML Infrastructure
Python
Docker
Kubernetes
Infrastructure as Code (Terraform)

Tools

DeepSpeed
FSDP
AWS
GCP

Job description

Scale is seeking an AI/ML Infrastructure Engineer to join our Machine Learning Infrastructure team to develop our Training Platform. In this role, you will collaborate closely with Machine Learning researchers to understand their needs and leverage your expertise and our compute resources to enhance experimentation throughput.


The ideal candidate should possess strong fundamentals in machine learning, backend system design, and prior experience in ML Infrastructure. Comfort with infrastructure, large-scale system design, and diagnosing model performance and system failures is essential.


You will:
  • Build highly available, observable, performant, and cost-effective APIs for model training.
  • Participate in our on-call process to ensure service availability.
  • Manage projects end-to-end, from requirements and scoping to design and implementation, within a collaborative, cross-functional environment.
  • Exercise good judgment in system and tool building, balancing build vs. buy decisions with cost considerations.

Ideally you'd have:
  • 4+ years of experience with machine learning training pipelines or inference services in production.
  • Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
  • Experience developing, deploying, and monitoring complex microservice architectures.
  • Proficiency in Python, Docker, Kubernetes, and Infrastructure as Code (e.g., Terraform).

Nice to haves:
  • Experience with LLM inference latency optimization techniques like kernel fusion, quantization, dynamic batching, etc.
  • Experience working with cloud platforms such as AWS or GCP.

Compensation packages include base salary, equity, and benefits. The salary range varies by location and other factors. Benefits include health, dental, vision, retirement, learning stipends, and generous PTO. Additional benefits may include commuter stipends.


Location-specific salary range in San Francisco, New York, Seattle: $160,000 - $225,600 USD.


Note: Our policy requires a 90-day waiting period before reconsidering candidates for the same role.


About Us:

At Scale, we aim to accelerate the transition to AI across industries. Our products power advanced LLMs, generative models, and computer vision models, trusted by leading AI companies and organizations worldwide. We promote an inclusive workplace and are committed to equal opportunity employment. For accommodations during the application process, contact accommodations@scale.com.


We adhere to the US Department of Labor's Pay Transparency and privacy policies. Personal data collected is used solely for employment-related purposes and managed according to our privacy policy.
