Staff Software Engineer, ML Training Platform

Reddit, Inc.

United States

Remote

USD 230,000 - 322,000

Full time

2 days ago

Job summary

A leading tech company is seeking a Staff Software Engineer for their Machine Learning Training Platform team. This pivotal role involves architecting and maintaining ML infrastructure that drives content recommendations and user engagement. With a focus on GPU optimization and high-performance solutions, the candidate will influence massive ML operations and drive advancements that enhance user experiences.

Benefits

Comprehensive Healthcare Benefits
401k Match
Family Planning Support
Mental Health & Coaching Benefits
Flexible Vacation
Paid Volunteer time off

Qualifications

  • 8+ years of experience in software development or data systems.
  • Experience with distributed training and optimization methods.
  • Hands-on experience with ML training workflows.

Responsibilities

  • Lead the design and maintenance of ML infrastructure.
  • Optimize systems for performance and cost-efficiency.
  • Mentor team members in DevOps practices.

Skills

GPU optimization
Software development
MLOps practices
Performance optimization
Container orchestration

Tools

Kubernetes
Docker
Ray
MLflow
DeepSpeed

Job description

Reddit is a community of communities. It’s built on shared interests, passion, and trust and is home to the most open and authentic conversations on the internet. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and 101M+ daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit redditinc.com.

Location:
This role is completely remote-friendly. If you happen to live close to one of our physical office locations, our doors are open for you to come into the office as often as you'd like.

Who We Are: The Machine Learning Platform team at Reddit is a high-impact team that owns the infrastructure powering recommendations, content discovery, and user and content quantification, while directly impacting other teams such as Growth, Ads, Feeds, and Core Machine Learning.

What You’ll Do: As a Staff Software Engineer, ML Training Platform, you will be instrumental in architecting, implementing, and maintaining foundational Machine Learning Training infrastructure that powers Feeds Ranking, Content Understanding, Recommendations, and much more to fulfill Reddit’s mission of bringing community and belonging to everyone in the world. You will oversee GPU optimization in AI/ML batch workloads and debug performance bottlenecks in GPU workloads. You will build and own systems and tools that enable MLEs and data scientists, and continuously improve the ML software development lifecycle.

  • Lead the building, testing, and maintenance of ML infrastructure at Reddit
  • Propose, design, and implement high-performance ML platform solutions that significantly advance the deployment of models serving millions of redditors and provide a seamless experience for MLEs
  • Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
  • Analyze bottlenecks in distributed systems and optimize for performance and cost-efficiency
  • Work with management on team goal setting, planning, and de-risking project execution
  • Mentor other team members in adopting a rigorous DevOps approach to maintain and/or improve the health and quality of ML platform components and services

Who You Might Be:

  • 8+ years of work experience in a production software development environment or building data systems
  • Experience with XLA for TensorFlow or TorchInductor for PyTorch for kernel fusion during training
  • Experience with optimization of data workloads using Colossal-AI or DeepSpeed
  • Experience with distributed training optimization using DeepSpeed, Horovod, or Colossal-AI
  • Experience with design and architecture of large scale ML Systems
  • Experience with training workflows, hyperparameter tuning, and resource optimization on CPU and GPU
  • Experience with MLOps practices and tools such as Ray and MLflow
  • Hands-on experience with Kubernetes, Docker, or other container orchestration systems

Benefits:

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k Match
  • Family Planning Support
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Reddit Global Days off
  • Paid Volunteer time off

Pay Transparency:

This job posting may span more than one career level.

In addition to base salary, this job is eligible to receive equity in the form of restricted stock units, and depending on the position offered, it may also be eligible to receive a commission. Additionally, Reddit offers a wide range of benefits to U.S.-based employees, including medical, dental, and vision insurance, 401(k) program with employer match, generous time off for vacation, and parental leave. To learn more, please visit https://www.redditinc.com/careers/.

To provide greater transparency to candidates, we share base pay ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar stage growth companies. Final offer amounts are determined by multiple factors, including skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The base pay range for this position is:

$230,000 - $322,000 USD

Reddit is proud to be an equal opportunity employer, and is committed to building a workforce representative of the diverse communities we serve. Reddit is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If, due to a disability, you need an accommodation during the interview process, please let your recruiter know.
