The Role:
We are seeking an experienced engineer in ML Training Infrastructure with a strong ability to execute hands-on technical work. In this role, you will be responsible for designing and building scalable, reliable, and high-performance AI/ML platform infrastructure to support advanced AI research and model development initiatives. As a Senior ML System Engineer, you will collaborate closely with machine learning engineers, research scientists, and other partners to develop state-of-the-art AI solutions that enable the future of intelligent driving technologies across General Motors vehicles.
What You’ll Do:
- Participate in the design and development of a scalable, reliable, high-performance ML framework to support model training at scale.
- Participate in model training performance analysis and optimization to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost.
- Raise the bar on system observability, debuggability, operational excellence, and user experience.
- Collaborate with cross-functional teams to integrate new features and technologies into the platform.
Your Skills & Abilities (Required Qualifications)
- Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent experience
- 5+ years of professional software engineering experience
- 2+ years of specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
- Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
- Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure).
- Willingness to travel to Sunnyvale, CA as needed
- Comfortable working in highly ambiguous and dynamic environments
What Will Give You a Competitive Edge (preferred qualifications):
- Self-motivated, with strong execution and a focus on delivering impact
- Extensive knowledge of and experience with PyTorch 2.x+ and distributed training frameworks
- Experience designing and developing training frameworks that support FSDP, pipeline parallelism, and other scalable approaches to training large foundation models
- Experience with profiling, analyzing, debugging, and optimizing training and data-loading performance
- Experience with Apache Parquet, Apache Arrow, Ray, Ray Data
- Strong programming skills in C++
- Excellent communication skills to resolve disagreements, build consensus, communicate risks, and give constructive feedback
Compensation: The compensation information is a good faith estimate only. It is based on what a successful applicant might be paid in accordance with applicable state laws. The compensation may not be representative for positions located outside of the California Bay Area.
- The salary range for this role is $134,000 to $235,900. The actual base salary a successful candidate will be offered within this range will vary based on factors relevant to the position.
- Bonus Potential: An incentive pay program offers payouts based on company performance, job level, and individual performance.
Relocation: This job may be eligible for relocation benefits.
Benefits:
- GM offers a variety of health and wellbeing benefit programs. Benefit options include medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts and more.
Remote: This role is based remotely, but if you live within a 50-mile radius of Mountain View, Detroit, Warren, or Milford, you are expected to report to that location at least three times a week.