Staff Machine Learning Infrastructure Engineer

Dyna Robotics

Redwood City (CA)

On-site

USD 120,000 - 180,000

Full time

30+ days ago

Job summary

Join a forward-thinking company at the cutting edge of robotics as a Machine Learning Infrastructure Engineer. In this exciting role, you'll design and maintain large-scale ML infrastructure, optimizing high-performance computing systems to accelerate model training. Collaborate with top-tier researchers and engineers to push the boundaries of robotic manipulation and AI-driven solutions. If you're passionate about building scalable systems and thrive in a dynamic environment, this opportunity is perfect for you. Help shape the future of intelligent robotics and make a significant impact in an innovative field.

Qualifications

  • 7+ years in software, with 2+ years in a tech lead role.
  • Experience with high-performance computing and distributed systems.
  • Hands-on experience with cloud GPU environments and job scheduling systems.

Responsibilities

  • Design and implement large-scale ML training pipelines.
  • Optimize distributed computing solutions for training efficiency.
  • Collaborate with ML researchers to enhance system performance.

Skills

High-performance computing
Distributed systems
Cloud GPU management
Job scheduling systems
ML model tuning
Analytical skills
Problem-solving skills
Communication skills

Education

Bachelor’s degree or higher in Computer Science or a related field

Tools

GCP
AWS
PyTorch
TensorRT
Triton
Accelerate
Kubernetes

Job description

Staff Machine Learning Infrastructure Engineer

Company Overview:

Dyna Robotics is at the forefront of revolutionizing robotic manipulation with cutting-edge foundation models. Our mission is to empower businesses by automating repetitive, stationary tasks with affordable, intelligent robotic arms. Leveraging the latest advancements in foundation models, we're driving the future of general-purpose robotics—one manipulation skill at a time.

Dyna Robotics was founded by industry leaders who previously achieved a $350 million exit in grocery deep tech, together with top robotics researchers from DeepMind and Nvidia. Our team blends world-class research, engineering, and product innovation. With more than $20 million in funding, we're positioned to redefine the landscape of robotic automation. Join us to shape the next frontier of AI-driven robotics.

Position Overview:

We are seeking an experienced Machine Learning Infrastructure Engineer to join our team and help scale our ML training platform. In this role, you will be responsible for designing, implementing, and maintaining large-scale ML infrastructure to accelerate model iteration and improve training performance across an expanding GPU ecosystem. You will work on cutting-edge high-performance computing systems, optimizing distributed training environments, and ensuring system reliability as we scale.

Key Responsibilities:

  • Infrastructure Design & Scalability:
      • Architect and implement large-scale ML training pipelines that leverage parallel GPU processing on platforms such as GCP or AWS.
      • Enhance our existing infrastructure to fully exploit parallelism, and design it for future expansion so that the system is ready to support growth.
  • High-Performance ML Computing & Distributed Systems:
      • Manage and optimize high-performance computing resources.
      • Develop robust distributed computing solutions, addressing challenges such as race conditions, memory optimization, and resource allocation.
      • Optimize model training with techniques such as mixed precision, ZeRO, and LoRA.
  • Job Scheduling & Reliability:
      • Design systems for job rescheduling, automated retries, and failure recovery to maximize uptime and training efficiency.
      • Implement intelligent job queuing mechanisms to optimize training workloads and resource utilization.
  • Evaluate the tradeoffs between local and networked storage solutions, and implement the option that best improves data throughput and access.
  • Develop strategies for caching training data to optimize performance.
  • Work closely with ML researchers and data scientists to understand training requirements and bottlenecks.
  • Continuously monitor system performance, identify areas for improvement, and implement best practices to enhance scalability and reliability.

Required Qualifications:

  • Bachelor’s degree or higher in Computer Science or a related field.
  • At least 7 years of professional experience in the software industry, with a minimum of 2 years in a tech lead role.
  • Proven experience with high-performance computing environments and distributed systems.
  • Demonstrated ability to scale ML training systems and optimize resource utilization.
  • Hands-on experience with job scheduling systems and managing cloud GPU environments (GCP, AWS, etc.).
  • Deep understanding of distributed computing concepts, including race conditions, memory optimization, and parallel processing.
  • Hands-on experience in ML model tuning for performance.
  • Experience with common ML training and inference tools, such as PyTorch, TensorRT, Triton, and Accelerate.
  • Strong analytical and problem-solving skills with the ability to troubleshoot complex system issues.
  • Excellent communication skills to collaborate effectively with cross-functional teams.

Preferred Qualifications:

  • Experience with container orchestration tools (e.g., Kubernetes) and infrastructure-as-code frameworks.

If you're passionate about building scalable ML systems and optimizing high-performance computing infrastructures, we'd love to hear from you.
