Senior Machine Learning Infrastructure Engineer

PlusAI Inc

Santa Clara (CA)

On-site

USD 160,000 - 200,000

Full time

27 days ago

Job summary

A leading company is seeking a Senior ML Infrastructure Engineer to design scalable architectures for handling petabytes of data. The role involves building robust pipelines and managing large-scale GPU clusters, offering significant technical and professional growth opportunities. Ideal candidates will thrive in environments leveraging modern cloud-native technologies and will be responsible for ensuring high availability and reliability of the ML platform.

Qualifications

3+ years of software engineering experience focusing on ML infrastructure.
Proficiency in at least one deep learning framework.

Responsibilities

Design and develop scalable systems for ML models.
Build and maintain data pipelines and versioning systems.
Collaborate with teams to improve platform usability.

Skills

Communication

Python

C++

Education

PhD in Computer Science

MS in Electrical Engineering

Tools

Docker

Kubernetes

PyTorch

TensorFlow

Apache Airflow

Kubeflow

As a Senior ML Infrastructure Engineer at Plus, you will design scalable architectures capable of handling petabytes of data while maintaining performance for both training and inference. You will build robust pipelines for managing model versioning and experiment tracking, which are essential for reproducibility across experiments. You will also manage large-scale GPU clusters. The role offers technical and professional growth for engineers passionate about solving challenging problems with modern cloud-native technologies. Ideal candidates thrive in environments built on Docker containers orchestrated with Kubernetes and integrated with deep learning frameworks such as PyTorch or TensorFlow. If you are eager to push the boundaries of machine learning infrastructure and contribute to cutting-edge solutions, this position is an excellent fit!

Responsibilities:

Design and develop scalable, high-performance systems for ML model training, inference, deployment, and monitoring.
Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks.
Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability.
Implement distributed systems and storage solutions optimized for machine learning workloads.
Drive improvements in CI/CD workflows for ML models and infrastructure.
Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems.
Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform.
Mentor junior engineers and contribute to a culture of technical excellence.

Required Skills:

PhD or MS in Computer Science, Electrical Engineering, or related field.
Good oral and written communication skills.
New PhD graduate, or Master's degree with 3+ years of software engineering experience focusing on ML infrastructure or distributed systems.
Proficiency in Python, C++, and SQL.
Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, MLflow).
Experience deploying and managing resources across multiple environments (AWS, GCP, or on-prem).
Proficiency in at least one deep learning framework, such as PyTorch, and data pipeline tools (e.g., Apache Airflow, Prefect).
Strong knowledge of distributed systems, databases, and storage solutions.
Extensive software design and development skills.
Ability to learn and adapt to new technologies and contribute productively.

Preferred Skills:

Familiarity with deep learning architectures like CNNs and Transformers.
Experience building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray.
Experience working with autonomous vehicles or robotics.
Commitment to adhering to the company's QMS requirements and contributing to continuous improvement.
Commitment to ensuring work meets customer requirements, regulatory standards, and company quality policies.

Salary Range:

$160,000 - $200,000 a year.

Our compensation package (cash and equity) is determined based on the position, location, qualifications, and experience.
