Sr Machine Learning Engineer - Infinia AI Performance

DDN

United States

Remote

USD 90,000 - 150,000

Full time

30+ days ago

Job summary

An established industry player is seeking a Senior Machine Learning Engineer to lead the deployment of AI/ML training pipelines. This role involves designing, optimizing, and operationalizing advanced AI applications using cutting-edge tools like Apache Spark and MLflow. Collaborating with data scientists and engineers, you'll ensure robust and efficient model deployment while maintaining best practices in CI/CD. Join a dynamic team committed to innovation and excellence, where your contributions will significantly impact the future of AI and data management. If you're passionate about AI and eager to make a difference, this opportunity is perfect for you.

Benefits

Flexible working hours
Health insurance
Professional development opportunities
Remote work options
Team-building activities

Qualifications

  • 7+ years of experience in MLOps or related roles.
  • Extensive experience with Apache Spark and MLflow.

Responsibilities

  • Design and deploy large-scale AI/ML training pipelines.
  • Integrate MLflow for tracking and managing ML experiments.

Skills

Machine Learning Operations (MLOps)
Problem-Solving
Collaboration
Communication
Performance Optimization

Education

Bachelor’s degree in Computer Science
Master’s degree in Data Science

Tools

Apache Spark
Apache Airflow
MLflow
Docker
Kubernetes
Terraform
Ansible
AWS
GCP
Azure

Job description

Overview

This is an incredible opportunity to be part of a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader renowned for powering many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research, and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

"The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high performance environments." – Marc Hamilton, VP, Solutions Architecture & Engineering | NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our unwavering commitment to innovation, customer-centricity, and a team of passionate professionals who bring their expertise and dedication to every project. This is a chance to make a significant impact at a company that is shaping the future of AI and data management.

Our commitment to innovation, customer success, and market leadership makes this an exciting and rewarding role for a driven professional looking to make a lasting impact in the world of AI and data storage.

Job Description

We are seeking a talented and experienced Senior Machine Learning Engineer to help us deploy AI/ML training and advanced Retrieval-Augmented Generation (RAG) pipelines for high-performance AI applications. You will design, deploy, and optimize large-scale AI training and inference pipelines, working closely with data scientists and software developers to operationalize models using open-source tools such as Apache Spark, Airflow, and MLflow. You will also help scale our RAG pipelines for AI applications, ensuring robust and efficient deployment.

Key Responsibilities:

  • Design and deploy large-scale AI/ML training pipelines using open-source tools such as Apache Spark and Apache Airflow.
  • Integrate MLflow with DDN’s Infinia product for tracking and managing machine learning experiments, model versioning, and deployment.
  • Implement and scale Retrieval-Augmented Generation (RAG) pipelines to enable efficient retrieval of knowledge for generative models.
  • Automate, monitor, and optimize end-to-end ML workflows and pipelines for production-grade applications.
  • Work collaboratively with cross-functional teams including data science, engineering, and product to operationalize AI/ML models.
  • Maintain and improve CI/CD pipelines for ML models, ensuring smooth transitions from research to production environments.
  • Utilize cloud platforms (AWS, GCP, or Azure) for scalable infrastructure management.
  • Monitor and troubleshoot pipeline performance issues, implementing solutions to optimize runtime and resource usage.
  • Ensure best practices in version control, containerization (Docker, Kubernetes), and infrastructure as code (Terraform, Ansible).
  • Keep up to date with the latest developments in MLOps, AI/ML frameworks, and tooling.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Data Science, Machine Learning, or related fields.
  • 7+ years of experience in machine learning operations (MLOps) or related roles.
  • Extensive experience with Apache Spark, Apache Airflow, and MLflow, or equivalent tools.
  • Proven expertise in building and scaling AI/ML pipelines.
  • Strong understanding of machine learning frameworks and libraries (TensorFlow, PyTorch, NVIDIA NeMo).
  • Experience in deploying open-source vector databases at scale.
  • Solid understanding of cloud infrastructure (AWS, GCP, Azure) and distributed computing.
  • Proficiency with containerization tools (Docker, Kubernetes) and infrastructure as code.
  • Excellent problem-solving and troubleshooting skills, with attention to detail and performance optimization.
  • Strong communication and collaboration skills.

Preferred Qualifications:

  • Experience with large-scale data processing and storage solutions (Hadoop, Hive, HDFS).
  • Knowledge of NLP techniques and tools for model deployment.

This position requires participation in an on-call rotation to provide after-hours support as needed.

Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, both in their work ethic and deliverables, making strong prioritization skills essential. Additionally, we value strong communication skills in all our engineers and researchers, as they are crucial for the success of our teams and the company as a whole.

Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

  • Coding assessment: Often in a language of your choice.
  • Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
  • Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
  • Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender, sex stereotyping, sexual orientation, national origin, disability, protected Veteran Status, or any other characteristic protected by applicable federal, state, or local law.
