Senior DevOps Engineer, ML Infrastructure

Serve Robotics

Ottawa, Toronto, Montreal (administrative region)

Hybrid

CAD 130,000 - 160,000

Full time

2 days ago

Job summary

A technology company in Canada is seeking a Senior DevOps Engineer to design and maintain their ML infrastructure. The role involves managing cloud and on-premise environments, automating deployment pipelines, and ensuring system security, reliability, and performance. Candidates should have robust experience with cloud platforms and container orchestration. The base salary range is $130k - 160k CAD depending on experience.

Qualifications

5+ years of experience as a DevOps, SRE, or Infrastructure Engineer, preferably supporting ML or data-intensive systems.
Solid understanding of system security, reliability, and observability.

Responsibilities

Deploy and maintain an ML training orchestration system across multiple platforms.
Manage cloud and on-premise environments for large-scale distributed data processing.
Optimize infrastructure costs and resource utilization.

Skills

Cloud platforms (AWS, GCP, or Azure)

Container orchestration (Kubernetes, Docker)

Infrastructure-as-code tools (Terraform, Helm)

CI/CD systems (GitLab CI, Jenkins, ArgoCD)

Python

SQL

Cloud security, IAM

GPU cluster management

Education

Bachelor’s or Master’s degree in Computer Science or Engineering

At Serve Robotics, we’re reimagining how things move in cities. Our personable sidewalk robot is our vision for the future. It’s designed to take deliveries away from congested streets, make deliveries available to more people, and benefit local businesses.

The Serve fleet has been delighting merchants, customers, and pedestrians along the way in Los Angeles, Miami, Dallas, Atlanta and Chicago while doing commercial deliveries. We’re looking for talented individuals who will grow robotic deliveries from surprising novelty to efficient ubiquity.

Who We Are

We are tech industry veterans in software, hardware, and design who are pooling our skills to build the future we want to live in. We are solving real-world problems leveraging robotics, machine learning and computer vision, among other disciplines, with a mindful eye towards the end-to-end user experience. Our team is agile, diverse, and driven. We believe that the best way to solve complicated dynamic problems is collaboratively and respectfully.

As a Senior DevOps Engineer on the Machine Learning (ML) Infrastructure team, you will help design, build, and maintain our petabyte-scale data and ML platform that powers data partnerships, ML research, and autonomy engineering. You will play a key role in ensuring reliability, security, scalability, and performance across our internal systems, and maintain a suite of internal tools used by dozens of engineers. Your work will make a significant impact on our autonomous capabilities and act as a catalyst for the entire autonomy team, helping us train our next generation of ML models.

Responsibilities

Deploy and maintain our ML training orchestration system that operates across multiple platforms.
Manage cloud and on-premise environments for large-scale distributed data processing and ML training/inference systems.
Automate deployment pipelines, monitoring, and alerting for ML and data services.
Collaborate closely with data scientists, ML engineers, and autonomy teams to streamline experimentation and model deployment.
Maintain and improve CI/CD systems to support rapid development and testing.
Implement best practices for system security, reliability, and observability.
Optimize infrastructure costs and ensure efficient resource utilization.
Improve internal developer productivity through tooling, documentation, and support.

Qualifications

Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent experience.
5+ years of experience as a DevOps, SRE, or Infrastructure Engineer, preferably supporting ML or data-intensive systems.
Strong experience with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes, Docker).
Proficiency in infrastructure-as-code tools such as Terraform or Helm.
Solid understanding of CI/CD systems (GitLab CI, Jenkins, ArgoCD, etc.).
Experience with Python and SQL.
Experience with cloud security, IAM (Identity and Access Management), and access control.
Experience analyzing and optimizing hardware performance.
Experience with GPU cluster management.

What Makes You Stand Out

Experience managing large-scale distributed data processing systems.
Experience analyzing and optimizing ML training workloads.
Background in observability stacks (Prometheus, Grafana, ELK, OpenTelemetry).
Contributions to open-source DevOps or ML infrastructure projects.

* Please note: The base salary range listed in this job description reflects compensation for candidates based in the United States. While we prefer candidates located in the U.S., we are also open to qualified talent working remotely across:

Canada - Base salary range (Canada - all locations): $130k - 160k CAD
