MLOps Engineer (PyTorch)

Second Talent

Singapore

On-site

SGD 80,000 - 120,000

Full time

Today

Job summary

A cutting-edge technology firm is seeking an experienced MLOps Engineer specializing in PyTorch in Singapore. The ideal candidate will focus on building and maintaining robust, high-quality systems for model training and inference. Key responsibilities include optimizing training pipelines and ensuring codebase scalability and maintainability. Applicants should have strong programming skills in Python and C++, with a deep understanding of systems engineering. This full-time position offers an innovative environment with advanced technology applications.

Qualifications

Deep knowledge of PyTorch, including DDP and mixed precision training.
Strong programming skills in both C++ and Python.
Experience debugging bare-metal servers and Linux administration.

Responsibilities

Architect and build end-to-end training and inference pipelines using PyTorch.
Take full ownership of the core training codebase.
Optimize compute workloads for bare-metal environments.

Skills

Expert-level knowledge of PyTorch

Programming in Python

Programming in C++

Debugging and tuning bare-metal servers

Low-level networking understanding

Building reproducible pipelines

Tools

Ansible

Terraform

SLURM

Job Title: MLOps Engineer (PyTorch)

Location: Singapore

Job Type: Full-time

About the Opportunity

Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.

This position is a perfect fit for an engineer who is not just focused on model outcomes, but on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.

Key Responsibilities

Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
Monitor and debug large-scale training jobs running across multiple nodes and GPUs.

Required Qualifications (You Should Have)

Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
Advanced programming skills in both C++ and Python.
A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS tuning.
A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed training protocols like NCCL and MPI.
A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.

Preferred Qualifications (Nice-to-Have)

Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
Active contributions to PyTorch or other open-source ML/HPC tools.
Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
Experience building out a full logging, observability, and alerting stack for training workloads.

How to Apply

Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.
