Job Title: MLOps Engineer (PyTorch)
Location: Singapore
Job Type: Full-time
About the Opportunity
Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.
This position is a perfect fit for an engineer who is focused not just on model outcomes, but on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.
Key Responsibilities
- Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
- Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
- Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
- Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
- Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
- Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
- Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
- Monitor and debug large-scale training jobs running across multiple nodes and GPUs.
Required Qualifications (You Should Have)
- Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
- Advanced programming skills in both C++ and Python.
- A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
- Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS configuration.
- A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed communication libraries such as NCCL and MPI.
- A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
- Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.
Preferred Qualifications (Nice-to-Have)
- Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
- Active contributions to PyTorch or other open-source ML/HPC tools.
- Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
- Experience building out a full logging, observability, and alerting stack for training workloads.
How to Apply
Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.