MLOps Engineer (PyTorch)

Second Talent

Singapore

On-site

SGD 103,000 - 156,000

Full time

Today

Job summary

A technology solutions firm is looking for an MLOps Engineer specializing in PyTorch to manage the on-premise infrastructure for advanced training workloads. The role involves architecting training and inference pipelines, ensuring high-quality code, and troubleshooting complex issues. Ideal candidates will have expert knowledge of PyTorch, strong experience in C++ and Python, and a solid background in computer science. This full-time position is based in Singapore and will require a proactive approach to optimizing compute workloads.

Qualifications

  • Expert-level knowledge of PyTorch, including DDP, mixed precision training, and TorchScript.
  • Advanced programming skills in both C++ and Python.
  • Solid background in computer science fundamentals including data structures and algorithms.
  • Hands-on experience debugging and tuning bare-metal servers, including Linux administration.
  • Understanding of low-level networking and distributed training protocols.
  • Proven track record of building reliable, reproducible pipelines.

Responsibilities

  • Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
  • Develop high-quality tooling in Python and C++ to support the model training lifecycle.
  • Take ownership of the core training codebase for clarity and reproducibility.
  • Design workflows for checkpointing, resuming jobs, and model versioning.
  • Optimize compute workloads for bare-metal environments.
  • Troubleshoot low-level issues, including networking bottlenecks and hardware faults.

Skills

Expert-level knowledge of PyTorch
Advanced programming in C++
Advanced programming in Python
Computer science fundamentals
Debugging bare-metal servers
Low-level networking knowledge
Building reproducible pipelines
Job scheduler experience

Tools

PyTorch
Linux
SLURM

Job description

Job Title: MLOps Engineer (PyTorch)

Location: Singapore

Job Type: Full-time

About the Opportunity

Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.

This position is a perfect fit for an engineer who is focused not just on model outcomes, but on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.

Key Responsibilities

  • Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
  • Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
  • Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
  • Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
  • Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
  • Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
  • Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
  • Monitor and debug large-scale training jobs running across multiple nodes and GPUs.

Required Qualifications (You Should Have)

  • Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
  • Advanced programming skills in both C++ and Python.
  • A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
  • Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS tuning.
  • A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed training protocols like NCCL and MPI.
  • A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
  • Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.

Preferred Qualifications (Nice-to-Have)

  • Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
  • Active contributions to PyTorch or other open-source ML/HPC tools.
  • Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
  • Experience building out a full logging, observability, and alerting stack for training workloads.

How to Apply

Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.