MLOps Engineer (PyTorch)

Second Talent

Singapore

On-site

SGD 80,000 - 120,000

Full time

Job summary

A cutting-edge technology firm in Singapore is seeking an experienced MLOps Engineer specializing in PyTorch. The ideal candidate will focus on building and maintaining robust, high-quality systems for model training and inference. Key responsibilities include optimizing training pipelines and ensuring codebase scalability and maintainability. Applicants should have strong programming skills in Python and C++, with a deep understanding of systems engineering. This full-time position offers an innovative environment with advanced technology applications.

Qualifications

  • Deep knowledge of PyTorch, including DDP and mixed precision training.
  • Strong programming skills in both C++ and Python.
  • Experience debugging bare-metal servers and Linux administration.

Responsibilities

  • Architect and build end-to-end training and inference pipelines using PyTorch.
  • Take full ownership of the core training codebase.
  • Optimize compute workloads for bare-metal environments.

Skills

Expert-level knowledge of PyTorch
Programming in Python
Programming in C++
Debugging and tuning bare-metal servers
Low-level networking understanding
Building reproducible pipelines

Tools

Ansible
Terraform
SLURM

Job description

Job Title: MLOps Engineer (PyTorch)

Location: Singapore

Job Type: Full-time

About the Opportunity

Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.

This position is a perfect fit for an engineer who is not just focused on model outcomes, but on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.

Key Responsibilities

  • Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
  • Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
  • Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
  • Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
  • Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
  • Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
  • Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
  • Monitor and debug large-scale training jobs running across multiple nodes and GPUs.

Required Qualifications (You Should Have)

  • Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
  • Advanced programming skills in both C++ and Python.
  • A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
  • Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS tuning.
  • A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed training protocols like NCCL and MPI.
  • A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
  • Experience with job schedulers (e.g., SLURM, or custom runners) and cluster monitoring tools.

Preferred Qualifications (Nice-to-Have)

  • Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
  • Active contributions to PyTorch or other open-source ML/HPC tools.
  • Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
  • Experience building out a full logging, observability, and alerting stack for training workloads.

How to Apply

Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.
