Job Title: MLOps Engineer (PyTorch)
Location: Singapore
Job Type: Full-time
About the Opportunity
Our client is seeking an MLOps Engineer with a strong background in systems programming and infrastructure engineering. This role is focused on owning and evolving the on-premise infrastructure that powers their advanced PyTorch-based training workloads.
This position is a perfect fit for an engineer who is focused not just on model outcomes, but on the quality and robustness of the underlying systems. You will be responsible for building high-quality, maintainable training pipelines, solving low-level systems and networking challenges, and ensuring the training codebase is clean, scalable, and built to last.
Key Responsibilities
- Architect, build, and maintain end-to-end training and inference pipelines using PyTorch.
- Develop and maintain high-quality, robust tooling in both Python and C++ to support the entire model training lifecycle.
- Take full ownership of the core training codebase, enforcing best practices for clarity, modularity, reproducibility, and performance.
- Design and implement workflows for checkpointing, resuming jobs, model versioning, and experiment tracking.
- Proactively optimize compute workloads for bare-metal environments, focusing on I/O bottlenecks, CPU/GPU utilization, and memory efficiency.
- Troubleshoot and debug complex, low-level issues, including networking bottlenecks, distributed training errors (e.g., NCCL), and hardware faults.
- Configure and manage all ML environments, including containers, package management, GPU drivers, and runtime configurations.
- Monitor and debug large-scale training jobs running across multiple nodes and GPUs.
Required Qualifications (You Should Have)
- Deep, expert-level knowledge of PyTorch, including DDP (DistributedDataParallel), mixed precision training, and TorchScript.
- Advanced programming skills in both C++ and Python.
- A solid background in computer science fundamentals (data structures, algorithms, concurrency, operating systems).
- Hands-on experience debugging and tuning bare-metal servers, including Linux administration, kernel parameter tuning, and BIOS configuration.
- A strong understanding of low-level networking (e.g., RoCE, InfiniBand), interconnects, and distributed communication libraries such as NCCL and MPI.
- A proven track record of building reliable, reproducible pipelines for both model training and evaluation.
- Experience with job schedulers (e.g., SLURM or custom runners) and cluster monitoring tools.
Preferred Qualifications (Nice-to-Have)
- Experience with non-standard deployments, such as on-premise local clusters or edge devices (i.e., not public cloud).
- Active contributions to PyTorch or other open-source ML/HPC tools.
- Familiarity with Infrastructure-as-Code (IaC) tools like Ansible, Terraform, or Nix.
- Experience building out a full logging, observability, and alerting stack for training workloads.
How to Apply
Interested candidates are invited to submit their resume, detailing their experience in managing PyTorch workloads on bare-metal infrastructure.