AI Engineer & Applications

FIRMUS METAL INTERNATIONAL PTE. LTD.

Singapore

On-site

SGD 80,000 - 120,000

Full time

2 days ago

Job summary

A leading AI solutions company is seeking an AI Engineer to establish efficient, production-grade distributed training processes. This role involves building training recipes and optimizing workflows for hyperscale customers. Candidates should have 5–7 years of experience in distributed machine learning, an expert-level understanding of GPU optimization, and familiarity with frameworks like PyTorch and Megatron-LM. The position is full-time and based in Singapore, contributing to the future of sustainable AI infrastructure.

Qualifications

  • 5–7 years of experience in distributed machine learning.
  • Expert-level understanding of GPU optimization.
  • Familiarity with production training frameworks.

Responsibilities

  • Build production-ready training recipes using TorchTitan and Megatron-LM.
  • Document parameter tuning for different scales.
  • Create and validate multi-node NCCL communication patterns.

Skills

Distributed machine learning
GPU optimization
Benchmarking methodology
Hands-on distributed training at scale

Tools

PyTorch
JAX
TorchTitan
Megatron-LM
Job description

Role Summary

The AI Engineer will establish Firmus AI Factory as the foundation for efficient, production-grade distributed training by delivering pre-built training recipes (TorchTitan, Megatron-LM, etc.), evaluation benchmarks, and model guidance. You'll work with customers and internal teams to optimize training efficiency, define baselines, and document best practices. Your templates and benchmarks are the anchor point for our hyperscale customers' training workflows and our model arena differentiator.

Key Responsibilities
  • Build production-ready training recipes using TorchTitan and Megatron-LM: model configs, parallelism strategies (FSDP, tensor/pipeline parallelism), checkpointing patterns.
  • Document parameter tuning for different scales (e.g., "to train Llama 7B on 8xH100s, use this config and expect X throughput").
  • Create and validate multi-node NCCL communication patterns on AI Factory K8s/Slurm clusters.
  • Design and build benchmarking suites: accuracy, latency, throughput (tokens/sec), cost per token, energy efficiency, MFU.
  • Implement offline evaluation harnesses for standardized model comparison and leaderboard tracking.
  • Conduct fine-tuning experiments (LoRA, QLoRA) where they improve product outcomes (e.g., ops domain data), document gains.
  • Create training efficiency playbooks and publish benchmark results so customers can optimize workloads.
  • Partner with job-scheduling and orchestration engineers on template integration, and with other AI and software engineers on model optimization trade-offs for inference and AI applications.

Skills & Experience
  • 5–7 years of experience in distributed machine learning (PyTorch/JAX, FSDP, DeepSpeed, multi-node training at 10+ GPUs).
  • Expert-level understanding of GPU optimization: utilization, memory patterns, communication bottlenecks (NCCL collectives).
  • Hands-on distributed training at scale: debugged convergence issues, profiled bottlenecks, optimized throughput.
  • Strong benchmarking methodology: design controlled experiments, measure noise, and communicate results rigorously.
  • Familiarity with TorchTitan, Megatron-LM, or similar production training frameworks.
  • Understanding of model parallelism strategies and their trade-offs (e.g., FSDP vs. tensor parallelism vs. pipeline parallelism).

Key Competencies
  • Distributed Systems Mastery: can explain NCCL, collective communications, and scaling inefficiency.
  • Benchmarking Rigor: doesn't just run benchmarks; validates assumptions, explains variance, communicates uncertainty.
  • Production Thinking: understands checkpointing, recovery, resource constraints, and cost optimization.
  • Mentorship: can guide engineers on training best practices and debugging distributed training issues.
  • Documentation: creates clear, actionable playbooks that customers can follow.

Success Metrics
  • Benchmark credibility & decision impact increases: benchmarks are trusted and used to drive model/hardware/product decisions.
  • Training efficiency leadership: sustained improvement in benchmarked training efficiency on representative workloads.
  • Shorter time-to-validate new models: model candidates can be evaluated quickly and consistently end-to-end.
  • Template effectiveness improves: recipes reduce misconfigurations and repeated setup failures; fewer training config escalations.
  • Competitive differentiation strengthens: model arena outputs influence customer adoption and internal roadmap priorities.

Location & Reporting
  • Singapore or Australia (Launceston, TAS or Sydney, NSW)
  • Reporting to Head of AI & Applications

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.
