
AI Infrastructure Scientist

SHANDA GROUP PTE. LTD.

Singapore

On-site

SGD 80,000 - 120,000

Full time

2 days ago


Job summary

A leading technology firm in Singapore is seeking an experienced engineer to develop large-model training systems and optimize AI infrastructure. Applicants should hold a Master's degree in Computer Science or a related field and have more than 3 years of relevant experience in distributed systems and AI infrastructure. The role carries significant responsibility for system architecture and production-level reliability, along with hands-on work on GPU clusters and emerging training paradigms.

Qualifications

  • 3+ years of experience in distributed systems, AI infrastructure, or large-scale ML systems.
  • Proven experience running production-scale AI training workloads with thousands of GPUs.
  • Strong systems programming background.

Responsibilities

  • Design and optimize distributed training systems for large language and multimodal models.
  • Architect and optimize GPU/AI accelerator clusters.
  • Establish stability, fault-tolerance, and correctness guarantees for training jobs.

Skills

Distributed systems
AI infrastructure
Performance optimization

Education

Master's in Computer Science, Computer Engineering, or a related field

Job description

Join our team to scale our next-generation large-model training systems and AI infrastructure. This role sits at the intersection of distributed systems, GPU clusters, networking, and large-scale model training, with end-to-end ownership from system architecture to production-level reliability. The role also involves cross-institution collaboration and long-term technical strategy.

Key Responsibilities
  • Large-Scale Model Training Systems
    • Design and optimization of distributed training systems for large language and multimodal models (7B–600B+ parameters).
    • Drive innovations in hybrid parallelism (data / tensor / pipeline / sequence parallelism) to maximize performance and efficiency.
    • Optimize long-sequence training to significantly reduce memory footprint and improve scalability.
    • Support emerging training paradigms including RL-based training and new model architectures.
  • AI Infrastructure & Cluster Architecture
    • Architect and optimize GPU/AI accelerator clusters (10,000+ GPUs), including topology-aware scheduling and resource orchestration.
    • Design high-performance networking solutions leveraging InfiniBand, RoCE, and RDMA, tailored for AI workloads.
    • Enable training systems across heterogeneous hardware platforms (e.g., NVIDIA GPUs, domestic accelerators such as Ascend 910B).
  • System Reliability & Production Readiness
    • Establish stability, fault-tolerance, and correctness guarantees for large-scale training jobs.
    • Design monitoring, alerting, and automated recovery mechanisms for long-running training tasks.
    • Build and lead specialized teams to ensure reliable delivery of large-model training at scale.
Qualifications Required
  • Master's in Computer Science, Computer Engineering, or a related field.
  • 3+ years of experience in distributed systems, AI infrastructure, or large-scale ML systems.
  • Proven experience running production-scale AI training workloads (thousands of GPUs).
  • Strong systems programming background and performance optimization skills.