
AI Infrastructure Scientist

SHANDA GROUP PTE. LTD.

Singapore

On-site

SGD 80,000 - 120,000

Full time

2 days ago


Job summary

A leading technology firm in Singapore is seeking an experienced engineer to develop large-model training systems and optimize AI infrastructure. Applicants should hold a Master's degree in Computer Science or a related field and have more than 3 years of relevant experience in distributed systems and AI infrastructure. The role carries significant responsibility for system architecture and production-level reliability, along with hands-on work on GPU clusters and emerging training paradigms.

Qualifications

  • 3+ years of experience in distributed systems, AI infrastructure, or large-scale ML systems.
  • Proven experience running production-scale AI training workloads with thousands of GPUs.
  • Strong systems programming background.

Responsibilities

  • Design and optimize distributed training systems for large language and multimodal models.
  • Architect and optimize GPU/AI accelerator clusters.
  • Establish stability, fault-tolerance, and correctness guarantees for training jobs.

Skills

Distributed systems
AI infrastructure
Performance optimization

Education

Master's in Computer Science, Computer Engineering, or a related field

Job description

Join our team to scale our next-generation large-model training systems and AI infrastructure. This role sits at the intersection of distributed systems, GPU clusters, networking, and large-scale model training, with end-to-end ownership from system architecture to production-level reliability. The role also involves cross-institution collaboration and long-term technical strategy.

Key Responsibilities
  • Large-Scale Model Training Systems
    • Design and optimization of distributed training systems for large language and multimodal models (7B–600B+ parameters).
    • Drive innovations in hybrid parallelism (data / tensor / pipeline / sequence parallelism) to maximize performance and efficiency.
    • Optimize long-sequence training to significantly reduce memory footprint and improve scalability.
    • Support emerging training paradigms including RL-based training and new model architectures.
  • AI Infrastructure & Cluster Architecture
    • Architect and optimize GPU/AI accelerator clusters (10,000+ GPUs), including topology-aware scheduling and resource orchestration.
    • Design high-performance networking solutions leveraging InfiniBand, RoCE, and RDMA, tailored for AI workloads.
    • Enable training systems across heterogeneous hardware platforms (e.g., NVIDIA GPUs, domestic accelerators such as Ascend 910B).
  • System Reliability & Production Readiness
    • Establish stability, fault-tolerance, and correctness guarantees for large-scale training jobs.
    • Design monitoring, alerting, and automated recovery mechanisms for long-running training tasks.
    • Build and lead specialized teams to ensure reliable delivery of large-model training at scale.
Qualifications Required
  • Master's in Computer Science, Computer Engineering, or a related field.
  • 3+ years of experience in distributed systems, AI infrastructure, or large-scale ML systems.
  • Proven experience running production-scale AI training workloads (thousands of GPUs).
  • Strong systems programming background and performance optimization skills.