Senior DevOps Engineer GPUaaS

Singtel Group

Singapore

On-site

SGD 60,000 - 80,000

Full time

25 days ago

Job summary

A leading technology company in Singapore is seeking a DevOps Engineer for its GPU-as-a-Service division. This role involves designing and supporting GPU clusters, managing resources, and optimizing systems for AI workloads. Candidates should hold a degree in Computer Science and have experience with DevOps tools like Kubernetes and Jenkins. Opportunities for ongoing training and career growth are available, making this an ideal start for those eager to advance in AI and cloud platforms.

Benefits

Full suite of health and wellness benefits
Ongoing training and development programs
Internal mobility opportunities

Qualifications

  • Bachelor’s degree in Computer Science, Engineering, Information Technology, or related field.
  • Experience with DevOps tools such as Jenkins, Kubernetes, Ansible, and Terraform.
  • Proficiency in scripting languages like Python or Bash.

Responsibilities

  • Design, deploy and support large-scale GPU clusters for AI.
  • Manage and automate provisioning of GPU resources.
  • Troubleshoot system-level compute resource issues.
  • Optimize system parameters for AI workload performance.

Skills

Experience with DevOps tools
Proficiency in scripting languages
Strong problem-solving skills
Team player

Education

Bachelor’s degree in Computer Science

Tools

Jenkins
Kubernetes
Terraform
Zabbix
Prometheus

Job description

Singtel Digital InfraCo’s RE:AI division is building Asia’s most advanced and sustainable AI infrastructure ecosystem. RE:AI enables enterprises, research institutions, and digital-native businesses to accelerate innovation through responsible, high-performance AI compute and connectivity solutions.

Be a Part of Something BIG!

As a DevOps Engineer for Singtel’s GPU-as-a-Service (GPUaaS), you will help implement processes and integrate operations to advance customers’ AI and HPC capabilities. You will be exposed to both physical data center implementation and software solutions in Singtel’s GPUaaS environment. This position requires a forward-thinking individual who thrives in dynamic environments and is committed to driving continuous improvement in GPU infrastructure for AI and HPC environments. This is an excellent opportunity for someone eager to start their career in DevOps and grow their expertise in AI and HPC cloud platforms.

Responsibilities
  • Design, deploy and support large-scale, distributed GPU clusters for AI and ML workloads.
  • Manage and automate provisioning of GPU resources in both on-prem and cloud platforms.
  • Design, implement and manage CI/CD pipelines for AI models and GPU-accelerated applications.
  • Monitor cluster usage, health, performance and availability.
  • Improve infrastructure provisioning, management, and monitoring through automation.
  • Troubleshoot system-level compute resource issues involving Slurm, Kubernetes, GPU drivers, CUDA, and InfiniBand (IB) networking.
  • Optimize system parameters (e.g., OS, drivers, networking, library) for AI workload performance.
  • Conduct GPU cluster benchmarks and keep up with the latest advancements in GPU technology.
  • Set up monitoring and logging for GPU resources using Zabbix, Prometheus, NVIDIA DCGM and other tools.
  • Implement security best practices for a multi-tenant GPU-as-a-Service (GPUaaS) environment.
  • Collaborate with software engineers and system administrators to streamline workflows and improve collaboration.
  • Provide technical support and guidance to users of GPU-accelerated systems.
  • Work with senior DevOps engineers to identify bottlenecks and improve development and operational processes for AI and HPC GPU cloud.
  • Learn to solve problems in high-performance distributed computation for AI and HPC GPU cloud computing.
  • This role may require availability outside standard work hours, including nights, weekends and public holidays.
Requirements
  • Bachelor’s degree in Computer Science/Engineering, Information Technology, Systems Engineering, or a related field.
  • Experience with DevOps tools such as Jenkins, Kubernetes, Ansible and Terraform.
  • Solid understanding of DevOps practices, including CI/CD, automation, and monitoring.
  • Proficiency in scripting languages (e.g., Python, Bash).
  • Experience in implementing monitoring solutions such as Zabbix and Prometheus.
  • Familiarity with AI frameworks such as TensorFlow and PyTorch.
  • Understanding of cloud architectures (IaaS, PaaS), GPU architecture and NVIDIA GPUs.
  • Strong verbal, written, and presentation skills in English.
  • Team player with experience in cross-functional coordination.
  • Strong technical problem-solving and analytical skills for system optimization.
  • Understanding of how collective communication (MPI, RDMA, and NCCL) works, and of how GPU-specific acceleration works on a GPU cluster.
  • Knowledge of DevOps/MLOps technologies in GPU clusters, such as Docker/containers, Kubernetes, and data center deployments.
  • Familiarity with Slurm or other HPC workload managers to manage GPU clusters.
  • Understanding of AI & HPC networking technologies such as InfiniBand, RoCE, DPUs.
  • System-level experience, specifically with GPU-based systems (NVIDIA GPUs and SDKs).
  • Understanding of how AI and HPC workloads interact with both GPU hardware and software infrastructure.
Rewards that Go Beyond
  • Full suite of health and wellness benefits
  • Ongoing training and development programs
  • Internal mobility opportunities
Your Career Growth Starts Here. Apply Now!