Senior Platform Engineer

FIRMUS METAL INTERNATIONAL PTE. LTD.

Singapore

On-site

SGD 90,000 - 120,000

Full time

Today

Job summary

A leading technology firm in Singapore is seeking a Senior Platform Engineer to drive the design and implementation of MLOps capabilities. The ideal candidate will have at least 7 years of relevant experience and a Bachelor's degree in a related field. You will collaborate with engineers to enhance DevOps platforms and develop secure ML workflows. This role is suited for a self-starter passionate about innovative engineering solutions.

Qualifications

  • 7+ years of experience as a Platform Engineer, Site Reliability Engineer, or in a similar role.
  • Demonstrated strong proficiency in CI/CD and configuration management.
  • Clear communication skills in English, both written and spoken.

Responsibilities

  • Drive the design and implementation of MLOps capability.
  • Build MLOps capabilities for secure ML workflows.
  • Continuously improve the DevOps platform for reliability and scalability.

Skills

Infrastructure-as-Code
Containerization technologies
Observability stack design
Compliance automation
Scripting and programming skills
Linux internals knowledge
Effective English communication

Education

Bachelor's degree in computer science or a related technical field

Tools

Terraform
Ansible
Docker
Kubernetes
Grafana
Prometheus
OpenTelemetry
Python

Job description

ROLES AND RESPONSIBILITIES

Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling the Firmus AI factory platform engineering capabilities to planet scale, spanning IaC, container orchestration, observability, the self-service portal, and platform security. This role is ideal for a self‑starter with a passion for building things from first principles. You naturally break down complex problems into their fundamental truths to uncover novel and elegant solutions rather than relying on conventional patterns.

KEY RESPONSIBILITIES
  • Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer-facing environments.
  • Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.
  • Design, implement, operate and secure Kubernetes‑based production infrastructure for high reliability, performance and security, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum‑X800 InfiniBand or Spectrum‑X Ethernet.
  • Develop world‑class observability platforms for internal and external customers to achieve ClusterMAX Platinum tier recognition from SemiAnalysis.
  • Integrate Firmus central services with NVIDIA’s software stack, including Mission Control, NetQ, UFM, and NMX.
  • Lead the enhancement and evangelism of internal platform products that provide a cohesive, composable, secure‑by‑default, and low‑friction self‑service experience that accelerates time to market and reduces engineers’ cognitive load.
  • Drive incident response efforts, participate actively in the on‑call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.

SKILLS AND EXPERIENCE
  • Bachelor's degree in computer science or a related technical field.
  • 7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer, or Observability Engineer.
  • Demonstrated strong proficiency: Infrastructure-as-Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD).
  • Demonstrated strong proficiency: Containerization technologies (e.g., Docker) and Kubernetes networking and cluster management, including upgrades and troubleshooting.
  • Demonstrated strong proficiency: Observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse).
  • Demonstrated strong proficiency: Telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics).
  • Demonstrated strong proficiency: Unified telemetry collection with OpenTelemetry.
  • Demonstrated strong proficiency: Compliance automation (e.g., OPA, Kyverno).
  • Demonstrated strong proficiency: Scripting and programming (e.g., Bash, Python, Go).
  • Demonstrated strong proficiency: Systems knowledge of Linux internals, networking stacks, and distributed storage.
  • Clear and effective English communication, written and spoken.
  • Bonus Points: Experience in high-growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting‑edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.
