Senior Platform Engineer

FIRMUS METAL INTERNATIONAL PTE. LTD.

Singapore

On-site

SGD 80,000 - 105,000

Full time

2 days ago

Job summary

A pioneering technology firm in Singapore is seeking a Senior Platform Engineer to lead the design and implementation of MLOps capabilities. This role involves building secure and scalable platforms, improving DevOps processes, and collaborating on AI Factory platform engineering. The ideal candidate has over 7 years of experience in platform engineering and a strong background in technologies such as Kubernetes, Docker, and Infrastructure as Code. Join us in revolutionizing the AI industry through sustainable engineering practices.

Benefits

Diverse and inclusive workplace

Commitment to a sustainable future through engineering

Qualifications

7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer or Observability Engineer.
Demonstrated strong proficiency in CI/CD tools such as GitHub Actions and Jenkins.
Competent in scripting languages such as Bash, Python, or Go.

Responsibilities

Build MLOps capabilities for reproducible and secure ML workflows.
Improve reliability and scalability of the DevOps platform.
Design and operate Kubernetes-based production infrastructure.

Skills

Infrastructure-as-Code

Containerization technologies

Observability stack design

Scripting and programming skills

Clear and effective English communication

Education

Bachelor’s degree in computer science or a related technical field

Tools

Terraform

Docker

Kubernetes

Grafana

Prometheus

ROLE

Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling Firmus AI Factory platform engineering capabilities to planet scale, spanning IaC, container orchestration, observability, the self‑service portal, and platform security. This role is ideal for a self‑starter with a passion for building things from first principles. You naturally break complex problems down into their fundamental truths to uncover novel and elegant solutions rather than relying on conventional patterns.

KEY RESPONSIBILITIES

Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer‑facing environments.
Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.
Design, implement, operate and secure Kubernetes‑based production infrastructure for high reliability, performance and security, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum‑X800 InfiniBand or Spectrum‑X Ethernet.
Develop world‑class observability platforms for internal and external customers.
Integrate Firmus central services with NVIDIA’s software stack, including Mission Control, NetQ, UFM, and NMX.
Lead the enhancement and evangelism of internal platform products that provide cohesive, composable, secure‑by‑default, and low‑friction self‑service experiences that accelerate time to market and reduce engineers’ cognitive load.
Drive incident response efforts, participate actively in the on‑call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.

SKILLS AND EXPERIENCE

Bachelor’s degree in computer science or a related technical field.
7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer or Observability Engineer.
Demonstrated strong proficiency in the following areas: Infrastructure‑as‑Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD); containerization technologies (e.g., Docker); Kubernetes networking and cluster management, including upgrades and troubleshooting; observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse); telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics); unified telemetry collection with OpenTelemetry; compliance automation (e.g., OPA, Kyverno).
Competent in scripting and programming (e.g., Bash, Python, Go).
Systems knowledge of Linux internals, networking stacks, and distributed storage.
Clear and effective English communication, written and spoken.
Bonus: Experience in high‑growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.

ABOUT FIRMUS TECHNOLOGIES

Firmus Technologies is a global leader pioneering the solution to AI’s energy challenge, founded in Australia in 2019 by a visionary team of entrepreneurs and engineers passionate about sustainable computing infrastructure.

Firmus builds and operates AI infrastructure across Asia‑Pacific, utilising its proprietary AI Factory platform to deliver transformative cost‑effective GPU clusters and AI cloud services for developers, enterprise, education and government users.

We are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionize the AI industry through sustainable practices and cutting‑edge engineering.
