Site Reliability Engineer - Scalable Infra for AI

SECOND TALENT SG PTE. LTD.

Singapore

On-site

SGD 60,000 - 100,000

Full time

Today

Job summary

A dynamic AI startup in Asia is seeking a Site Reliability Engineer to own the reliability and scalability of its global systems. The ideal candidate has a Bachelor's degree in Computer Science or equivalent and at least 3 years of experience in SRE, DevOps, or system operations. Responsibilities include managing container and open-source infrastructure clusters, building CI/CD pipelines, and troubleshooting critical incidents. Strong skills in Linux, Kubernetes, and cloud platforms (AWS, GCP, Azure) are essential. Join us in making a difference in the AI landscape.

Qualifications

3+ years in SRE, DevOps, or system operations.
Self-driven with a strong problem-solving mindset.

Responsibilities

Manage container and open-source infrastructure clusters.
Build and maintain CI/CD pipelines, monitoring, and logging tools.
Troubleshoot and resolve critical incidents rapidly.
Enhance system availability through architectural improvements.
Drive automation across all levels of operations.
Work closely with engineering to champion infrastructure best practices.
Participate in 24/7 support rotations.

Skills

Linux

Shell/Python scripting

System-level performance tuning

Cloud platforms (AWS, GCP, Azure)

Kubernetes

Docker

GitLab CI

ArgoCD

MySQL

Redis

Kafka

Nginx

Elasticsearch

Education

Bachelor's degree in Computer Science or equivalent

Overview

Be the infrastructure hero for one of Asia’s most dynamic AI startups. This is an opportunity to own reliability, scalability, and efficiency across global systems.

Requirements

Bachelor's degree in Computer Science or equivalent.
3+ years in SRE, DevOps, or system operations.
Strong skills in Linux, Shell/Python scripting, and system-level performance tuning.
Hands-on experience with cloud platforms (AWS, GCP, Azure).
Advanced knowledge of Kubernetes, Docker, GitLab CI, and ArgoCD.
Familiarity with managing MySQL, Redis, Kafka, Nginx, Elasticsearch, and JVM applications.
Self-driven with a strong problem-solving mindset.
