X9546VV3 | [Mandarin-speaking Role] Senior Operations Engineer (SRE/AI Platform)

TTUKoffer

Kuala Lumpur

On-site

MYR 150,000 - 200,000

Full time

Today

Job summary

A leading global AI infrastructure provider is seeking a Senior Operations Engineer (SRE) to take end-to-end ownership of production environments. Experience in DevOps, SRE, or cloud operations and expertise in at least one major cloud platform (AWS, GCP, or Azure) are essential. The ideal candidate has strong scripting skills, experience with monitoring tools, and fluency in both English and Mandarin for effective collaboration. This is a high-impact role with competitive compensation of RM10,000 - RM15,000 per month.

Qualifications

5+ years of hands-on experience in DevOps, SRE, or cloud operations.
Expertise in at least one major cloud provider (AWS/GCP/Azure).
Proficiency in at least one scripting language (e.g., Python, Go, Shell).
Experience with monitoring and observability tools (e.g., Prometheus, Grafana).
Professional fluency in both English and Mandarin.

Responsibilities

Assume primary responsibility for availability and performance of AI infrastructure.
Act as first responder for production incidents and perform root cause analysis.
Design, build, and maintain automation tools and scripts.
Develop and refine monitoring and alerting systems.
Manage cloud infrastructure using IaC tools.

Skills

DevOps experience

AWS, GCP, or Azure expertise

Containerization technologies

Scripting proficiency

Monitoring tools experience

Problem-solving skills

Professional fluency in Mandarin

Strong sense of ownership

Tools

Terraform

Ansible

Docker

Kubernetes

Prometheus

Grafana

[Mandarin-speaking Role] Senior Operations Engineer (SRE/AI Platform)

Compensation: RM10,000 - RM15,000

Job Highlights

Join the international team of a leading global AI infrastructure service provider to build and operate cutting-edge AI platforms.
Take end-to-end ownership of production environments for global users, directly impacting core service reliability and performance.
Gain deep exposure to multi-cloud architecture, GPU computing, and automated operations in a high-impact role.
Collaborate in a multicultural environment with engineering teams across China and North America, enhancing bilingual technical communication skills.

Key Responsibilities

End-to-End Service Ownership: Assume primary responsibility for the availability, latency, performance, and efficiency of AI infrastructure products (Model-API, Serverless, GPU Instances).
Incident Management & Response: Act as the first responder for production incidents, perform root cause analysis (RCA), and implement preventive measures. Participate in an on-call rotation.
Automation & Tooling: Design, build, and maintain automation scripts and tools to streamline operational tasks, deployments, and failure recovery.
Monitoring & Alerting: Develop and refine monitoring and alerting systems (e.g., Prometheus/Grafana) to enable proactive issue detection.
Infrastructure as Code (IaC): Manage and provision cloud infrastructure using IaC tools (e.g., Terraform, Ansible) to ensure consistency and repeatability.
Performance & Cost Optimization: Continuously analyze system performance and resource utilization to identify bottlenecks and optimize cloud platform (AWS/GCP/Azure) costs.
Cross-Functional Collaboration: Work closely with engineering teams in China to understand new features, provide operational feedback, and ensure production readiness of new services.

Must-Have Requirements

5+ years of hands‑on experience in DevOps, SRE, or cloud operations, preferably in a tech or cloud service company.
Expertise in at least one major cloud provider (AWS/GCP/Azure); practical experience with containerization and orchestration technologies (Docker/Kubernetes required).
Proficiency in at least one scripting language (e.g., Python, Go, Shell); solid understanding of IaC tools like Terraform/Ansible.
Hands‑on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
Systematic problem‑solving skills with the ability to troubleshoot complex distributed systems under pressure.
Professional fluency in both English and Mandarin (written and spoken) for effective cross‑regional collaboration.
Strong sense of ownership and self‑drive, with the ability to work independently in a remote/distributed team setting.

Nice to Have

Experience with GPU‑accelerated computing.
Knowledge of MLOps tools (e.g., Kubeflow, MLflow).
Familiarity with serverless technologies and CI/CD pipelines.

By applying, you acknowledge that TT UKoffer Ltd may process your personal data for recruitment purposes under the lawful basis of legitimate interest. This includes sharing your CV with potential employers. We comply with UK GDPR regulations, and you may request data removal at any time by contacting apply@ttukoffer.co.uk.
