Observability, Automation & AI Ops Engineer

Metlife Solutions Pte Ltd

Kuala Lumpur

On-site

MYR 60,000 - 90,000

Full time

Today

Job summary

A leading IT solutions provider in Kuala Lumpur is seeking an Observability, Automation & AI Ops Engineer to design, implement, and optimize monitoring and automation solutions across hybrid cloud environments. The role involves ensuring IT service availability and efficiency through advanced observability platforms and AI-driven operations. Candidates should have experience with tools like Elastic, Ansible, and AIOps frameworks, as well as strong analytical and communication skills. Proficiency in English is required, and relevant certifications are preferred.

Qualifications

0–2 years in observability, automation, or IT operations for Associate role.
2–5 years relevant experience for Engineer role.
5+ years with demonstrated technical and/or team leadership for Senior role.
Relevant certifications in observability, automation, cloud, or AI/ML platforms are a plus.
Business proficiency in English is required.

Responsibilities

Design, deploy, and manage observability platforms for end-to-end visibility of applications.
Develop and maintain telemetry pipelines for logs, metrics, traces, and events.
Implement and maintain AI-driven systems for real-time monitoring and predictive analytics.

Skills

Proficiency in observability platforms (Elastic, Splunk, Prometheus, Grafana, OpenTelemetry)

Strong experience with automation tools (Ansible, Terraform, CI/CD, scripting languages)

Familiarity with AIOps platforms and AI/ML frameworks (Scikit-learn, TensorFlow, PyTorch)

Excellent troubleshooting, analytical, and communication skills

Education

Bachelor's degree in Computer Science or related field

Tools

Elastic

Splunk

Prometheus

Grafana

Ansible

Terraform

Moogsoft

Dynatrace

Datadog

Shortlisted candidates will be invited to apply for the MetLife KL IT Infrastructure Engineering Challenge Hackathon on January 31, 2026.

The Observability, Automation & AI Ops Engineer is responsible for designing, implementing, and optimizing advanced monitoring, automation, and AI‑driven operations solutions across MetLife’s hybrid cloud and on‑premises environments. This role ensures high availability, reliability, and efficiency of IT services by leveraging modern observability platforms, automation frameworks, and artificial intelligence for proactive incident management and continuous improvement.

Key Responsibilities

Observability Engineering
Design, deploy, and manage observability platforms (e.g., Elastic, Splunk, Prometheus, Grafana, OpenTelemetry) for end‑to‑end visibility of applications, infrastructure, and business services.
Develop and maintain telemetry pipelines for logs, metrics, traces, and events.
Build dashboards and automated alerting systems with AI‑powered anomaly detection.
Collaborate with DevOps, SRE, and application teams to integrate observability into CI/CD pipelines and cloud‑native architectures.
Analyze system health, identify trends, and drive data‑driven decisions for performance optimization and reliability.

Automation Engineering
Design, implement, and maintain automation solutions for infrastructure provisioning, configuration management, and operational workflows (e.g., Ansible, Terraform, CI/CD tools).
Develop self‑healing scripts and intelligent runbooks for automated incident response and remediation.
Integrate automation with monitoring and ITSM tools to streamline operations and reduce manual intervention.
Lead or participate in automation projects to improve efficiency, reduce errors, and support business agility.
Stay current with emerging automation technologies and best practices.

AIOps Engineering
Implement and maintain AI‑driven systems for real‑time monitoring, predictive analytics, and automated root cause analysis.
Develop and train machine learning models using operational data (logs, metrics, events, traces) for anomaly detection and forecasting.
Deploy and manage AIOps platforms (e.g., Moogsoft, Dynatrace, Datadog, Elastic) to enable proactive incident management and self‑healing capabilities.
Collaborate with IT, DevOps, and Data Science teams to integrate AI/ML into IT operations and service management.
Monitor and optimize AI model performance, ensuring reliability and continuous improvement.

Qualifications & Skills

Associate: 0–2 years in observability, automation, or IT operations.
Engineer: 2–5 years relevant experience.
Senior: 5+ years with demonstrated technical and/or team leadership.

Proficiency in observability platforms (Elastic, Splunk, Prometheus, Grafana, OpenTelemetry).
Strong experience with automation tools (Ansible, Terraform, CI/CD, scripting languages).
Familiarity with AIOps platforms and AI/ML frameworks (Scikit‑learn, TensorFlow, PyTorch).
Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
Excellent troubleshooting, analytical, and communication skills.
(Senior Level) Ability to lead, mentor, and manage technical teams.

Relevant certifications in observability, automation, cloud, or AI/ML platforms, or in ITIL v4, are a plus.

Business proficiency in English.
