Observability, Automation & AI Ops Engineer
Metlife Solutions Pte Ltd – Kuala Lumpur, Kuala Lumpur
Shortlisted candidates will be invited to apply for the MetLife KL IT Infrastructure Engineering Challenge Hackathon on January 31, 2026.
The Observability, Automation & AI Ops Engineer is responsible for designing, implementing, and optimizing advanced monitoring, automation, and AI‑driven operations solutions across MetLife’s hybrid cloud and on‑premises environments. This role ensures high availability, reliability, and efficiency of IT services by leveraging modern observability platforms, automation frameworks, and artificial intelligence for proactive incident management and continuous improvement.
Key Responsibilities
- Observability Engineering
  - Design, deploy, and manage observability platforms (e.g., Elastic, Splunk, Prometheus, Grafana, OpenTelemetry) for end‑to‑end visibility of applications, infrastructure, and business services.
  - Develop and maintain telemetry pipelines for logs, metrics, traces, and events.
  - Build dashboards and automated alerting systems with AI‑powered anomaly detection.
  - Collaborate with DevOps, SRE, and application teams to integrate observability into CI/CD pipelines and cloud‑native architectures.
  - Analyze system health, identify trends, and drive data‑driven decisions for performance optimization and reliability.
- Automation Engineering
  - Design, implement, and maintain automation solutions for infrastructure provisioning, configuration management, and operational workflows (e.g., Ansible, Terraform, CI/CD tools).
  - Develop self‑healing scripts and intelligent runbooks for automated incident response and remediation.
  - Integrate automation with monitoring and ITSM tools to streamline operations and reduce manual intervention.
  - Lead or participate in automation projects to improve efficiency, reduce errors, and support business agility.
  - Stay current with emerging automation technologies and best practices.
- AIOps Engineering
  - Implement and maintain AI‑driven systems for real‑time monitoring, predictive analytics, and automated root cause analysis.
  - Develop and train machine learning models using operational data (logs, metrics, events, traces) for anomaly detection and forecasting.
  - Deploy and manage AIOps platforms (e.g., Moogsoft, Dynatrace, Datadog, Elastic) to enable proactive incident management and self‑healing capabilities.
  - Collaborate with IT, DevOps, and Data Science teams to integrate AI/ML into IT operations and service management.
  - Monitor and optimize AI model performance, ensuring reliability and continuous improvement.
Qualifications & Skills
- Associate: 0–2 years of experience in observability, automation, or IT operations.
- Engineer: 2–5 years of relevant experience.
- Senior: 5+ years of relevant experience, with demonstrated technical and/or team leadership.
- Proficiency in observability platforms (Elastic, Splunk, Prometheus, Grafana, OpenTelemetry).
- Strong experience with automation tools (Ansible, Terraform, CI/CD, scripting languages).
- Familiarity with AIOps platforms and AI/ML frameworks (Scikit‑learn, TensorFlow, PyTorch).
- Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Excellent troubleshooting, analytical, and communication skills.
- (Senior Level) Ability to lead, mentor, and manage technical teams.
- Relevant certifications in observability, automation, cloud, or AI/ML platforms are a plus.
- Knowledge of, or certification in, ITIL v4.
- Business proficiency in English.