JOB DESCRIPTION
We are seeking a proactive and technically skilled Enterprise Observability Centre (EOC) Administrator with 2 to 3 years of experience in observability operations and platform management. The ideal candidate will be familiar with observability platforms that span the entire IT stack and capable of configuring, fine-tuning, and optimising these platforms for performance and reliability. This role requires cross-domain collaboration for requirements gathering and solution delivery. The candidate must also have experience in SLA and incident management and be familiar with service delivery processes aligned with ITIL practices. Flexibility to work outside regular office hours and strong attention to detail are essential. Programming skills are required to support automation, integration, and customisation of observability solutions.
JOB RESPONSIBILITIES
Support day-to-day operations of the Enterprise Observability Centre
- Monitor system health, performance, and availability across IT infrastructure.
- Respond to alerts and incidents, ensuring timely resolution and escalation.
- Maintain operational dashboards and reporting tools.
Manage, configure, and fine-tune observability platforms
- Administer observability tools (e.g., Grafana, Prometheus, Splunk, Dynatrace).
- Optimise data collection, storage, and visualisation for performance.
- Implement custom metrics, logs, and traces to enhance observability coverage.
Ensure end-to-end observability across the entire IT stack
- Integrate observability tools with infrastructure, applications, and services.
- Ensure visibility into network, server, application, and cloud environments.
- Identify gaps in monitoring and propose solutions to improve coverage.
Collaborate with cross-domain stakeholders for requirements gathering
- Engage with infrastructure, application, and security teams to understand observability needs.
- Translate business and technical requirements into platform configurations.
- Document and communicate observability solutions and outcomes.
Deliver observability outcomes that support operational excellence
- Develop actionable insights from observability data.
- Support root cause analysis and post-incident reviews.
- Contribute to continuous improvement initiatives based on observability findings.
Work flexibly outside regular office hours as required
- Participate in on-call rotations or scheduled off-hour support.
- Provide support during critical incidents, maintenance windows, or upgrades.
- Ensure observability coverage during weekends and holidays when needed.
Manage SLA and incident response processes
- Monitor SLA adherence and generate performance reports.
- Coordinate incident response activities and ensure timely resolution.
- Maintain incident logs and support post-incident analysis and reporting.
Support service delivery processes aligned with ITIL practices
- Ensure observability tools and practices are integrated into the overall ITIL framework to enhance service reliability and responsiveness.
- Participate in capacity planning and performance tuning.
- Collaborate on automation and self-healing initiatives to improve system resilience.
Maintain documentation and compliance with operational standards
- Document observability configurations, procedures, and incident responses.
- Ensure alignment with internal policies and regulatory requirements.
- Support audit and compliance reviews related to monitoring and observability.
JOB REQUIREMENTS
- Possess a bachelor’s degree in computer science, information technology, or an equivalent field, recognised by the Government, from a local or overseas institution of higher learning, with a minimum CGPA of 3.00.
- Minimum 2 years of experience in observability, monitoring, or IT operations.
- Hands-on experience with observability platforms and tools.
- Strong understanding of IT infrastructure, applications, and cloud environments.
- Programming skills in languages such as Python, Bash, or PowerShell for automation, scripting, and integration.
- Experience with automation tools such as Ansible or equivalent.
- Strong analytical and troubleshooting skills.
- Experience in SLA and incident management processes, and in service delivery practices aligned with ITIL.
- Willingness to work outside standard office hours (e.g., weekends, public holidays, and the early hours of the morning).
- Excellent attention to detail in monitoring, documentation, and reporting.
- Ability to work across teams and communicate effectively with stakeholders.
- Malaysian citizen.
- Have obtained a pass in Bahasa Melayu, including the oral test, at Sijil Pelajaran Malaysia (SPM) level or an equivalent qualification recognised by the Government.
COMPETENCIES
- Technical Proficiency: Strong grasp of observability tools, IT stack components, and cloud environments.
- Automation Capability: Ability to develop scripts and tools to enhance observability and operational efficiency.
- Operational Excellence: Strong focus on reliability, uptime, and process discipline.
- Problem Solving: Analytical mindset for diagnosing and resolving issues.
- Collaboration: Ability to work effectively with cross-functional teams.
- Communication: Clear and concise communication with technical and non-technical stakeholders.
- Process Discipline: Commitment to documentation, compliance, and continuous improvement.
- Detail Orientation: High level of accuracy in monitoring, reporting, and configuration.
- Adaptability: Comfortable working in dynamic and high-pressure environments.
- Responsiveness: Quick to act during incidents and off-hour support needs.
ADDITIONAL SKILLS
- Familiarity with cloud and hybrid infrastructure environments.
- Experience with cloud-native observability tools (e.g., AWS CloudWatch, Azure Monitor).
- Experience with REST APIs and data integration techniques.
- Familiarity with ITIL or other service management frameworks.
JOB PLACEMENT
Data Centre Management, Infrastructure Operations, Digital Infrastructure Department
JOB STATUS
Permanent
All applications are strictly CONFIDENTIAL, and only shortlisted candidates will be called for an interview. Applications are deemed UNSUCCESSFUL if there is no feedback from the EPF within 2 MONTHS after the closing date of the advertisement.