
A leading technology firm in Kuala Lumpur is seeking a Site Reliability Engineer to ensure high availability and reliability of IT systems and support cloud operations. The ideal candidate will have over 4 years of experience in system administration, hands-on knowledge of cloud platforms, and strong troubleshooting skills. Responsibilities include implementing security controls, managing data storage and backup, and optimizing observability tools. Excellent communication skills in English and Bahasa Malaysia are required for effective collaboration in a fast-paced environment.
Ensure high availability and reliability of IT systems, applications, and PCI DSS-certified data centres, supporting both internal operations and client-facing platforms.
Perform system administration (Linux and Windows servers), including installation, configuration, patching, monitoring, and performance tuning.
Manage data storage, backup, and disaster recovery planning (DRP) to ensure data integrity, resilience, and compliance with industry standards.
Conduct capacity planning and lifecycle management of infrastructure resources, ensuring optimal performance and scalability.
Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve reliability.
Implement chaos testing and fault-injection practices to proactively identify weaknesses and improve system resilience.
Optimize observability and alerting systems (e.g., Prometheus, Grafana, ELK, Nagios or equivalent) to ensure actionable insights and minimal alert fatigue.
Implement and maintain system and network security controls, including firewall management, VPN, identity/access management, and endpoint security.
Ensure compliance with BNM RMiT, PCI DSS, and ISO 27001 standards, supporting internal and external audits.
Manage system logs and integrate with SIEM platforms to strengthen monitoring and incident response capabilities.
Support vulnerability management programs by coordinating with Security Operations teams for timely patching and remediation.
Participate in risk assessment and security architecture reviews, ensuring SRE practices align with compliance requirements.
Support and optimize hybrid cloud environments (AWS, Azure, GCP) to align with Finexus’ cloud strategy and cost efficiency.
Deploy, configure, and maintain Kubernetes clusters (SUSE Rancher Prime) and containerized workloads to improve scalability and reliability.
Build and maintain CI/CD pipelines for automated deployment, testing, and operational efficiency.
Automate configuration and patch management using tools such as Ansible, Puppet, or equivalent.
Implement Infrastructure as Code (IaC) using Terraform or equivalent for consistent and auditable environment provisioning.
Develop auto-healing and self-recovery automation scripts to reduce manual interventions and mean time to recovery (MTTR).
Implement cost optimization and performance monitoring for cloud and container workloads.
Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services to ensure smooth operations.
Support virtualization platforms (e.g., Proxmox) and physical server infrastructure within Finexus data centres.
Integrate network observability tools for real-time visibility into latency, bandwidth, and routing anomalies.
Collaborate on zero-trust network segmentation and service mesh integration for improved security and reliability.
Provide on-call support on a rotational basis for production issues and incidents, ensuring rapid resolution and minimal downtime.
Collaborate with application, database, and security teams to deliver reliable, compliant, and high-performance services for clients.
Lead post-incident reviews (PIRs) and blameless retrospectives to identify root causes and preventive actions.
Maintain runbooks and operational documentation to streamline response and improve knowledge transfer.
Leverage AIOps or event-correlation tools to enhance proactive incident detection and reduce false positives.
Bachelor’s or Master’s Degree in Computer Science, Information Technology, Engineering, or related field.
4+ years of experience in Site Reliability Engineering, System Administration, or IT Infrastructure.
Proven experience in Linux and Windows system administration.
Hands‑on experience with cloud operations (AWS, Azure, GCP) and container orchestration (Kubernetes, Rancher).
Strong knowledge of networking, firewalls, DNS, DHCP, VPN, and enterprise security best practices.
Experience in database management (MySQL, PostgreSQL, or equivalent), including backup, tuning, and recovery.
Knowledge of compliance frameworks (PCI DSS, ISO 27001, BNM RMiT) is highly desirable.
Strong problem‑solving and troubleshooting skills in mission‑critical environments.
Excellent communication skills in English and Bahasa Malaysia (spoken and written).
Ability to work independently and collaboratively in a fast‑paced, regulated technology environment.
Experience with SRE toolchains: Prometheus, Grafana, ELK, Terraform, Ansible, Jenkins, GitLab CI/CD, or equivalent.
Relevant certifications, such as AWS Certified SysOps Administrator, RHCE, Certified Kubernetes Administrator (CKA), or ISO 27001 Implementer, will be considered an added advantage.