Site Reliability Engineer

Finexus Sdn Bhd

Kuala Lumpur

On-site

MYR 80,000 - 120,000

Full time

Today

Job summary

A leading technology firm in Kuala Lumpur is seeking a Site Reliability Engineer to ensure high availability and reliability of IT systems and support cloud operations. The ideal candidate will have over 4 years of experience in system administration, hands-on knowledge of cloud platforms, and strong troubleshooting skills. Responsibilities include implementing security controls, managing data storage and backup, and optimizing observability tools. Excellent communication skills in English and Bahasa Malaysia are required for effective collaboration in a fast-paced environment.

Qualifications

4+ years of experience in Site Reliability Engineering, System Administration, or IT Infrastructure.
Hands-on experience with cloud operations and container orchestration.
Excellent communication skills in English and Bahasa Malaysia.

Responsibilities

Ensure high availability and reliability of IT systems and applications.
Implement security controls and ensure compliance with industry standards.
Administer and troubleshoot DNS, DHCP, VPN, and core network services.

Skills

Linux system administration

Windows system administration

Cloud operations (AWS, Azure, GCP)

Container orchestration (Kubernetes)

Networking knowledge

Firewalls and VPN management

Database management (MySQL, PostgreSQL)

Problem-solving skills

Communication skills in English and Bahasa Malaysia

Education

Bachelor’s or Master’s Degree in Computer Science, Information Technology, Engineering, or related field

Tools

Terraform

Ansible

Prometheus

Grafana

Jenkins

GitLab CI/CD

Responsibilities

Ensure high availability and reliability of IT systems, applications, and PCI DSS-certified data centres, supporting both internal operations and client-facing platforms.

Perform system administration (Linux and Windows servers), including installation, configuration, patching, monitoring, and performance tuning.

Manage data storage, backup, and disaster recovery planning (DRP) to ensure data integrity, resilience, and compliance with industry standards.

Conduct capacity planning and lifecycle management of infrastructure resources, ensuring optimal performance and scalability.

Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets to measure and improve reliability.

Implement chaos testing and fault-injection practices to proactively identify weaknesses and improve system resilience.

Optimize observability and alerting systems (e.g., Prometheus, Grafana, ELK, Nagios or equivalent) to ensure actionable insights and minimal alert fatigue.

Security & Compliance

Implement and maintain system and network security controls, including firewall management, VPN, identity/access management, and endpoint security.

Ensure compliance with BNM RMiT, PCI DSS, and ISO 27001 standards, supporting internal and external audits.

Manage system logs and integrate with SIEM platforms to strengthen monitoring and incident response capabilities.

Support vulnerability management programs by coordinating with Security Operations teams for timely patching and remediation.

Participate in risk assessment and security architecture reviews, ensuring SRE practices align with compliance requirements.

Cloud, Containerization & Automation

Support and optimize hybrid cloud environments (AWS, Azure, GCP) in line with Finexus’ cloud strategy and cost-efficiency goals.

Deploy, configure, and maintain Kubernetes clusters (SUSE Rancher Prime) and containerized workloads to improve scalability and reliability.

Build and maintain CI/CD pipelines for automated deployment, testing, and operational efficiency.

Automate configuration and patch management using tools such as Ansible, Puppet, or equivalent.

Implement Infrastructure as Code (IaC) using Terraform or equivalent for consistent and auditable environment provisioning.

Develop auto-healing and self-recovery automation scripts to reduce manual interventions and mean time to recovery (MTTR).

Implement cost optimization and performance monitoring for cloud and container workloads.

Networking & Core Services

Administer and troubleshoot DNS, DHCP, VPN, load balancers, and core network services to ensure smooth operations.

Support virtualization platforms (e.g., Proxmox) and physical server infrastructure within Finexus data centres.

Integrate network observability tools for real-time visibility into latency, bandwidth, and routing anomalies.

Collaborate on zero-trust network segmentation and service mesh integration for improved security and reliability.

Monitoring & Support

Provide on-call support on a rotational basis for production issues and incidents, ensuring rapid resolution and minimal downtime.

Collaborate with application, database, and security teams to deliver reliable, compliant, and high-performance services for clients.

Lead post-incident reviews (PIRs) and blameless retrospectives to identify root causes and preventive actions.

Maintain runbooks and operational documentation to streamline response and improve knowledge transfer.

Leverage AIOps or event-correlation tools to enhance proactive incident detection and reduce false positives.

Job Requirements

Bachelor’s or Master’s Degree in Computer Science, Information Technology, Engineering, or related field.

4+ years of experience in Site Reliability Engineering, System Administration, or IT Infrastructure.

Proven experience in Linux and Windows system administration.

Hands-on experience with cloud operations (AWS, Azure, GCP) and container orchestration (Kubernetes, Rancher).

Strong knowledge of networking, firewalls, DNS, DHCP, VPN, and enterprise security best practices.

Experience in database management (MySQL, PostgreSQL, or equivalent), including backup, tuning, and recovery.

Knowledge of compliance frameworks (PCI DSS, ISO 27001, BNM RMiT) is highly desirable.

Strong problem-solving and troubleshooting skills in mission-critical environments.

Excellent communication skills in English and Bahasa Malaysia (spoken and written).

Ability to work independently and collaboratively in a fast-paced, regulated technology environment.

Experience with SRE toolchains: Prometheus, Grafana, ELK, Terraform, Ansible, Jenkins, GitLab CI/CD, or equivalent.

Relevant certifications, such as AWS Certified SysOps Administrator, RHCE, Certified Kubernetes Administrator (CKA), or ISO 27001 Implementer, will be considered an added advantage.
