Senior Site Reliability Engineer

Razer

Kuala Lumpur

On-site

MYR 150,000 - 200,000

Full time

30+ days ago

Job summary

A global tech company in Kuala Lumpur is seeking a Senior Site Reliability Engineer to enhance cloud infrastructure reliability using AWS and Infrastructure as Code tools like Terraform. Candidates should have a Bachelor's degree and at least 3 years of experience in SRE or DevOps environments. Strong troubleshooting skills and expertise in cloud services are essential.

Qualifications

Minimum 3 years of experience in SRE, DevOps, or related roles.
Hands-on expertise with AWS Cloud Services.
Proficiency in at least one programming/scripting language.

Responsibilities

Design and implement Infrastructure as Code solutions.
Collaborate with teams to build scalable cloud infrastructure.
Lead architecture design sessions focused on reliability and performance.

Skills

AWS Cloud Services

Infrastructure as Code (IaC)

Python

Troubleshooting

Distributed Systems

Education

Bachelor’s degree in Computer Science or related field

Tools

Terraform

CloudFormation

Linux

Docker

Overview

Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a global team spread across 5 continents. Razer is also a great place to work, providing you the unique, gamer-centric #LifeAtRazer experience that will put you on a path of accelerated growth, both personally and professionally.

Responsibilities

We are looking for a Senior Site Reliability Engineer (SRE) to join the infrastructure and platform engineering team. The ideal candidate has hands-on experience in AWS, strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

Qualifications (Requirements)

Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
Minimum 3 years of experience in SRE, DevOps, cloud infrastructure, or system administration roles.
Hands-on expertise with AWS Cloud Services, including:
- Compute & Containerization: EC2, Lambda, ECS, EKS, Auto Scaling
- Networking: Load Balancers, VPC, Route 53, Security Groups, Firewalls
- Storage & Databases: RDS, ElastiCache, Athena, S3
- Messaging: SQS, SES
Deep understanding of Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
Proficiency in at least one programming/scripting language: Python, Node.js, Bash, Ruby, or related.
Experience operating and troubleshooting across Linux, Windows, and container-based environments.
Strong understanding of distributed systems, cloud networking (routers, switches), firewalls, DNS, and HTTP/TLS.
Experience implementing monitoring and alerting systems and working with incident management processes.
Experience with Zero Downtime Deployments, blue/green or canary deployments.
Familiarity with cost optimization and right-sizing AWS resources.
Exposure to multi-region, multi-account AWS architecture.
Understanding of API gateways or edge networking (e.g., Akamai, CloudFront).

Job Description

Design, implement, and maintain Infrastructure as Code (IaC) solutions using Terraform and/or CloudFormation across multi-account AWS environments.
Collaborate with developers, architects, and DevOps teams to build scalable, secure, and observable cloud infrastructure.
Lead and participate in architecture design sessions, focusing on system reliability, scalability, security, and performance.
Implement and manage robust monitoring, alerting, and observability solutions (e.g., CloudWatch, Prometheus, ELK, Datadog).
Set and monitor Key Performance Indicators (KPIs) for system uptime, latency, throughput, and overall reliability.
Drive incident response processes, including coordination, triaging, resolution, documentation, and post-incident reviews (PIRs).
Supervise and mentor junior SREs and infrastructure engineers, fostering knowledge-sharing and team growth.
Collaborate across development, operations, and security teams to ensure secure and compliant deployments.
Automate manual tasks and workflows through scripting and tooling (Python, Node.js, Bash, Ruby, JSON/YAML).
Troubleshoot complex infrastructure issues across Linux, Windows, Docker, and cloud-native environments.
Provide IaC and CI/CD best practices to ensure repeatability, scalability, and compliance across all environments.
Provide on-call support, participate in incident rotations, and lead technical investigations during outages or degradations.
Bring a strong understanding of, and experience with, Disaster Recovery (DR).

Work the 5:00 PM to 2:00 AM (UTC+8) support shift to ensure continuous SRE coverage. An initial familiarization period during regular working hours precedes the transition to the designated shift. Provide support for, and resolve, assigned incidents and tickets.

Pre-Requisites

Are you game?
