Site Reliability Engineer

Razer

Kuala Lumpur

On-site

MYR 200,000 - 250,000

Full time

30+ days ago

Job summary

A global gaming technology company located in Kuala Lumpur is looking for a skilled Site Reliability Engineer. The role involves managing scalable cloud infrastructure, primarily on AWS, using Terraform for Infrastructure as Code. Responsibilities include ensuring system reliability, conducting incident management, and automating operations to improve efficiency. Ideal candidates will have a strong background in AWS and scripting languages, along with a passion for building resilient systems.

Qualifications

Minimum 2 years of experience in SRE, DevOps, or Cloud roles.
Proficient in network fundamentals including DNS, HTTP(S), and TCP/IP.
Experience with monitoring tools like CloudWatch and Prometheus.

Responsibilities

Design and maintain Infrastructure as Code (IaC) using Terraform.
Lead architecture reviews focusing on reliability and performance.
Automate infrastructure operations to improve reliability.

Skills

AWS Cloud services

Infrastructure as Code with Terraform

CI/CD tools

Linux system administration

Scripting languages (Python, Bash, etc.)

Education

Bachelor’s degree in Computer Science or related field

Tools

Terraform

AWS CloudFormation

Docker

Kubernetes

Joining Razer will place you on a global mission to revolutionize the way the world games. Razer is a place to do great work, offering you the opportunity to make an impact globally while working with a team spread across 5 continents. Razer is also a great place to work, providing the unique, gamer-centric #LifeAtRazer experience that will put you on a path of accelerated growth, both personally and professionally.

Job Responsibilities

We are seeking a skilled and driven Site Reliability Engineer (SRE) to join our growing infrastructure and platform engineering team. The ideal candidate will have hands-on experience in Amazon Web Services (AWS), strong troubleshooting capabilities, and a passion for building scalable, observable, and resilient systems using modern Infrastructure as Code (IaC) and automation tools.

Requirements

Bachelor’s degree in Computer Science, Software Engineering, Information Technology, or a related field.
Minimum 2 years of experience in SRE, DevOps, Cloud Infrastructure, or Systems Administration roles.
Solid hands-on experience with AWS Cloud services including (but not limited to):
  Compute: EC2, Lambda, ECS, Auto Scaling
  Networking: VPC, Load Balancers, Route 53
  Messaging & Storage: SQS, S3, RDS, ElastiCache, SES
  Monitoring: CloudWatch, X-Ray
Proficient in Infrastructure as Code using Terraform and/or CloudFormation.
Experience with CI/CD tools (e.g., GitLab CI, Jenkins, CodePipeline, ArgoCD).
Strong understanding of Linux and Windows system administration and troubleshooting.
Comfortable with one or more scripting/programming languages such as Python, Node.js, Bash, or Ruby, and with data formats such as JSON/YAML, for automation.
Strong grasp of network fundamentals, including DNS, HTTP(S), TLS/SSL, firewalls, and TCP/IP.
Experience with containerization and orchestration (Docker, ECS, or Kubernetes is a plus).
Familiar with observability tools and incident management best practices.

Job Description (Detailed)

Design, develop, and maintain Infrastructure as Code (IaC) using tools like Terraform or AWS CloudFormation.
Implement and operate reliable, scalable cloud infrastructure primarily on AWS (e.g., EC2, ECS, RDS, S3, Lambda, ElastiCache, SQS, SES, Auto Scaling, Load Balancers).
Lead and participate in architecture reviews focusing on reliability, scalability, security, and performance.
Develop and manage robust monitoring, alerting, and logging solutions (e.g., CloudWatch, Prometheus, Grafana, ELK) to detect and resolve issues proactively.
Perform incident management, postmortems, root cause analysis, and implement continuous improvement strategies.
Collaborate with software engineering teams to improve CI/CD pipelines, deployment automation, and release management.
Automate infrastructure operations, reduce manual toil, and improve reliability using scripting (Python, Bash, Node.js, or Ruby).
Maintain and troubleshoot environments involving web servers, databases, firewalls, DNS, load balancers, and networking.
Ensure systems are compliant with security standards, including patching, hardening, and secure access policies.
Provide on-call support and participate in incident rotations.
Monitor and maintain service-level objectives (SLOs), SLAs, and error budgets to ensure reliability targets are met.
Support the 5:00 PM to 2:00 AM (UTC+8) shift to ensure continuous SRE coverage.
Undergo an initial familiarization period during regular working hours before transitioning to the designated shift.
Provide support and resolution for assigned incidents and tickets.

Pre-Requisites

Are you game?
