Site Reliability Engineer

Seekers Malaysia

Selangor

On-site

MYR 60,000 - 90,000

Full time

4 days ago

Job summary

A dynamic technology company in Malaysia is seeking a Site Reliability Engineer to optimize and monitor production systems for AI and gaming clients. This role includes responsibilities for system architecture, performance metrics, and operational collaboration. Candidates should have a technical degree and experience in cloud computing environments, along with strong problem-solving and communication skills. Familiarity with Kubernetes, Docker, and AWS is essential. The position offers a challenging and collaborative environment.

Qualifications

Bachelor's degree in a relevant field.
Experience in operations and maintenance in cloud computing or AI environments.
Strong understanding of system architecture and performance monitoring.

Responsibilities

Monitor and respond to faults in the production system.
Drive resolution of operational issues with the business team.
Maintain documentation of system architecture and processes.

Skills

Performance monitoring

Collaboration

Problem-solving

Communication

Education

Bachelor's degree in Computer Science, Engineering, or related field

Tools

Kubernetes

Docker

AWS

The Site Reliability Engineer plays a critical role in monitoring, troubleshooting, and optimizing our production system to ensure the highest levels of performance and stability for our AI and gaming customers worldwide.

Key Responsibilities

Monitor, Review, and Respond to Faults: Monitor and review the production system, respond to faults, troubleshoot and resolve them, and then optimize the system.
System Architecture and Performance: Continuously monitor and review the system architecture, process logic, system performance, stability, and other technical indicators to confirm they remain sound.
Coordination with Business Team: Work with the business team to drive resolution of any operations and maintenance issues.
Production Failure Response: Respond promptly to production failures, acting as the overall coordinator for resolution.
Collaborative Problem-Solving: Organize relevant R&D, operations and maintenance, and product teams to collaboratively investigate and resolve problems.
Failure Response Time: Own failure response and resolution times, ensuring issues are resolved promptly.
Case Studies and Optimization: Conduct case studies on production issues and follow up with optimizations to improve system performance and stability.
Documentation: Maintain comprehensive documentation of system architecture, processes, and troubleshooting procedures.
Continuous Improvement: Identify areas for improvement in the operations and maintenance processes and implement necessary changes.

Skills & Experience

Bachelor's degree in Computer Science, Engineering, or a related field.
Experience in operations and maintenance development, preferably in a cloud computing or AI-focused environment.
Strong understanding of system architecture, performance monitoring, and troubleshooting methodologies.
Excellent communication and collaboration skills.
Ability to work in a fast-paced startup environment.
Proficiency in Kubernetes (K8s), CI/CD, and Docker.
Expertise in at least one of AWS (VPC, S3, EC2, etc.) or Python.
Experience building an operations and maintenance infrastructure platform and handling core business operations.
Management experience is a plus, but not required.
Prior experience working in structured environments such as Huawei, ZTE, or banking institutions is preferred.
