Site Reliability Engineer

Seekers Malaysia

Selangor

On-site

MYR 60,000 - 90,000

Full time

4 days ago

Job summary

A technology company in Malaysia is seeking a Site Reliability Engineer to monitor and optimize production systems for AI and gaming clients. The role covers system architecture, performance metrics, and operational collaboration. Candidates should have a technical degree and experience in cloud computing environments, along with strong problem-solving and communication skills. Proficiency in Kubernetes and Docker is essential, along with expertise in either AWS or Python. The position offers a challenging and collaborative environment.

Qualifications

  • Bachelor's degree in a relevant field.
  • Experience in operations and maintenance in cloud computing or AI environments.
  • Strong understanding of system architecture and performance monitoring.

Responsibilities

  • Monitor and respond to faults in the production system.
  • Drive resolution of operational issues with the business team.
  • Maintain documentation of system architecture and processes.

Skills

Performance monitoring
Collaboration
Problem-solving
Communication

Education

Bachelor's degree in Computer Science, Engineering, or related field

Tools

Kubernetes
Docker
AWS

Job description

The Site Reliability Engineer plays a critical role in monitoring, troubleshooting, and optimizing our production system to ensure the highest levels of performance and stability for our AI and gaming customers worldwide.

Key Responsibilities
  • Monitor, Review, and Respond to Faults: Monitor the production system, review and respond to faults, troubleshoot and resolve them, and then optimize the system.
  • System Architecture and Performance: Continuously monitor and review the system architecture, process logic, system performance, stability, and other technical indicators to ensure they remain sound.
  • Coordination with Business Team: Drive resolution of operations and maintenance issues together with the business team.
  • Production Failure Response: Respond promptly to production failures, acting as the overall coordinator for resolution.
  • Collaborative Problem-Solving: Organize relevant R&D, operations and maintenance, and product teams to collaboratively investigate and resolve problems.
  • Failure Response Time: Own failure response and resolution times, ensuring issues are resolved promptly.
  • Case Studies and Optimization: Conduct case studies on production issues and follow up with optimizations to improve system performance and stability.
  • Documentation: Maintain comprehensive documentation of system architecture, processes, and troubleshooting procedures.
  • Continuous Improvement: Identify areas for improvement in the operations and maintenance processes and implement necessary changes.

Skills & Experience
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • Experience in operations and maintenance development, preferably in a cloud computing or AI-focused environment.
  • Strong understanding of system architecture, performance monitoring, and troubleshooting methodologies.
  • Excellent communication and collaboration skills.
  • Ability to work in a fast-paced, startup environment.
  • Proficiency in Kubernetes (K8S), CI/CD, and Docker.
  • Expertise in either AWS (VPC, S3, EC2, etc.) or Python.
  • Able to take responsibility for building the operations and maintenance infrastructure platform and for handling core business operations.
  • Management experience is a plus, but not required.
  • Prior experience working in structured environments such as Huawei, ZTE, or banking institutions is preferred.