[Mandarin-speaking Role] Senior Operations Engineer (SRE/AI Platform)
- Compensation: RM10,000 - RM15,000
Job Highlights
- Join the international team of a leading global AI infrastructure service provider to build and operate cutting-edge AI platforms.
- Take end-to-end ownership of production environments for global users, directly impacting core service reliability and performance.
- Gain deep exposure to multi-cloud architecture, GPU computing, and automated operations in a high-impact role.
- Collaborate in a multicultural environment with engineering teams across China and North America while strengthening your bilingual technical communication skills.
Key Responsibilities
- End-to-End Service Ownership: Assume primary responsibility for the availability, latency, performance, and efficiency of AI infrastructure products (Model-API, Serverless, GPU Instances).
- Incident Management & Response: Act as the first responder for production incidents, perform root cause analysis (RCA), and implement preventive measures. Participate in an on-call rotation.
- Automation & Tooling: Design, build, and maintain automation scripts and tools to streamline operational tasks, deployments, and failure recovery.
- Monitoring & Alerting: Develop and refine monitoring and alerting systems (e.g., Prometheus/Grafana) to enable proactive issue detection.
- Infrastructure as Code (IaC): Manage and provision cloud infrastructure using IaC tools (e.g., Terraform, Ansible) to ensure consistency and repeatability.
- Performance & Cost Optimization: Continuously analyze system performance and resource utilization to identify bottlenecks and reduce costs across cloud platforms (AWS/GCP/Azure).
- Cross-Functional Collaboration: Work closely with engineering teams in China to understand new features, provide operational feedback, and ensure production readiness of new services.
Must-Have Requirements
- 5+ years of hands‑on experience in DevOps, SRE, or cloud operations, preferably in a tech or cloud service company.
- Expertise in at least one major cloud provider (AWS/GCP/Azure); practical experience with containerization and orchestration technologies (Docker/Kubernetes required).
- Proficiency in at least one scripting language (e.g., Python, Go, Shell); solid understanding of IaC tools like Terraform/Ansible.
- Hands‑on experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
- Systematic problem‑solving skills with the ability to troubleshoot complex distributed systems under pressure.
- Professional fluency in both English and Mandarin (written and spoken) for effective cross‑regional collaboration.
- Strong sense of ownership and self‑drive, with the ability to work independently in a remote/distributed team setting.
Nice to Have
- Experience with GPU‑accelerated computing.
- Knowledge of MLOps tools (e.g., Kubeflow, MLflow).
- Familiarity with serverless technologies and CI/CD pipelines.
By applying, you acknowledge that TT UKoffer Ltd may process your personal data for recruitment purposes under the lawful basis of legitimate interest. This includes sharing your CV with potential employers. We comply with UK GDPR regulations, and you may request data removal at any time by contacting apply@ttukoffer.co.uk.