Site Reliability Engineer - ARK Large Model Platform (Singapore)

ByteDance

Singapore

On-site

SGD 80,000 - 120,000

Full time

Yesterday

Job summary

A global technology company in Singapore is seeking a Site Reliability Engineer to develop and maintain the Ark Large Model Platform. The role involves managing large-scale systems and ensuring their stability through effective DevOps practices. Ideal candidates hold a degree in Computer Science or a related field and are proficient in cloud-native technologies, with professional experience in Golang, Python, or Java.

Qualifications

  • R&D experience in cloud computing or large-scale model systems.
  • Professional proficiency in one of Golang, Python, or Java.
  • Experience operating and maintaining large-scale systems.

Responsibilities

  • Develop the Ark Large Model Platform on Volcano Engine.
  • Manage stability of large-scale model systems using DevOps practices.
  • Develop observability systems for monitoring the stability of large model systems.

Skills

Cloud-native technologies
Golang
Python
Java
Infrastructure as code

Education

B.Sc. or higher degree in Computer Science

Tools

Terraform

Job description

Job Code: A118803
Location: Singapore
Team: Applied Machine Learning (AML)-Enterprise
Employment Type: Regular

Responsibilities
  • Develop the Ark Large Model Platform on Volcano Engine and research systematic solutions for implementing and applying large models across various industries.
  • Manage the stability of both the control and data aspects of large-scale model systems through effective DevOps practices.
  • Develop and enhance observability systems for monitoring the stability of large model systems, ensuring high reliability and performance.
  • Manage very large-scale clusters and ensure the efficient operation and maintenance of large model systems.

Qualifications
  • Minimum: B.Sc. or higher degree in Computer Science or a related field, with R&D experience in cloud computing or large-scale model systems.
  • Proficiency in cloud‑native technologies and understanding of the relevant technology stack.
  • Professional proficiency in at least one of the following programming languages: Golang, Python, or Java.
  • Familiarity with cloud‑native technologies for log collection, monitoring, and alerting.
  • Preferred: Prior experience in constructing and maintaining stability systems for large‑scale infrastructures.
  • Experience operating and maintaining large‑scale systems.
  • Experience with infrastructure as code, particularly Terraform, is highly desirable.

About ByteDance

Founded in 2012, ByteDance’s mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut, and Pico, as well as platforms specific to the China market, ByteDance has made it easier and more fun for people to connect, consume, and create content.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance’s mission. Our innovative products are built to help people authentically express themselves, discover, and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity, and enrich life – a mission we work towards every day.

Diversity & Inclusion

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. We celebrate our diverse voices and strive to create an environment that reflects the many communities we reach.
