Site Reliability Engineer - AML Global Recommendation - USDS

TN United Kingdom

London

On-site

GBP 70,000 - 90,000

Full time

3 days ago

Job summary

A leading company is seeking a Site Reliability Engineer to join their AML Global Recommendation team in London. The role involves designing and maintaining robust systems, ensuring performance and reliability, and collaborating with software engineers. Ideal candidates will have strong programming skills and experience in large-scale systems.

Qualifications

  • At least 3 years of experience in SRE or software engineering roles.
  • Proficiency in programming languages such as C, C++, Python, or Go.

Responsibilities

  • Design, build, and maintain highly available, scalable, and fault-tolerant systems.
  • Monitor and analyze system performance, resolving issues proactively.
  • Collaborate with software engineering teams to ensure reliability.

Skills

Linux
Troubleshooting
Programming
Data Structures
Algorithms

Education

Bachelor's or Master's degree in Computer Science or Engineering

Tools

TensorFlow
PyTorch
MXNet
PaddlePaddle

Job description

Site Reliability Engineer - AML Global Recommendation - USDS, London

Client: TikTok
Location: London, United Kingdom
Job Category: Other
EU work permit required: Yes
Job Reference: f336e8e11feb
Job Views: 3
Posted: 17.05.2025
Expiry Date: 01.07.2025

Job Description:

About the Team: The Site Reliability Engineering (SRE) team for AML (Applied Machine Learning) combines system engineering and machine learning to develop and operate a large-scale AI/ML recommendation system for the United States and globally. As part of the SRE team, you'll sharpen your skills in coding, performance analysis, and managing large-scale systems. Join us to influence the future of AML systems and impact TikTok users worldwide.

Responsibilities:

  1. Design, build, and maintain highly available, scalable, and fault-tolerant systems.
  2. Monitor and analyze system performance, resolving issues proactively.
  3. Develop and maintain automated monitoring, alerting, and incident response systems.
  4. Collaborate with software engineering teams to ensure reliability, scalability, and performance in application design.
  5. Implement security best practices and ensure regulatory compliance.
  6. Participate in on-call rotations, responding to incidents during and outside business hours.
  7. Conduct root cause analysis, hold post-mortem reviews, and implement preventative measures.

Minimum Qualifications:

  • At least 3 years of experience in SRE or software engineering roles.
  • Expertise in troubleshooting Linux-based distributed systems.
  • Bachelor's or Master's degree in Computer Science or Engineering.
  • Proficiency in programming languages such as C, C++, Python, or Go.
  • Strong understanding of data structures and algorithms.
  • Knowledge of relational database systems.

Preferred Qualifications:

  • Experience designing and maintaining large-scale systems.
  • Understanding of code optimization and automation.
  • Proficiency in machine learning frameworks like TensorFlow, PyTorch, MXNet, or PaddlePaddle.