Senior ML Systems Engineer: Training Frameworks & Tools

Cohere

Greater London

Hybrid

GBP 75,000 - 114,000

Full time

8 days ago

Job summary

A leading AI research organization in Greater London is seeking a Senior Engineer to improve its training frameworks for frontier-scale language models. The role involves developing distributed training solutions, optimizing performance across multi-node clusters, and keeping training systems stable and scalable. The ideal candidate has a strong background in distributed systems, strong coding skills, and a collaborative working style. Benefits include a flexible work environment, generous vacation, and health support.

Benefits

An open and inclusive culture
Weekly lunch stipend and snacks
Full health and dental benefits
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits
Remote-flexible work options
6 weeks of vacation

Qualifications

  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.

Responsibilities

  • Build and own the training framework responsible for large-scale LLM training.
  • Design distributed training abstractions to improve training throughput.
  • Collaborate closely with infrastructure teams to support high-performance training.
  • Investigate and resolve performance bottlenecks across the ML systems stack.

Skills

Strong engineering experience in large-scale distributed training or HPC systems
Familiarity with JAX internals
Experience with multi-node cluster orchestration
Comfort debugging performance issues
Experience working with containerized environments
Building tools that increase developer velocity
Strong collaboration skills

Tools

Docker
Slurm
Ray
Kubernetes