Remote Software Engineer - Distributed ML Training

Gensyn

Remote

GBP 70,000 - 90,000

Full time

Yesterday

Job summary

A leading machine learning compute protocol company in Greater London is seeking a talented individual to design and implement systems for machine learning execution. Candidates should have a strong background in distributed systems and networking, and a willingness to learn Rust. The role offers a competitive salary, fully remote work, and generous benefits, including company retreats and comprehensive health insurance for eligible employees.

Benefits

Competitive salary + share of equity and token pool
Fully remote work
Relocation assistance
4x all-expenses-paid company retreats per year
Whatever equipment you need
Paid sick leave
Private health, vision, and dental insurance

Qualifications

  • Experience in designing and/or working with training systems on large clusters.
  • Strong understanding of, and troubleshooting experience with, common networking protocols.
  • Familiarity with large open source codebases as either maintainer or contributor.
  • Willingness to learn Rust, as the company is Rust by default.
  • Broad knowledge of algorithms and data structures.

Responsibilities

  • Design and implement orchestration systems for ML execution.
  • Continually profile and optimise training algorithms.
  • Build out newly proposed mechanisms and algorithms for previously unsolved problems.
  • Collaborate on wider ML engineering issues.
  • Contribute to technical reports and community discussions.

Skills

Distributed foundation model training
Networking protocols (IP, TCP, UDP, HTTP)
Open source contribution
Rust programming
Computer science background
Communication skills
Applied research environment experience

Education

Computer Science degree

Job description

The world will be unrecognisable in 5 years.

Machine learning models are driving our cars, testing our eyesight, detecting our cancer, giving sight to the blind, giving speech to the mute, and dictating what we consume, enjoy, and think. These AI systems are already an integral part of our lives and will shape our future as a species.

Soon, we'll conjure unlimited content: from never-ending TV series (where we’re the main character) to personalised tutors that are infinitely patient and leave no student behind. We’ll augment our memories with foundation models—individually tailored to us through RLHF and connected directly to our thoughts via Brain-Machine Interfaces—blurring the lines between organic and machine intelligence and ushering in the next generation of human development.

This future demands immense, globally accessible, uncensorable, computational power. Gensyn is the machine learning compute protocol that translates machine learning compute into an always-on commodity resource—outside of centralised control and as ubiquitous as electricity—accelerating AI progress and ensuring that this revolutionary technology is accessible to all of humanity through a free market.

Our Principles:
AUTONOMY
  • Don’t ask for permission - we have a constraint culture, not a permission culture.
  • Claim ownership of any work stream and set its goals/deadlines, rather than waiting to be assigned work or relying on job specs.
  • Push & pull context on your work rather than waiting for information from others and assuming people know what you’re doing.
  • No middle managers - we don’t (and will likely never) have middle managers.
FOCUS
  • Small team - misalignment and politics scale super-linearly with team size. Small protocol teams rival much larger traditional teams.
  • Thin protocol - build and design thinly.
  • Reject waste - guard the company’s time, rather than wasting it in meetings without clear purpose/focus, or bikeshedding.
REJECT MEDIOCRITY
  • Give direct feedback to everyone immediately rather than avoiding unpopularity, expecting things to improve naturally, or trading short-term pain for extreme long-term pain.
  • Embrace an extreme learning rate rather than assuming limits to your ability/knowledge.
  • Drive your areas of ownership to their final outcome, despite any barriers.
Responsibilities
  • Design and implement systems for orchestrating ML execution - enable training across our uniquely decentralised and heterogeneous infrastructure.
  • Performance optimisation - continually profile and optimise our training algorithms.
  • Implement novel research - build out newly proposed mechanisms and algorithms to solve never-tackled-before problems.
  • Engineering support - work with the rest of the team on wider issues concerning ML (e.g. reproducible training).
  • Write & engage - contribute to technical reports/papers describing the system and discuss with the community.
Minimum requirements
  • Hands-on distributed foundation model training - experience designing and/or working with training systems on large clusters.
  • Networking - understanding of, and troubleshooting experience with, the most common networking protocols (IP, TCP, UDP, HTTP), plus experience with communication backends, e.g. NCCL, Gloo, and MPI.
  • Open source work - experience working with large open source codebases - either as maintainer or trusted contributor.
  • Strong willingness to learn Rust - as a Rust-by-default company, we require that everyone learns Rust so that they have context and can work across the entire codebase.
  • Computer science background - understanding of computational complexity (time, space) and broad knowledge of algorithms and data structures.
  • Highly self-motivated with excellent verbal and written communication skills.
  • Comfortable working in an applied research environment - with extremely high autonomy and unpredictable timelines.
Nice to haves
  • Rust - strong experience with systems programming in Rust (you know what a 'lifetime' is and understand the purpose of Pin).
  • Research background - published research in the distributed systems or ML domains.
  • Blockchain - understanding of blockchain fundamentals.
Compensation / Benefits:
  • Competitive salary + share of equity and token pool.
  • Fully remote work - we hire between the West Coast (PT) and Central Europe (CET) time zones.
  • Relocation assistance - available for those who would like to move to a different location after being hired.
  • 4x all expenses paid company retreats around the world, per year.
  • Whatever equipment you need.
  • Paid sick leave.
  • Private health, vision, and dental insurance - including spouse/dependents [🇺🇸 only].