Backend Software Engineer

COL Limited

Greater London

On-site

GBP 100,000 - 200,000

Full time


Job summary

A pioneering AI research organization is seeking a Backend Software Engineer to develop innovative tools for frontier AGI safety research, focusing on internal tooling and automated evaluation pipelines. You'll lead major feature development, collaborate closely with researchers, and advocate for strong software design practices. Ideal candidates bring over 5 years of professional experience in software engineering, preferably in Python, and have a passion for building impactful software. This is a full-time, in-person role in London, with a competitive salary and comprehensive benefits.

Benefits

Salary: 100k - 200k GBP
Flexible work hours
Unlimited vacation and sick leave
Lunch, dinner, and snacks provided
Yearly professional development budget of $1,000
Paid work trips and retreats

Qualifications

  • 5+ years of professional software engineering experience.
  • Experience leading successful software tools or products.
  • Background in LLM evaluations or agents is a plus.

Responsibilities

  • Prototype and iterate on internal tools for language model evaluations.
  • Lead major feature development from ideation to implementation.
  • Collaborate with researchers to address challenges.

Skills

Production-quality Python code
Collaboration
Software design practices
Debugging and implementation

Job description

Applications deadline: We accept submissions until 15 January 2026. We review applications on a rolling basis and encourage early submissions.

ABOUT THE OPPORTUNITY

We’re looking for Backend Software Engineers who are excited to build tools for frontier AGI safety research, e.g. building and maintaining evals libraries and tools for monitoring and controlling our own LLM traffic.

REPRESENTATIVE PROJECTS

Here are some example projects you might build and ship in your first six months:

  • Internal tooling for efficiently running and analyzing evaluations. For example, a tool that quickly investigates thousands of agentic eval runs in parallel and surfaces interesting information automatically
  • Automated evaluation pipelines to minimize the time from getting access to a new model for pre-deployment testing to analyzing the most important results and sharing them
  • Orchestration tools that allow researchers to run thousands of agentic evaluations in parallel on remote machines with high security and reliability
  • LLM proxy service that enables us to monitor all of our coding agent traffic in real time and identify undesired behavior automatically (in the spirit of Control)
  • LLM agents and MCP tools to automate internal software engineering and research tasks, with sandboxes to prevent major failures
  • CI pipeline optimisations to reduce execution time and eliminate flaky tests
  • Telemetry API and instrumentation of our existing tools, allowing us to monitor usage and improve reliability
  • Data warehousing pipeline and service to store thousands of eval transcripts which researchers can study and build datasets from
  • Upstream improvements to the Inspect framework and ecosystem, e.g. support for evaluating modern agentic scaffolds.
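As a loose illustration of the orchestration theme above (all names here are hypothetical, not Apollo's actual stack), fanning many eval runs out across a worker pool and collecting results as they finish might be sketched like this:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_eval(task_id: int) -> dict:
    # Stand-in for one agentic eval run (hypothetical): a real runner
    # would launch an agent on a remote machine and collect a transcript.
    return {"task_id": task_id, "passed": task_id % 3 == 0}

def run_all(task_ids, max_workers: int = 8) -> list[dict]:
    # Fan the runs out across a worker pool and gather results as
    # they complete, so one slow run doesn't block the rest.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_eval, t) for t in task_ids]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results

results = run_all(range(12))
num_passed = sum(r["passed"] for r in results)
```

A production version would add retries, per-run timeouts, and transcript persistence; the sketch only shows the fan-out/gather shape.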

KEY RESPONSIBILITIES

  • Rapidly prototype and iterate on internal tools and libraries for building and running frontier language model evaluations
  • Lead the development of major features from ideation to implementation
  • Collaboratively define and shape the software roadmap and priorities
  • Establish and advocate for good software design practices, codebase health, and coding agent practices
  • Work closely with researchers to understand the challenges they face
  • Assist researchers with implementation and debugging of research code
  • Communicate clearly about technical decisions and tradeoffs

KEY REQUIREMENTS

  • You must have experience writing production-quality Python code
We value candidates from diverse backgrounds and recognise that candidates may demonstrate their skills in different ways. For example, we might be impressed if you have:

  • Led the development of a successful software tool or product over an extended period (e.g. 1 year or more)
  • Started and built the tech stack for a company, e.g. in a start-up
  • Worked your way up in a large organisation, repeatedly gaining more responsibility and influencing a large part of the codebase
  • Authored and/or maintained a popular open-source tool or library
  • Placed in a prestigious programming competition (IOI, ICPC, etc.)
  • 5+ years of professional software engineering experience

Bonus

  • Experience working with LLM agents or LLM evaluations
  • Infosecurity / cybersecurity experience
  • Experience working with AWS
  • Interest in AI Safety

We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.

LOGISTICS

  • Start Date: Target of 2-3 months after the first interview
  • Time Allocation: Full-time
  • Location: The office is in London, right next to the London Initiative for Safe AI (LISA) offices. This is an in-person role. In rare situations, we may consider partially remote arrangements on a case-by-case basis.
  • Work Visas: We can sponsor UK visas

BENEFITS

  • Salary: 100k - 200k GBP (~135k - 270k USD)
  • Flexible work hours and schedule
  • Unlimited vacation
  • Unlimited sick leave
  • Lunch, dinner, and snacks are provided for all employees on workdays
  • Paid work trips, including staff retreats, business trips, and relevant conferences
  • A yearly $1,000 (USD) professional development budget

ABOUT APOLLO RESEARCH

The rapid rise in AI capabilities offers tremendous opportunities, but also presents significant risks. At Apollo Research, we're primarily concerned with risks from loss of control, i.e. risks coming from the model itself rather than from, e.g., humans misusing the AI. We're particularly concerned with deceptive alignment / scheming, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight. We work on the detection of scheming (e.g. building evaluations), the science of scheming (e.g. model organisms), and scheming mitigations (e.g. anti-scheming and control). We work closely with multiple frontier AI companies, e.g. to test their models before deployment or to collaborate on scheming mitigations.

At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you're interested in more details about what it's like working at Apollo, you can find more information here.

ABOUT THE TEAM

The SWE team currently consists of Rusheb Shah, Andrei Matveiakin, Alex Kedrik, and Glen Rodgers. Beyond the SWE team, you will closely interact with the research scientists and engineers as the primary user group of your tools. You can find our full team here.

Equality Statement

Apollo Research is an Equal Opportunity Employer. We value diversity and are committed to providing equal opportunities to all, regardless of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, or sexual orientation.

INTERVIEW PROCESS

Please complete the application form with your CV. A cover letter is optional. Please also feel free to share links to relevant work samples.

About the interview process: Our multi-stage process includes a screening interview, a take-home test (approx. 2 hours), three technical interviews, and a final interview with Marius (CEO). The technical interviews are closely related to tasks you would do on the job; there are no leetcode-style general coding interviews. If you want to prepare for the interviews, we suggest working on hands-on LLM evals projects (e.g. as suggested in our starter guide), such as building LM agent evaluations in Inspect.

Response times: We review applications on a rolling basis, so it might take a few weeks until you hear from us.
