Enable job alerts via email!

Evaluations Engineer

COL Limited

City Of London

On-site

GBP 100,000 - 200,000

Full time

17 days ago

Generate a tailored resume in minutes

Land an interview and earn more. Learn more

Job summary

A leading research organization in London is hiring Evaluation Engineers to oversee evaluation campaigns for cutting-edge AI models. The role involves automating pipelines, improving evaluation processes, and working closely with frontier labs. Ideal candidates will have a strong background in Python and data analysis, and be passionate about AI model testing. This full-time, in-person position offers a competitive salary, flexible hours, and numerous benefits.

Benefits

Flexible work hours and schedule

Unlimited vacation

Lunch, dinner, and snacks provided

Paid work trips and conferences

Yearly professional development budget

Qualifications

Strong software engineering experience in Python.
Comfortable with quantitative analysis and qualitative assessment.
Ability to convey findings succinctly to various audiences.

Responsibilities

Run and own evaluation campaigns for unreleased models.
Automate evaluation pipeline and improve infrastructure.
Develop a larger vision for evaluation processes.

Skills

Software engineering skills

Process optimisation

Data Analysis & Pattern Recognition

Writing and communication

AI power-user

Overview

Applications deadline: We're accepting applications until 03 January 2026. We encourage early submissions and will start interviews in December 2025.

About the Opportunity

We’re looking for Evaluation Engineers who will run and own “evaluation campaigns” (pre-deployment testing for unreleased frontier models), build out our evaluation infrastructure, and automate the evals pipeline.

You will get to work with frontier labs like OpenAI, Anthropic, and Google DeepMind and be amongst the first to interact with new models before anyone else.

The ideal candidate loves rigorously testing frontier AI models, and enjoys building efficient pipelines and automating them.

Key Responsibilities

Run and own “evaluation campaigns”: We run an evals campaign approximately every two weeks with thousands of runs across hundreds of distinct environments. Our typical workflow for this process includes running all of our evaluations at scale, going through the resulting transcripts quickly using LLM assistance to find the most interesting evidence for the capabilities and propensities of the model, finding novel AI behaviors that no one else has ever observed before, e.g. the non-standard language described in our anti-scheming paper, diving deeper into this evidence and running targeted follow-up experiments, compiling the most interesting pieces of evidence and sharing them with the AI developer, and engaging with feedback and answering questions from the AI developer.
Automate the evaluation pipeline: by improving our infrastructure, building more efficient processes, building and improving agentic workflows that quickly scan the results and provide preliminary conclusions, and more. We’re already using automation across all parts of the pipeline, i.e. building, running and analyzing the evals. This includes both classic pipeline automation procedures as well as LLM-based workflows.
Improve our evaluations: by building new and better evaluations for frontier risks or including publicly available evaluations.
Develop a larger vision for what the best possible evaluation pipeline should look like a year from now.

Key Requirements

We don’t require a formal background or industry experience and welcome self-taught candidates.
Software engineering skills: Our entire stack uses Python. We\'re looking for candidates with strong software engineering experience. Ideally, you have experience shipping and maintaining production Python code, and know how to factor messy problems into clean abstractions that others can use and extend.
Process optimisation: You always try to improve workflows. Pre-deployment evaluations are very fast paced so ideally you love shaving friction off your workflows wherever possible.
Data Analysis & Pattern Recognition: You can extract signal from large, messy datasets. You\'re comfortable with quantitative analysis and know when qualitative assessment is more appropriate. You can identify anomalies and unexpected model behaviors.
Writing and communication: You succinctly convey qualitative and quantitative findings to a technical and non-technical audience.
AI power-user: You are curious about the capabilities and propensities of frontier AI models. You have experience using different models, know which ones to use for which tasks, when not to use AI, and you always experiment with new AI workflows
(Bonus) We are using Inspect as our primary evals framework, and we value experience with it.
We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position, nonetheless, are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.

Logistics

Start Date: Target of 2-3 months after the first interview
Time Allocation: Full-time
Location: The office is in London, and the building is shared with the London Initiative for Safe AI (LISA) offices.
This is an in-person role. In rare situations, we may consider partially remote arrangements on a case-by-case basis.
Work Visas: We can sponsor UK visas

Benefits

Salary: 100k - 200k GBP (~135k - 270k USD)
Flexible work hours and schedule
Unlimited vacation
Unlimited sick leave
Lunch, dinner, and snacks are provided for all employees on workdays
Paid work trips, including staff retreats, business trips, and relevant conferences
A yearly $1,000 (USD) professional development budget

About Apollo Research

The rapid rise in AI capabilities offer tremendous opportunities, but also present significant risks. At Apollo Research, we’re primarily concerned with risks from Loss of Control, i.e. risks coming from the model itself rather than e.g. humans misusing the AI. We’re particularly concerned with deceptive alignment / scheming, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight. We work on the detection of scheming (e.g., building evaluations), the science of scheming (e.g., model organisms), and scheming mitigations (e.g., anti-scheming and control). We closely work with multiple frontier AI companies, e.g. to test their models before deployment or collaborate on scheming mitigations. At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.

About the Team

The current evals team consists of Jérémy Scheurer, Alex Meinke, Rusheb Shah, Bronson Schoen, Andrei Matveiakin, Felix Hofstätter, Axel Højmark, Teun van der Weij, Alex Lloyd, Alex Kedryk and Glen Rodgers. Alex Meinke leads and Marius Hobbhahn advises the evals team, though team members lead individual projects. You will mostly work with the evals team, but you will likely sometimes interact with the governance team to translate technical knowledge into concrete recommendations. You can find our full team here.

Equality Statement: Apollo Research is an Equal Opportunity Employer. We value diversity and are committed to providing equal opportunities to all, regardless of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, or sexual orientation.

How to apply

Please complete the application form with your CV. The provision of a cover letter is not necessary. Please also feel free to share links to relevant work samples. About the interview process: Our multi-stage process includes a screening interview, a take-home test (approx. 2.5 hours), 3 technical interviews, and a final interview with Marius (CEO). The technical interviews will be closely related to tasks the candidate would do on the job. There are no LeetCode-style general coding interviews. If you want to prepare for the interviews, we suggest running existing evaluations in Inspect evals and using those results to compare different models.

Applications deadline: We\'re accepting applications until 03 January 2026. We encourage early submissions and will start interviews in December 2025.

Your Privacy and Fairness in Our Recruitment Process: We are committed to protecting your data, ensuring fairness, and adhering to workplace fairness principles in our recruitment process. To enhance hiring efficiency, we use AI-powered tools to assist with tasks such as resume screening. These tools are designed and deployed in compliance with internationally recognized AI governance frameworks. Your personal data is handled securely and transparently. If you have questions about how your data is processed or wish to report concerns about fairness, please contact us at info@apolloresearch.ai.

Get your free, confidential resume review.

or drag and drop a PDF, DOC, DOCX, ODT, or PAGES file up to 5MB.

Top cities

Top companies

Popular jobs