Activez les alertes d’offres d’emploi par e-mail !

Backdoor Attack Scalability and Defense Evaluation in Large Language Models H/F

CEA

Gif-sur-Yvette

Sur place

EUR 20 000 - 40 000

Plein temps

Il y a 30+ jours

Générez un CV personnalisé en quelques minutes

Décrochez un entretien et gagnez plus. En savoir plus

Résumé du poste

A leading research organization in France is offering an internship to advance data poisoning attacks and defenses for large language models. The role involves implementing attack methods, testing model architectures, and establishing evaluation protocols. Candidates should have a strong background in computer science and programming skills in Python and C++. Comfort in English is essential for documentation. This opportunity is based in Gif-sur-Yvette and suitable for Master 2 graduates.

Qualifications

Strong background in computer science or related field focusing on machine learning security.
Experience with machine learning systems and model training is a plus.
Ability to collaborate in research-driven environments.

Responsabilités

Implement state-of-the-art attack methods across multiple vectors.
Test attacks on diverse model architectures and scales.
Establish standardized evaluation protocols with metrics such as Attack Success Rate.
Evaluate existing defenses and develop reproducible test suites.

Connaissances

Machine learning security

Adversarial machine learning

Programming in Python

Programming in C++

Independent work

Research collaboration

Documentation in English

Formation

Bac+5 - Master 2

Computer Science

Overview

Backdoor Attack Scalability and Defense Evaluation in Large Language Models H/F

Category

Mathematics, information, scientific, software

Contract

Internship

Job title

Backdoor Attack Scalability and Defense Evaluation in Large Language Models H/F

Subject

Large Language Models (LLMs) deployed in safety-critical domains are increasingly vulnerable to backdoor and data poisoning attacks. Recent studies show that even a small number of poisoned samples can compromise models at massive scales, highlighting urgent security challenges. This internship focuses on empirically testing and advancing poisoning attacks and defenses in LLMs through systematic experimentation and adversarial evaluation. Tasks include implementing state-of-the-art attack methods (e.g., jailbreaks, denial-of-service, data extraction), evaluating defenses, analyzing attack scalability across model sizes, and establishing standardized evaluation metrics such as Attack Success Rate and Clean Accuracy to support reproducible benchmarking and robust model defense strategies.

Context

Large Language Models (LLMs) deployed in safety-critical domains face significant threats from backdoor attacks. Recent empirical evidence contradicts previous assumptions about attack scalability: poisoning attacks remain effective regardless of model or dataset size, requiring as few as 250 poisoned documents to compromise models from up to 13B parameters. This suggests data poisoning becomes easier, not harder, as systems scale.

Backdoors persist through post-training alignment techniques like Supervised Fine-Tuning and Reinforcement Learning from Human Feedback, compromising current defenses. However, persistence depends critically on poisoning timing and backdoor characteristics. Current verification methods are computationally prohibitive—Proof-of-Learning requires full model retraining and complete training transcript access. While step-wise verification shows promise for runtime detection, scalability to production models and resilience against adaptive adversaries remain unresolved.

Existing defenses focus on post-training detection rather than preventing attack success during training. Advancing data poisoning scaling dynamics—understanding how attack success correlates with dataset composition, poisoning density, and model capacity—is essential for developing evidence-based threat models and defense strategies.

Objective

This internship aims to empirically test and advance data poisoning attacks and defenses for LLMs through systematic experimentation and adversarial evaluation. Key responsibilities include: implementing state-of-the-art attack methods across multiple vectors (jailbreaking, targeted refusal, denial-of-service, information extraction); testing attacks on diverse model architectures and scales; establishing standardized evaluation protocols with metrics such as Attack Success Rate and Clean Accuracy; evaluating existing defenses, particularly step-wise verification; and developing reproducible test suites for objective defense benchmarking.

Requirements

Background in computer science or a related field, with a focus on machine learning security, or adversarial machine learning.
Strong programming skills in languages commonly used for machine learning tasks (e.g., Python, C++).
Experience with machine learning systems, model training, or adversarial robustness is a plus.
Ability to work independently and collaborate in a research-driven environment.
Comfortable working in English, essential for documentation purposes.

Site

Job location

Location

Gif-sur-Yvette

Languages

Prepared diploma

Bac+5 - Master 2

Recommended training

Computer Science

PhD opportunity

Oui

Requester

27/10/2025

General information

Organisation

The French Alternative Energies and Atomic Energy Commission (CEA) is a key player in research, development and innovation in four main areas: defence and security, nuclear energy (fission and fusion), technological research for industry, and fundamental research in the physical sciences and life sciences. Drawing on its widely acknowledged expertise, and thanks to its 16,000 technicians, engineers, researchers and staff, the CEA actively participates in collaborative projects with a large number of academic and industrial partners. The CEA is established in ten centers spread throughout France.

Obtenez votre examen gratuit et confidentiel de votre CV.

ou faites glisser et déposez un fichier PDF, DOC, DOCX, ODT ou PAGES jusqu’à 5 Mo.

Noté « Excellent » sur la base de 19 192 évaluations