Data Engineer Intern / Data Enablement with AI for AI

CAST

Meudon

On-site

EUR 45 000 - 70 000

Full-time

Posted today

Job summary

A leading software intelligence company in Meudon is seeking a Data Engineer to leverage AI tools for data curation and management. The ideal candidate will aggregate software project data, automate cleaning and structuring processes using advanced tools, and collaborate with AI teams to enhance data schemas. Strong Python skills and familiarity with LLMs and NLP tools are essential. This role presents an exciting opportunity to be part of an Agile team driving innovation in software intelligence.

Qualifications

Experience in data engineering, ML data ops, or structured data curation.
Proficient in Python, with strong data pipeline skills.
Experience with LLMs or NLP tools.

Responsibilities

Aggregate and structure data from software ecosystems.
Automate data cleaning, entity extraction, and semantic annotation.
Collaborate with AI teams to evolve schemas and labeling strategies.

Skills

Data engineering

Machine Learning operations

Python

Data pipeline skills

LLMs

NLP tools

Tokenization

Chunking

Tools

Pandas

PyArrow

Regex

Airflow

Hugging Face

spaCy

LangChain

CAST, a software company based in Meudon, is the market leader in Software Intelligence.

Working at CAST R&D means being an important part of a highly talented, fast-paced, multicultural, Agile team.

Overview

We’re building the foundation to ground AI with AAA Software Intelligence (Aggregated, Accurate, and Augmented), sourced from real-world software and technology projects. This role goes beyond manual curation: it's about using AI to empower AI. You will leverage LLMs, embeddings, and NLP tools to clean, enrich, and validate data, enabling AI systems and autonomous agents to rely on it for training and contextual understanding.

Responsibilities

Aggregate and structure data from software ecosystems (codebases, APIs, tickets, documentation, architecture specs).
Apply LLMs, embeddings, and NLP tools to automate data cleaning, entity extraction, metadata tagging, and semantic annotation.
Build and maintain semantic pipelines for LLM fine-tuning and RAG (Retrieval-Augmented Generation).
Organize datasets into formats suitable for Agent-to-Agent (A2A) interactions: APIs, vector DBs, knowledge graphs, etc.
Collaborate with AI teams to evolve schemas, prompts, labeling strategies, and evaluation data.
Ensure strong data lineage, reproducibility, and version control.

Requirements

Experience in data engineering, ML data ops, or structured data curation.
Proficient in Python, with strong data pipeline skills (Pandas, PyArrow, regex, Airflow).
Experience with LLMs or NLP tools (e.g., Hugging Face, spaCy, LangChain).
Ability to use AI to clean, enrich, classify, and organize technical content.
Strong understanding of tokenization, chunking, and model input preparation.
Experience working with software project data: Git repos, APIs, technical documentation, etc.

Bonus Skills

Knowledge of vector DBs (FAISS, Qdrant, Weaviate) or knowledge graphs (Neo4j, RDF, SPARQL).
