
Data Engineer, Data Enablement with AI for AI

CAST Software

Meudon

On-site

EUR 50,000 - 70,000

Full-time

24 days ago

Job summary

A leading software intelligence company in Meudon is seeking a Data Engineer to join their R&D team. This role involves aggregating and structuring data using AI technologies, with a focus on LLMs and NLP tools. Ideal candidates will have at least 3 years of experience in data engineering, proficiency in Python, and a solid understanding of tokenization and chunking. This full-time position offers the opportunity to work in a fast-paced and innovative environment.

Qualifications

  • At least 3 years in data engineering, ML, data ops, or structured data curation.
  • Proficient in Python with strong data pipeline skills.
  • Experience with LLMs or NLP tools.
  • Strong understanding of tokenization, chunking, and model input preparation.

Responsibilities

  • Aggregate and structure data from software ecosystems.
  • Apply LLMs, embeddings, and NLP tools to automate data cleaning and entity extraction.
  • Build and maintain semantic pipelines for LLM fine-tuning.
  • Organize datasets into formats suitable for Agent-to-Agent (A2A) interactions.
  • Collaborate with AI teams to evolve schemas and evaluation data.
  • Ensure strong data lineage and reproducibility.

Skills

Python
Data pipeline skills
LLMs or NLP tools
Tokenization
Data Engineering
Data Ops
Structured data curation
Apache Hive
S3
Hadoop
Redshift
Spark
AWS
Apache Pig
NoSQL
Big Data
Data Warehouse
Kafka
Scala

Job description

Overview

CAST, a software company based in Meudon, is the market leader in Software Intelligence.

Working at CAST R&D means being an important part of a highly talented, fast-paced, multicultural, and Agile team.

We’re building the foundation to ground AI with AAA Software Intelligence (Aggregated, Accurate, and Augmented), sourced from real-world software and technology projects. This role goes beyond manual curation: it’s about using AI to empower AI. You will leverage LLMs, embeddings, and NLP tools to clean, enrich, and validate data, enabling AI systems and autonomous agents to rely on it for training and contextual understanding.

Responsibilities

Aggregate and structure data from software ecosystems (codebases, APIs, tickets, documentation, architecture specs).

Apply LLMs, embeddings, and NLP tools to automate data cleaning, entity extraction, metadata tagging, and semantic annotation.

Build and maintain semantic pipelines for LLM fine-tuning and RAG (Retrieval-Augmented Generation).

Organize datasets into formats suitable for Agent-to-Agent (A2A) interactions: APIs, vector DBs, knowledge graphs, etc.

Collaborate with AI teams to evolve schemas, prompts, labeling strategies, and evaluation data.

Ensure strong data lineage, reproducibility, and version control.
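
To give a sense of what the lineage and reproducibility responsibility can look like in practice, here is a minimal sketch. It is an illustration only, not CAST's actual tooling: the names `LineageRecord`, `make_record`, and `dataset_fingerprint`, and the sample sources, are hypothetical. Each curated item carries its source, the transform applied, and a content hash, and the dataset as a whole gets an order-independent fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One curated text item plus the metadata needed to trace it."""
    source: str        # where the text came from (hypothetical identifier)
    transform: str     # name of the cleaning step that was applied
    text: str          # the cleaned content
    content_hash: str  # SHA-256 of the cleaned content


def make_record(source: str, transform: str, raw_text: str) -> LineageRecord:
    # Collapse whitespace so the same content always hashes identically.
    cleaned = " ".join(raw_text.split())
    digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    return LineageRecord(source, transform, cleaned, digest)


def dataset_fingerprint(records: list[LineageRecord]) -> str:
    # Sort the hashes so the fingerprint does not depend on record order.
    joined = "\n".join(sorted(r.content_hash for r in records))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical sample inputs, for illustration only.
records = [
    make_record("docs/README.md", "normalize_whitespace", "Install  with\npip."),
    make_record("tickets/123", "normalize_whitespace", "Crash on   startup"),
]
print(json.dumps(asdict(records[0]), indent=2))
print(dataset_fingerprint(records))
```

Two runs over the same inputs should produce the same fingerprint, which is one simple way to check that a dataset build is reproducible.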

Requirements

At least 3 years in data engineering, ML, data ops, or structured data curation.

Proficient in Python with strong data pipeline skills (Pandas, PyArrow, regex, Airflow).

Experience with LLMs or NLP tools (e.g. Hugging Face, spaCy, LangChain).

Ability to use AI to clean, enrich, classify, and organize technical content.

Strong understanding of tokenization, chunking, and model input preparation.

Experience working with software project data: Git repos, APIs, technical documentation, etc.
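
As an illustration of the tokenization and chunking requirement, here is a minimal sketch of fixed-size chunking with overlap, a common way to prepare long documents for model input. It makes two loud assumptions: whitespace splitting stands in for a real model tokenizer, and `chunk_tokens` with its `max_tokens` and `overlap` parameters is a hypothetical helper, not part of any library named in this posting.

```python
def chunk_tokens(tokens: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting as a stand-in for a model tokenizer.
text = "The parser reads each source file and emits one node per function"
tokens = text.split()
for chunk in chunk_tokens(tokens, max_tokens=5, overlap=2):
    print(" ".join(chunk))
```

The overlap keeps some shared context between neighbouring chunks, so a sentence cut at a boundary is still partly visible in the next chunk. In a real pipeline the token counts would come from the target model's own tokenizer.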

Bonus Skills

Knowledge of vector DBs (FAISS, Qdrant, Weaviate) or knowledge graphs (Neo4j, RDF, SPARQL).

Employment Type

Full-Time

Vacancy

1
