Senior Data Scientist (LLM)

TN Spain

Donostia/San Sebastián

Hybrid

EUR 60.000 - 100.000

Full-time

Posted 6 days ago

Vacancy summary

An established industry player is seeking a Senior Data Scientist to join their innovative team. This role involves creating high-quality datasets for training and fine-tuning Large Language Models (LLMs), ensuring data quality, diversity, and ethical compliance. You will collaborate with leading experts in a fast-paced environment, working on cutting-edge technologies that leverage quantum computing and AI. The company promotes sustainability, diversity, and an inclusive culture, offering a hybrid work opportunity and a comprehensive benefits package. If you're passionate about data science and want to make a significant impact, this is the perfect opportunity for you.

Benefits

Signing bonus
Private health insurance
Relocation package
Work visa sponsorship
Educational budget
Language classes
Discounted lunch options
Career plan
Progressive company culture
Opportunity to learn and teach

Experience

  • 3+ years of experience in data science and dataset creation for LLMs.
  • In-depth knowledge of the LLM lifecycle and data quality metrics.

Responsibilities

  • Design and implement strategies for dataset creation for LLM training.
  • Develop scalable pipelines for data collection, cleaning, and validation.

Skills

Python
Data Science
Machine Learning
NLP
Data Quality Metrics

Education

Bachelor's in Computer Science
Master's in Data Science
Ph.D. in AI

Tools

Pandas
NumPy
spaCy
Hugging Face Datasets
Prodigy
Label Studio

Job description

Client: Multiverse Computing

Job Category: Other

EU work permit required: Yes

Posted: 27.04.2025

Expiry Date: 11.06.2025

Come and join our multicultural team!

5 locations
+27 languages

Multiverse Computing

Multiverse is a well-funded, fast-growing deep-tech company founded in 2019. We are the biggest quantum software company in the EU and one of the 100 most promising AI companies in the world (according to CB Insights, 2023). Our team of 150+ employees is growing, fully multicultural, and international.

We provide hyper-efficient software to companies seeking to gain an edge with quantum computing and artificial intelligence. Our main products, Singularity and CompactifAI, address critical needs across various industries. Singularity remains a trusted solution for blue-chip companies in finance, energy, manufacturing, cybersecurity, and more. CompactifAI is a groundbreaking compression tool for foundation models that uses tensor networks to drastically compress AI systems, such as large language models, making them efficient and portable.

You will be working alongside world-leading experts to build solutions that tackle real-life problems. We look for passionate people who want to grow in an ethics-driven environment that promotes sustainability and diversity. We aim to continue building our truly inclusive culture, so come and join us.

We are seeking a Senior Data Scientist with deep expertise in creating high-quality datasets for training and fine-tuning Large Language Models (LLMs). You will be responsible for designing and implementing scalable data pipelines and strategies to support all stages of LLM development: pretraining, supervised fine-tuning, and reinforcement learning from human feedback (RLHF).

This role is critical to ensuring the robustness, safety, and alignment of our AI models. You will have the autonomy to explore innovative data sourcing and curation methods and the opportunity to directly influence the capabilities of state-of-the-art LLMs.

As a Senior Data Scientist, you will:

  • Design and implement strategies for creating, sourcing, and augmenting datasets tailored for LLM training and fine-tuning.
  • Develop scalable pipelines to collect, clean, filter, annotate, and validate large volumes of text data.
  • Conduct data audits to ensure quality, diversity, ethical compliance, and bias mitigation.
  • Collaborate with ML engineers and researchers to align datasets with training objectives and model evaluation needs.
  • Use techniques such as active learning, synthetic data generation, and self-supervised learning to maximize dataset efficiency.
  • Leverage human-in-the-loop (HITL) workflows for data labeling and validation where necessary.
  • Contribute to building data documentation and metadata standards (e.g., Datasheets for Datasets).
  • Keep up to date with research trends in dataset curation, LLM pretraining data, and benchmarking.

Required Qualifications

  • Bachelor’s, Master’s, or Ph.D. in Computer Science, AI, Data Science, or a related field.
  • 3+ years of experience in data science, machine learning, or related roles, with demonstrated experience in dataset creation for NLP or LLMs.
  • In-depth knowledge of the LLM lifecycle: pretraining, fine-tuning, alignment, and evaluation.
  • Proficiency in Python and data tooling ecosystems (Pandas, NumPy, spaCy, Hugging Face Datasets & Transformers).
  • Hands-on experience with text data collection from diverse sources: web scraping, APIs, proprietary corpora, etc.
  • Strong understanding of data quality metrics including bias detection, toxicity, and readability.
  • Experience working with annotation tools (e.g., Prodigy, Label Studio) and managing annotation teams or workflows.

Preferred Qualifications

  • Experience building or contributing to datasets used in LLM pretraining or supervised fine-tuning.
  • Familiarity with RLHF workflows and alignment techniques (e.g., preference modeling, reward modeling).
  • Exposure to multilingual and low-resource language datasets.
  • Contributions to open-source datasets, tools, or publications in dataset-centric research.
  • Knowledge of ethical AI, data governance, privacy laws (e.g., GDPR), and responsible data use.

Benefits

  • Indefinite contract.
  • Signing bonus.
  • Work visa sponsorship (if applicable).
  • Relocation package (if applicable).
  • Private health insurance.
  • Eligibility for educational budget according to internal policy.
  • Hybrid work opportunity.
  • Language classes and discounted lunch options.
  • A fast-paced environment working on cutting-edge technologies.
  • Career plan, with opportunities to learn and teach.
  • Progressive company with a happy-people culture.

As an equal opportunity employer, Multiverse Computing is committed to building an inclusive workplace. The company welcomes applications from people of all backgrounds, regardless of age, citizenship, ethnic or racial origin, gender identity, disability, marital status, religion or ideology, or sexual orientation.
