
Member of Engineering, Pre-training/data (Remote)

Poolside

Germany

Remote

EUR 60,000 - 90,000

Full-time

14 days ago

Summary

A leading AI research lab in Germany is looking for a new team member to enhance dataset quality for training models. The role involves working closely with various teams to ensure high data quality and conducting experiments to improve dataset generation. Ideal candidates should have experience in dataset design and a solid understanding of machine learning. Join an elite team with a strong engineering culture!

Benefits

Elite engineering culture
Access to large GPU clusters
Opportunity to work alongside top-tier talent

Qualifications

  • Hands-on experience in improving dataset quality.
  • Familiarity with the latest research related to LLMs.
  • Ability to conduct quantitative data experiments.

Tasks

  • Follow the latest research related to LLMs and data quality.
  • Work closely with the Pre-training and Fine-tuning teams.
  • Conduct and analyze data ablation experiments.

Skills

Experience in dataset design
Understanding of LLMs
Data analysis skills
Job description

About Poolside

We are software's leading AI research lab.

We are a frontier lab focused on building the most capable models and the systems to support them. Our models are generally capable and purpose-built to excel at software engineering. Our proprietary approach and techniques allow our models to learn like the best developers: through trial and error, navigating ambiguity to discover working solutions.

About the Role

You would be working on our data team focused on the quality of the datasets being delivered for training our models. This is a hands-on role where your #1 mission would be to improve the quality of the pretraining datasets by leveraging your previous experience, intuition and training experiments. This includes synthetic data generation and data mix optimization.

You would be closely collaborating with other teams like Pre-training, Fine-tuning and Product to define high-quality data both quantitatively and qualitatively.

Staying in sync with the latest research in dataset design and pretraining is key to success in this role. You would constantly pursue original research initiatives through short, time-bounded experiments, and bring strong engineering competence to deploying your solutions in production. Because the volumes of data to process are massive, you will have at your disposal a performant distributed data pipeline and a large GPU cluster.

Why you should join

  • Code the future of AI-powered development – Build the scalable platform that powers Poolside's fine-tuning efforts, directly impacting how our foundational models learn and improve

  • Series C funding imminent – Join a $500M+ Series B startup that's about to close an even larger round, with massive compute resources and runway for years

  • Elite engineering culture – 75% of the 120-person team is engineering, working alongside ex-GitHub CTO Jason Warner and top-tier talent from Snap, GitHub, and other leading companies

Your Mission

To deliver massive-scale datasets of natural language and source code with the highest quality for training Poolside models.

Responsibilities

  • Follow the latest research related to LLMs and data quality in particular. Be familiar with the most relevant open-source datasets and models

  • Work closely with other teams such as Pre-training, Fine-tuning or Product to ensure short feedback loops on the quality of the models delivered

  • Suggest, conduct and analyze data ablations or training experiments that use quantitative insights to improve the quality of the generated datasets
