A leading AI research lab in Germany is looking for a new team member to enhance dataset quality for training models. The role involves working closely with various teams to ensure high data quality and conducting experiments to improve dataset generation. Ideal candidates should have experience in dataset design and a solid understanding of machine learning. Join an elite team with a strong engineering culture!
About Poolside
We are software's leading AI research lab.
We are a frontier lab focused on building the most capable models and the systems to support them. Our models are generally capable and purpose-built to excel at software engineering. Our proprietary approach and techniques allow our models to learn like the best developers: through trial and error, navigating ambiguity to discover working solutions.
About the Role
You would work on our data team, which is focused on the quality of the datasets delivered for training our models. This is a hands-on role where your #1 mission would be to improve the quality of the pretraining datasets by leveraging your previous experience, intuition, and training experiments. This includes synthetic data generation and data mix optimization.
You would collaborate closely with other teams such as Pre-training, Fine-tuning, and Product to define high-quality data both quantitatively and qualitatively.
Staying in sync with the latest research on dataset design and pretraining is key to success in this role: you would constantly drive original research initiatives through short, time-bounded experiments while bringing strong engineering competence to deploy your solutions in production. Because the volumes of data to process are massive, you will have a performant distributed data pipeline and a large GPU cluster at your disposal.
Why you should join
Code the future of AI-powered development – Build the scalable platform that powers Poolside's fine-tuning efforts, directly impacting how our foundation models learn and improve
Series C funding imminent – Join a $500M+ Series B startup that's about to close an even larger round, with massive compute resources and runway for years
Elite engineering culture – 75% of the 120-person team is engineering, working alongside ex-GitHub CTO Jason Warner and top-tier talent from Snap, GitHub, and other leading companies
Your Mission
To deliver massive-scale, highest-quality datasets of natural language and source code for training poolside models.
Responsibilities
Follow the latest research related to LLMs and data quality in particular. Be familiar with the most relevant open-source datasets and models
Work closely with other teams such as Pre-training, Fine-tuning, and Product to ensure short feedback loops on the quality of the delivered models
Suggest, conduct, and analyze data ablations and training experiments that yield quantitative insights for improving the quality of the generated datasets