A leading company is seeking a Data Engineer with extensive experience in building data pipelines and managing datasets for AI applications. The ideal candidate will have a strong background in Python and data engineering, with a focus on optimizing data for LLMs. Responsibilities include creating high-quality datasets, managing data versioning, and ensuring efficient data retrieval workflows. This role offers the opportunity to work on cutting-edge AI projects in a dynamic environment.
Build ingestion pipelines for structured/unstructured data using Python
Clean, normalize, and prepare data in formats suitable for LLM fine-tuning (e.g., JSONL, CSV)
Create high-quality, task-specific datasets for training and evaluation
Apply versioning to datasets using DVC or LakeFS for reproducibility
Generate embeddings using HuggingFace or Sentence Transformers
Manage vector indexes (FAISS, Weaviate) and optimize retrieval workflows
Tokenize and chunk long-form data for context window optimization
10 years of experience in a Data Engineering role
2 years of experience in an AI-adjacent data role
Proficiency in Python, pandas, and text processing tools
Familiarity with tokenization libraries (HuggingFace Tokenizers, SentencePiece)
Experience managing datasets and object storage (MinIO, NFS)
Understanding of LLM data constraints (context windows, formatting, prompt injection)