LLM Data Engineer | United States | Fully Remote
Join us at Halo Media as an AI/LLM Data Engineer to develop and maintain data pipelines for our Generative AI platform. The ideal candidate will be experienced in Large Language Model (LLM) technologies, data engineering, and techniques like Retrieval-Augmented Generation (RAG) and knowledge-base integration. This role reports to the Director of AI Solutions & Development within the AI Center of Excellence (CoE) and works on strategic projects with cross-functional teams to deliver innovative AI solutions.
Responsibilities
- Design, implement, and maintain multi-stage data pipelines for LLMs, including data processes for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)
- Identify, evaluate, and integrate diverse data sources for the Generative AI platform
- Develop workflows for chunking, indexing, ingestion, and vectorization of data
- Benchmark and implement vector stores, embedding techniques, and retrieval methods
- Create flexible pipelines supporting multiple embedding algorithms and search types
- Implement auto-tagging systems and data preparation for LLMs
- Develop tools for crawling, cleaning, and refining text and image data
- Collaborate with teams to ensure data quality and relevance
- Optimize data storage and processing using data lakehouse architectures
- Integrate workflows with Snowflake and vector store technologies
Requirements
- Master's degree in Computer Science, Data Science, or related field
- 3-5 years of experience in data engineering, preferably in AI/ML
- Proficiency in Python and in working with JSON, HTTP, and related tools
- Strong understanding of LLM architectures and data needs
- Experience with RAG systems, knowledge bases, and vector databases
- Knowledge of embedding techniques, similarity search, and information retrieval
- Experience with data cleaning, tagging, and annotation processes
- Familiarity with data crawling and the ethical considerations it involves
- Strong problem-solving skills and ability to work in fast-paced environments
- Experience with Snowflake, vector store technologies, and data lakehouse architectures
- Excellent communication and collaboration skills
- Passion for innovative and ethical AI development
- Experience with frameworks and tools like LangChain, LlamaIndex, Semantic Kernel, and OpenAI function calling
- Knowledge of LLM parameters and outcome evaluation metrics
Preferred Skills
- Experience with LLM/RAG frameworks
- Knowledge of distributed computing platforms (e.g., Spark, Dask)
- Experience with data versioning and experiment tracking tools
- Cloud platform experience (AWS, GCP, Azure)
- Understanding of data privacy and security
- Hands-on experience with data lakehouse solutions
- Proficiency in query optimization in Snowflake or Databricks
- Experience with vector store technologies
Benefits
- Benefits package for US employees
Industries
- IT Services and IT Consulting