LLM Data Engineer | United States | Fully Remote
Company: Halo Media
Job Overview
We are seeking an experienced AI/LLM Data Engineer to build and maintain the data pipeline for our Generative AI platform. The ideal candidate will be well-versed in Large Language Model (LLM) technologies and have a strong background in data engineering, with a focus on Retrieval-Augmented Generation (RAG) and knowledge-base techniques. This role is part of the AI Center of Excellence (COE) within DX Tech & Digital and reports to the Director of AI Solutions & Development.
Responsibilities
- Design, implement, and maintain end-to-end multi-stage data pipelines for LLMs, including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) data processes.
- Identify, evaluate, and integrate diverse data sources to support the Generative AI platform.
- Develop and optimize workflows for chunking, indexing, ingestion, and vectorization of data.
- Benchmark and implement vector stores, embedding techniques, and retrieval methods.
- Create flexible pipelines supporting multiple embedding algorithms and search types.
- Implement auto-tagging systems and data preparation processes for LLMs.
- Develop tools for crawling, cleaning, and refining text and image data.
- Collaborate with teams to ensure data quality and relevance.
- Work with data lake architectures to optimize storage and processing.
- Integrate and optimize workflows using Snowflake and vector store technologies.
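To illustrate the kind of work these responsibilities describe, the chunking, embedding, and retrieval steps can be sketched in miniature. This is a toy, self-contained example using only the Python standard library; the function names are hypothetical, and a production pipeline would replace the hash-based `embed` with a real embedding model and the in-memory list with a vector store.

```python
# Illustrative sketch of a chunk -> embed -> retrieve flow.
# All names here are hypothetical; a real pipeline would use an
# embedding model and a vector database rather than these stand-ins.
import hashlib
import math

def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str, dims: int = 16) -> list[float]:
    """Toy deterministic embedding: hash character trigrams into a
    fixed-size vector, then L2-normalize. Stands in for a model call."""
    vec = [0.0] * dims
    for i in range(len(chunk) - 2):
        h = int(hashlib.md5(chunk[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

# "Index" every chunk, then retrieve the closest chunk for a query.
corpus = "Retrieval-Augmented Generation grounds LLM answers in indexed documents."
index = [(chunk, embed(chunk)) for chunk in chunk_text(corpus)]
query_vec = embed("Retrieval-Augmented Generation")
best_chunk, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))
```

The same shape scales up directly: `chunk_text` becomes a document-aware splitter, `embed` a batched model call, and `index` a vector store with an approximate-nearest-neighbor search in place of the `max` scan.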
Minimum Requirements
- Master's degree in Computer Science, Data Science, or a related field.
- 3-5 years of experience in data engineering, preferably in AI/ML.
- Proficiency in Python and experience working with JSON data, HTTP-based APIs, and related tooling.
- Strong understanding of LLM architectures and data needs.
- Experience with RAG systems, knowledge bases, and vector databases.
- Familiarity with embedding techniques and information retrieval.
- Experience with data cleaning, tagging, and annotation.
- Knowledge of data crawling techniques and their ethical and legal considerations.
- Strong problem-solving skills and ability to work in fast-paced environments.
- Experience with Snowflake integration in AI/ML pipelines.
- Experience with vector store technologies and data lakehouse architectures.
Preferred Skills
- Experience with LLM/RAG frameworks.
- Knowledge of distributed computing platforms (e.g., Spark, Dask).
- Familiarity with data versioning and experiment tracking tools.
- Experience with cloud platforms (AWS, GCP, Azure).
- Understanding of data privacy and security.
- Hands-on experience with lakehouse solutions.
- Proficiency in query optimization in Snowflake or Databricks.
Benefits
- US employee benefits package.
Additional Details
- Seniority level: Mid-Senior level.
- Employment type: Full-time.
- Industry: IT Services and IT Consulting.