AI Integration Engineer (Talent Pool) at Halo Media
We are seeking an exceptional AI Integration Engineer who operates at the intersection of development, operations, data, and systems engineering to build solutions for large-scale continuous data transformation and delivery.
This role focuses specifically on building and maintaining data pipelines for both structured and unstructured data, enabling the development and deployment of AI/ML models that power our RAG-based document processing and insight generation systems.
Key Responsibilities
- Design and implement data integrations and ingestion processes for internal and external data sources.
- Build and maintain scalable data pipelines for ingesting, processing, and transforming unstructured data sources (customer feedback, documents, multimedia content).
- Develop data models and mapping rules to transform raw data into actionable insights and structured outputs.
- Architect and implement semantic layers that integrate analytics data from multiple sources efficiently.
- Develop and maintain robust backend APIs and services supporting the entire prompt-to-answer workflow.
- Implement and optimize retrieval logic including vector search, hybrid search, and advanced information retrieval techniques.
- Manage document ingestion pipelines including parsing, OCR, chunking, and embedding generation.
- Support integration of various LLM providers (OpenAI, Azure AI, Anthropic) with internal business data sources.
- Ensure reliability, scalability, and low latency of AI response generation systems.
- Implement data governance policies and procedures for responsible and ethical use of data in AI applications.
- Develop data quality monitoring and validation processes specifically for AI/ML datasets, including bias identification and mitigation.
- Build and maintain monitoring, alerting, and observability systems for AI infrastructure.
- Collaborate with analytics and data science teams to understand requirements and deliver solutions.
- Work with data scientists to ensure data is available in appropriate format and quality for model training and deployment.
- Maintain comprehensive documentation including data models, mapping rules, and data dictionaries.
- Partner with internal business stakeholders, technology resources, and external vendors.
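For illustration only (this sketch is not part of the role description), the prompt-to-answer workflow the responsibilities above describe — chunking documents, generating embeddings, and retrieving by vector similarity — can be reduced to a minimal toy in Python. The `embed` function here is a hashed bag-of-words stand-in for a real embedding model (e.g. from OpenAI or Cohere), and the fixed-size `chunk` splitter stands in for token-aware chunkers used in production:

```python
import math
import zlib
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hashed bag-of-words,
    # L2-normalized so dot products behave like cosine similarity.
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(token.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(document: str, size: int = 20) -> list[str]:
    # Split a document into fixed-size word windows. Real pipelines
    # use token-aware, overlap-aware chunkers instead.
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank chunks by similarity of their embedding to the query embedding;
    # the top-k results would be passed to an LLM as grounding context.
    q = embed(query)
    scored = sorted(
        chunks,
        key=lambda c: sum(a * b for a, b in zip(q, embed(c))),
        reverse=True,
    )
    return scored[:k]
```

In a production system each piece is replaced by the tools named in this posting: a vector database (Pinecone, Milvus, Weaviate, Chroma) for storage and search, a hosted embedding model for `embed`, and an orchestration framework such as LangChain or LlamaIndex for the end-to-end flow.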
Qualifications
- Bachelor's degree in Computer Science, Engineering, or equivalent work experience.
- 5+ years of experience in designing, building, and maintaining scalable data solutions for large-scale analytics.
- Proven ability to lead development projects from start to finish with demonstrated results.
- Proficiency in Python, Java, or R and open-source frameworks for distributed processing (Hadoop, Spark).
- Expert-level SQL and development experience with cloud database environments (Snowflake, Redshift, Databricks).
- Hands-on experience with modern cloud data stack tools for code management, versioning (Git), CI/CD, and automation.
- Experience with orchestration tools (Apache Airflow) and monitoring & alerting systems.
- Strong understanding of data modeling, data warehousing, and ETL concepts.
- Experience with vector databases (Pinecone, Milvus, Weaviate, Chroma).
- Proficiency in handling unstructured data formats (JSON, Parquet, text, images, audio, video).
- Familiarity with AI/ML model development lifecycle and data requirements for training and deployment.
- Experience with cloud-based AI/ML platforms and services.
- Knowledge of data augmentation techniques for improving AI/ML model performance.
- Experience with data labeling platforms (Amazon SageMaker Ground Truth, Labelbox).
- Understanding of responsible AI principles and data privacy regulations (GDPR, CCPA).
- Experience with data governance and observability tools (Datahub, Collibra).
- Basic frontend development experience (HTML, CSS, JavaScript).
Tools & Technologies
- Programming & Frameworks: Python, Java, R; Apache Spark, Apache Hadoop; FastAPI, Django, Flask
- Data & AI Platforms: Snowflake, Redshift, Databricks; Pinecone, Milvus, Weaviate, Chroma; LangChain, LlamaIndex; OpenAI, Azure AI, Anthropic, Cohere
- Cloud & Infrastructure: AWS, Azure, Google Cloud Platform; Docker, Kubernetes; Apache Airflow, Apache Kafka
- Development Tools: Git, GitHub, GitLab; Jenkins, GitHub Actions; Jupyter Notebooks, Dataiku