Data Engineer

Institute Of Foundation Models

Abu Dhabi

On-site

AED 120,000 - 200,000

Full time

30+ days ago

Job summary

A leading AI research institute in Abu Dhabi is seeking a Data Engineer specializing in Natural Language Processing. You will gather and prepare datasets to support NLP research, develop web crawling solutions, and implement scalable data pipelines. The ideal candidate has extensive experience in Python and data engineering. This role offers the chance to work alongside world-class researchers on impactful AI projects.

Qualifications

Bachelor's degree in a related technical field is required.
Master’s degree is preferred.

Responsibilities

Rapidly collect and prepare high-quality datasets for NLP research.
Develop and maintain web crawling solutions and APIs.
Refine outputs from LLMs to generate structured datasets.
Implement scalable data pipelines and document methodologies.
Collaborate with researchers to ensure data meets quality standards.

Skills

Data engineering

Python

Web crawling

Data processing

SQL

Cloud infrastructure

Data structures

Collaboration

Education

Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field

Master's degree or equivalent experience

Tools

AWS

Spark

Kafka

Kubernetes

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Your strategic and innovative problem‑solving skills will be instrumental in establishing MBZUAI as a global hub for high‑performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large‑scale data processing, you will quickly and effectively gather, curate, and prepare high‑quality datasets to support cutting‑edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM‑generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
Represent MBZUAI at industry and research forums, showcasing technical capabilities in large‑scale data processing and AI data infrastructure.
Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Academic Qualifications

Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field required.
Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience - Required

Extensive experience in data engineering, data processing, and automation using Python.
Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
Excellent problem‑solving abilities, attention to detail, and the capability to rapidly address technical challenges.
Strong communication and collaboration skills with cross‑functional teams.

Professional Experience - Preferred

Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
Experience with refining outputs from large‑scale AI models, such as LLM‑generated data.
Contributions to open‑source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
Familiarity with the latest advancements in NLP data processing and large language model technologies.
