Data Engineer (Senior)

BETTERDATA PTE. LTD.

Singapore

Hybrid

SGD 90,000 - 120,000

Full time

Job summary

A data-driven technology company in Singapore is seeking a Senior Data Engineer to build and maintain data infrastructure for scalable solutions. The ideal candidate has strong experience in scaling data and machine learning systems. Key responsibilities include architecting data pipelines, ensuring data quality, and collaborating with multiple teams. This role offers flexible work arrangements and equity eligibility.

Benefits

Flexible time-off arrangements
Work from office or WFH on some days
Competitive equity packages

Qualifications

  • 3+ years of experience in building scalable data solutions.
  • Expertise in automated data quality frameworks.
  • Hands-on experience with web scraping tools.

Responsibilities

  • Build data ingestion pipelines from enterprise relational databases.
  • Design scalable data pipelines for batch processing.
  • Implement monitoring and alerting for pipeline health.

Skills

Scaling data pipelines
Machine learning systems
ETL/ELT pipelines
Web scraping

Education

Bachelor's degree in Computer Science or related field

Tools

Python
Pandas
Spark
Airflow

Job description

Who We Are Looking For:

We are seeking an experienced Data Engineer (Senior) to build and maintain data infrastructure that converts our research into scalable, production-ready solutions for synthetic tabular data generation. You will also architect and operate our large-scale data curation, scraping, and cleaning pipelines to deliver massive datasets for pretraining and finetuning large language models on tabular and unstructured domains.

This is an individual contributor (IC) role suited for someone who thrives in a fast-paced, early-stage start-up environment. The ideal candidate has experience scaling data and machine learning systems to handle datasets with billions of records and can build and optimize complex data pipelines for enterprise applications. You'll work closely with software, machine learning and applied research teams to optimize performance and ensure seamless integration of systems, handling data from financial institutions, government agencies, consumer brands and more.

Key Responsibilities:
Data Infrastructure and Pipeline Development:
  • Build data ingestion pipelines from enterprise relational databases (e.g. Oracle, SQL Server, PostgreSQL, MySQL, Databricks, Snowflake, BigQuery) and files (e.g. Parquet, CSV) for large-scale synthetic data pipelines.
  • Design scalable data pipelines for batch processing.
  • Architect and maintain data warehouses and data lakes (e.g. Delta Lake) optimized for synthetic data training and generation workflows.
  • Seamlessly transform Pandas-based research code into production-ready pipelines.
  • Build automated data quality monitoring and validation systems to ensure data integrity throughout the pipeline lifecycle (see the illustrative sketch after this list).
  • Implement comprehensive data lineage tracking and audit capabilities for regulatory compliance and privacy validation.
  • Design robust error handling mechanisms, with automatic retries and data recovery in case of pipeline failures.
  • Track performance metrics such as data throughput, latency, and processing times to ensure efficient pipeline operations at scale.
  • Implement monitoring and alerting (e.g. Prometheus, Grafana) for pipeline health, throughput, and data quality metrics.
  • Optimize resource allocation and cost efficiency for distributed processing at terabyte-to-petabyte scale.
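
For illustration only, not part of the role requirements: a minimal sketch of the kind of batch ingestion step with automated quality checks and retries described above, using Airflow and Pandera (both named in this listing). The table name, schema, and connection string are hypothetical.

    # Illustrative sketch; the table, schema, and connection string are hypothetical.
    import pandas as pd
    import pandera as pa
    from airflow.decorators import dag, task
    from pendulum import datetime

    # Hypothetical data-quality contract: rule-based format and range validation.
    schema = pa.DataFrameSchema({
        "txn_id": pa.Column(str, unique=True),
        "amount": pa.Column(float, pa.Check.ge(0)),
        "currency": pa.Column(str, pa.Check.isin(["SGD", "USD", "EUR"])),
    })

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def batch_ingest():
        @task(retries=3)  # automatic retries on transient failures
        def extract() -> str:
            # Hypothetical source: pull one daily batch from a relational database.
            df = pd.read_sql("SELECT * FROM transactions", "postgresql://user:pw@host/db")
            path = "/tmp/transactions.parquet"
            df.to_parquet(path)
            return path

        @task
        def validate(path: str) -> str:
            # Fail the run (and trigger alerting) if the batch violates the contract.
            schema.validate(pd.read_parquet(path), lazy=True)
            return path

        validate(extract())

    batch_ingest()
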
Massive-Scale Data Collection & Ingestion:
  • Design and build distributed web scraping clusters to extract data from millions of pages.
  • Build LLM-aided data filtering systems that use automated model scoring to evaluate and prioritize high-quality content (an illustrative sketch follows this list).
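
Also purely illustrative: a minimal Scrapy spider of the kind that could feed the scraping and LLM-aided filtering work above. The seed URL, CSS selectors, and quality threshold are assumptions, and the LLM scorer is stubbed with a placeholder heuristic.

    # Illustrative sketch; the URL, selectors, and threshold are hypothetical.
    import scrapy

    def llm_quality_score(text: str) -> float:
        """Stub for an LLM-aided scorer that rates content quality in [0, 1]."""
        return 1.0 if len(text.split()) > 50 else 0.0  # placeholder heuristic

    class TablePageSpider(scrapy.Spider):
        name = "table_pages"
        start_urls = ["https://example.com/datasets"]  # hypothetical seed URL

        def parse(self, response):
            text = " ".join(response.css("p::text").getall())
            if llm_quality_score(text) >= 0.5:  # keep only high-scoring pages
                yield {"url": response.url, "text": text}
            # Follow pagination links to reach the rest of the site.
            for href in response.css("a.next::attr(href)").getall():
                yield response.follow(href, callback=self.parse)
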
Understanding of ML concepts and algorithms:
  • Fair understanding of machine learning concepts, training workflows, and algorithms, with familiarity with tools such as PyTorch and Hugging Face.
Documentation & Reporting:
  • Create clear documentation of data pipelines, workflows, and system architectures to enable smooth handovers and collaboration across teams.
Qualifications:
  • Bachelor's degree in Computer Science, Software Engineering, Data Engineering, or a related field, with a strong foundation in distributed systems and data processing.
  • Expert proficiency in scaling data pipelines and machine learning systems to handle billions of rows in enterprise environments.
  • 3+ years of experience in building scalable data solutions with Python and libraries such as:
    Data Science Libraries: Pandas, NumPy, scikit-learn
    Deep Learning Libraries: PyTorch
    Scaling Libraries: Spark, Dask, etc.
    Orchestration Tools: Airflow, Dagster, etc.
    Data Validation: Pandera, Pydantic, etc.
  • Expertise in automated data quality frameworks, including rule-based and AI-based automation for format validation, anomaly detection, and statistical validation.
  • Proficiency in building ETL/ELT pipelines and managing data across relational databases (e.g. PostgreSQL, Oracle Database, SQL Server, MySQL), data lakes (e.g. Delta Lake) and cloud storage.
  • Experience in building data monitoring and alerting systems.
  • Hands-on experience with web scraping tools (Scrapy, Selenium, Puppeteer).
  • Experience building ML data pipelines and supporting infrastructure for training and deploying machine learning models at scale.
Good to Have:
  • Experience with data governance frameworks and compliance requirements (GDPR, CCPA, PDPA) in data processing systems.
  • Experience with containerization and orchestration using Docker, Kubernetes, and cloud-native deployment strategies.
  • Strong knowledge of cloud platforms (AWS, GCP, Azure) and their data services (S3, BigQuery, Data Lake Storage, etc).
Why Join Us:

This is a unique opportunity for someone looking to actively build and scale systems in a fast-moving start-up. If you’ve successfully scaled machine learning and data systems to billions of rows and thrive in a dynamic, hands-on environment, this role is for you.

Benefits:
  • Flexible time-off arrangements
  • Flexible work arrangements - work from office at One North or WFH on some days
  • Equity eligibility: Competitive equity packages, with grant size evaluated based on the candidate’s experience, skills, and impact.
How to apply:

Does this role sound like a good fit for you?

  • We see this first: Submit your application here.
  • We see this last: If the above does not work, you may email your CV (PDF format) to jobs@betterdata.ai.
    Include the title of the role in your subject line.
    Indicate your available start and end dates (DDMMYY - DDMMYY).
    Send along links or supporting information that best showcase the relevant things you have built and done.