Senior Backend Engineer, Data Mining

MOTIONAL SINGAPORE PTE. LIMITED

Singapore

On-site

SGD 80,000 - 110,000

Full time

Today

Job summary

An innovative technology company in Singapore seeks a Senior Backend Engineer to design and build backend systems for data mining and analysis of multimodal sensor data. Candidates should have extensive experience with distributed systems, particularly with Ray or Spark, and strong skills in Python and SQL. The role involves optimizing data pipelines and collaborating with ML engineers on production reliability and data quality.

Qualifications

6+ years designing, building, and operating large-scale distributed systems.
Deep expertise with Ray or Spark for large-scale inference workloads.
Experience optimizing production data pipelines.

Responsibilities

Architect and build the backend systems for OmniTag.
Scale multimodal data pipelines for heterogeneous data.
Enhance the billion-scale vector search engine.

Skills

Python proficiency

Distributed systems design

SQL

Data processing with Ray/Spark

Cloud infrastructure (AWS)

Data pipeline optimization

Education

BS in Computer Science or related field

Tools

Ray

Spark

AWS (S3, EC2, EKS)

Mission Summary

At Motional, we’re transforming how autonomous vehicles discover critical intelligence hidden within petabytes of multimodal sensor data. Our next-generation autonomous driving stack depends on finding the rare edge cases, long-tail scenarios, and model errors that matter most. OmniTag, our ML-powered multimodal data mining framework, is the engine that powers this discovery. As a Senior Backend Engineer on the Data Mining team, you’ll architect and own the production systems that enable data scientists and ML engineers to rapidly mine, analyze, and extract insights from billions of data points across cameras, LiDAR, radar, and other modalities. You won’t just maintain a platform; you’ll evolve its core foundation, ensuring OmniTag scales to support Motional’s most ambitious autonomy challenges. Your work directly impacts the quality and speed at which we improve our perception and planning models.

What You’ll Do

Architect the OmniTag Engine: Design and build the high-throughput, low-latency backend systems that execute billion-scale inference across Ray/Spark, transforming raw sensor data into unified multimodal representations. You’ll optimize for both query latency and resource efficiency in a cost-sensitive, cloud-based environment.
Scale Multimodal Data Pipelines: Own the complete data journey, from ingestion, normalization, and preprocessing of heterogeneous modalities (image, video, LiDAR, audio) through encoding, indexing, and cached embedding storage. Ensure pipelines are robust, observable, and meet the SLOs expected by downstream ML teams.
Evolve the Vector Search and Retrieval Engine: Enhance our in-house billion-scale vector search engine to power RAG-driven few-shot dataset creation. Optimize embedding storage, retrieval performance, and filtering across billions of examples to enable rapid interactive mining workflows.
Own Data Quality and Observability: Build comprehensive monitoring, logging, and alerting for multimodal data preprocessing pipelines. Develop data validation frameworks that catch regressions in data alignment, normalization, or encoding quality—critical for maintaining model performance.
Collaborate on Encoder-Decoder Adaptation: Work closely with ML engineers to support domain-specific fine-tuning workflows, model versioning, and A/B testing of new encoders and decoders. Ensure the backend infrastructure enables rapid experimentation with emerging open-source multimodal foundation models.
Drive Production Reliability: Establish patterns for graceful degradation, fault tolerance, and cost optimization. Operate OmniTag as a mission-critical data platform serving the entire ML organization, with a focus on reliability, debuggability, and operational excellence.

What We’re Looking For

BS in Computer Science or a related field, or equivalent professional experience
6+ years designing, building, and operating large-scale distributed systems in production environments
Deep, hands-on expertise with Ray or Spark (or both) for distributed data processing and large-scale inference workloads
Expert-level Python proficiency with strong software engineering fundamentals: testing (unit, integration, and end-to-end), CI/CD pipelines, containerization, and code review practices
Proven experience optimizing and scaling production data pipelines that process terabytes or petabytes of data
Strong SQL and data manipulation skills; comfort with both structured and semi-structured data
Experience with cloud infrastructure (AWS preferred: S3, EC2, EKS, EMR, IAM) and infrastructure-as-code patterns
Demonstrated track record of shipping robust, well-tested, production-grade systems and mentoring junior engineers

Bonus Points

MS/PhD in Computer Science, Machine Learning, or a related field.
Experience building or scaling vector databases, large-scale information retrieval systems, or similarity search engines.
Hands-on work with multimodal machine learning models, foundation models (LLMs/VLMs), or embeddings-based systems.
Familiarity with ML frameworks (PyTorch, JAX) and the ecosystem around multimodal models.
Production experience with workflow orchestration (Airflow, Kubeflow, Dagster) and stream processing (Kafka, Flink).
Understanding of model serving patterns, feature stores, or ML ops infrastructure.
Domain knowledge in autonomous driving, computer vision, or sensor fusion.
Experience with ML-based data mining, active learning, or contrastive learning approaches.
