ML Data Infrastructure Engineer
iitjobs
United States
Remote
USD 90,000 - 150,000
Full time
Job summary
An established industry player is seeking a skilled data engineer to design and implement scalable data processing pipelines for machine learning. This role involves building and maintaining feature stores, developing data quality monitoring frameworks, and creating systems for dataset versioning and lineage tracking. The ideal candidate will have extensive experience with GCP's data infrastructure and be proficient in Python and SQL. Join a dynamic team where your contributions will significantly impact data-driven projects and innovations in machine learning.
Qualifications
- 7+ years of software engineering experience with a focus on data infrastructure.
- Expertise in GCP's data and ML infrastructure including BigQuery and Dataflow.
Responsibilities
- Design and implement scalable data processing pipelines for ML training.
- Build and maintain feature stores for batch and real-time features.
Skills
Software Engineering
Data Infrastructure
GCP (Google Cloud Platform)
Python
SQL
Data Processing Frameworks (Spark, Beam, Flink)
Data Quality Monitoring
Data Pipeline Orchestration (Airflow, Dagster)
Tools
BigQuery
Dataflow
Cloud Storage
Vertex AI Feature Store
Cloud Composer
Dataproc
Kafka
Kinesis
Responsibilities:
- Design and implement scalable data processing pipelines for ML training and validation
- Build and maintain feature stores with support for both batch and real-time features
- Develop data quality monitoring, validation, and testing frameworks
- Create systems for dataset versioning, lineage tracking, and reproducibility
- Implement automated data documentation and discovery tools
- Design efficient data storage and access patterns for ML workloads
- Partner with data scientists to optimize data preparation workflows
Technical Requirements:
- 7+ years of software engineering experience, with 3+ years in data infrastructure
- Strong expertise in GCP's data and ML infrastructure:
  - BigQuery for data warehousing
  - Dataflow for data processing
  - Cloud Storage for data lakes
  - Vertex AI Feature Store
  - Cloud Composer (managed Airflow)
  - Dataproc for Spark workloads
- Deep expertise in data processing frameworks (Spark, Beam, Flink)
- Experience with feature stores (Feast, Tecton) and data versioning tools
- Proficiency in Python and SQL
- Experience with data quality and testing frameworks
- Knowledge of data pipeline orchestration (Airflow, Dagster)
Nice to Have:
- Experience with streaming systems (Pub/Sub, Kafka, Kinesis)
- Experience with GCP-specific security and IAM best practices
- Knowledge of Cloud Logging and Cloud Monitoring for data pipelines
- Familiarity with Cloud Build and Cloud Deploy for CI/CD
- Knowledge of ML metadata management systems
- Familiarity with data governance and security requirements
- Experience with dbt or similar data transformation tools