The Databricks Data Engineer will be responsible for designing, developing, and maintaining scalable, high-performance data pipelines on the Databricks Lakehouse Platform. This role involves using Apache Spark, Delta Lake, and various Databricks services to process large volumes of batch and streaming data, ensuring data quality, reliability, and accessibility for data consumers.
Key Responsibilities
- Data Pipeline Development: Design, build, and maintain robust and scalable ETL/ELT pipelines using Databricks, PySpark/Scala, and SQL to ingest, transform, and load data from diverse sources (e.g., databases, APIs, streaming services) into Delta Lake (see the illustrative sketch after this list).
- Databricks Ecosystem Utilization: Leverage core Databricks features such as Delta Lake, Databricks Workflows (Jobs), Databricks SQL, and Unity Catalog for pipeline orchestration, data management, and governance.
- Performance Optimization: Tune and optimize Spark jobs and Databricks clusters for maximum efficiency, performance, and cost-effectiveness.
- Data Quality and Governance: Implement data quality checks, validation rules, and observability frameworks. Adhere to data governance policies and leverage Unity Catalog for fine-grained access control.
- Collaboration: Work closely with Data Scientists, Data Analysts, and business stakeholders to translate data requirements into technical solutions and ensure data is structured to support analytics and machine learning use cases.
- Automation & DevOps: Implement CI/CD and DataOps principles for automated deployment, testing, and monitoring of data solutions.
- Documentation: Create and maintain technical documentation for data pipelines, data models, and processes.
- Troubleshooting: Monitor production pipelines, troubleshoot complex issues, and perform root cause analysis to ensure system reliability and stability.
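For illustration, the kind of batch ETL step described in the Data Pipeline Development bullet above might look like the following minimal PySpark sketch. This is a hedged example only: the storage path, table name, and column names are hypothetical placeholders, not part of any actual pipeline for this role.

```python
# Minimal illustrative sketch of a batch ETL step into Delta Lake.
# All paths, table names, and column names below are hypothetical.
from pyspark.sql import SparkSession, functions as F

# On Databricks the SparkSession is provided; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Ingest: read raw JSON files landed in cloud storage (hypothetical path).
raw_orders = spark.read.format("json").load("/mnt/raw/orders/")

# Transform: basic type casting plus a simple data quality filter.
clean_orders = (
    raw_orders
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())
)

# Load: append into a managed Delta table for downstream consumers.
(
    clean_orders.write
    .format("delta")
    .mode("append")
    .saveAsTable("analytics.clean_orders")
)
```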
Qualifications
Required Skills & Experience
- 5+ years of hands‑on experience in Data Engineering.
- 3+ years of dedicated experience building solutions on the Databricks Lakehouse Platform.
- Expert proficiency in Python (PySpark) and SQL for data manipulation and transformation.
- In-depth knowledge of Apache Spark and distributed computing principles.
- Experience with Delta Lake and Lakehouse architecture.
- Strong understanding of ETL/ELT processes, data warehousing, and data modeling concepts.
- Familiarity with at least one major cloud platform (AWS, Azure, or GCP) and its relevant data services.
Preferred Skills & Certifications
- Experience with Databricks features like Delta Live Tables (DLT), Databricks Workflows, and Unity Catalog (see the sketch after this list).
- Experience with streaming technologies (e.g., Kafka, Spark Streaming).
- Familiarity with CI/CD tools and Infrastructure-as-Code (e.g., Terraform, Databricks Asset Bundles).
- Databricks Certified Data Engineer Associate or Professional certification.
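As a flavor of the preferred Delta Live Tables experience, a declarative pipeline might be sketched as below. This is an assumption-laden sketch: the dataset names, storage path, and expectation rule are hypothetical, and the `dlt` module and `spark` session are supplied by the Databricks pipeline runtime rather than defined here.

```python
# Illustrative Delta Live Tables sketch; dataset names, the storage path,
# and the expectation rule are hypothetical. `dlt` and `spark` are provided
# by the Databricks pipeline runtime.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw events ingested from cloud storage (hypothetical path)")
def raw_events():
    return spark.read.format("json").load("/mnt/raw/events/")

@dlt.table(comment="Validated events for downstream analytics")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")
def clean_events():
    return (
        dlt.read("raw_events")
        .withColumn("event_ts", F.to_timestamp("event_ts"))
    )
```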