Job Description:
Responsibilities:
- Develop, optimize, and maintain ETL/ELT pipelines using PySpark and SQL.
- Work with structured and unstructured data to build scalable data solutions.
- Write efficient and scalable PySpark scripts for data transformation and processing.
- Optimize SQL queries, stored procedures, and indexing strategies to enhance performance.
- Design and implement data models, schemas, and partitioning strategies for large-scale datasets.
- Collaborate with Data Scientists, Analysts, and other Engineers to integrate data workflows.
- Ensure data quality, validation, and consistency in data pipelines.
- Implement error handling, logging, and monitoring for data pipelines.
- Work with cloud platforms (AWS, Azure, or GCP) for data processing and storage.
- Optimize data pipelines for cost efficiency and performance.
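For illustration only, a minimal sketch of the kind of PySpark transformation work described above; the paths, table names, and columns are hypothetical, not part of this role's actual codebase:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical example: read raw orders, apply basic data-quality rules,
# and write a partitioned curated table.
spark = SparkSession.builder.appName("orders_etl").getOrCreate()

raw = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical input path

cleaned = (
    raw
    .dropDuplicates(["order_id"])                      # basic deduplication for data quality
    .filter(F.col("amount").isNotNull())               # drop rows missing a required field
    .withColumn("order_date", F.to_date("order_ts"))   # derive a partition column
)

(
    cleaned.write
    .mode("overwrite")
    .partitionBy("order_date")                          # partitioning strategy for large datasets
    .parquet("s3://example-bucket/curated/orders/")     # hypothetical output path
)
```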
Technical Skills Required:
- Strong experience in Python for data engineering tasks.
- Proficiency in PySpark for large-scale data processing.
- Deep understanding of SQL (Joins, Window Functions, CTEs, Query Optimization).
- Experience in ETL/ELT development using Spark and SQL.
- Experience with cloud data services (AWS Glue, Databricks, Azure Synapse, GCP BigQuery).
- Familiarity with orchestration tools (Apache Airflow, Apache Oozie).
- Experience with data warehousing (Snowflake, Redshift, BigQuery).
- Understanding of performance tuning in PySpark and SQL.
- Familiarity with version control (Git) and CI/CD pipelines.
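As a hedged example of the SQL depth expected (CTEs and window functions), the snippet below runs a ranking query through PySpark; it assumes a table or temp view named `sales` with hypothetical columns is already registered:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_example").getOrCreate()

# Hypothetical table: sales(customer_id, order_id, order_ts, amount),
# assumed to be registered as a table or temporary view.
# CTE plus a window function: latest order per customer, ranked by recency.
latest_orders = spark.sql("""
    WITH ranked AS (
        SELECT
            customer_id,
            order_id,
            amount,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY order_ts DESC
            ) AS rn
        FROM sales
    )
    SELECT customer_id, order_id, amount
    FROM ranked
    WHERE rn = 1
""")

latest_orders.show()
```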