Data Engineer - Pyspark

Virtusa

Dubai

On-site

AED 120,000 - 200,000

Full time

3 days ago

Job summary

An innovative technology company is seeking a highly skilled Data Engineer to enhance its data engineering team in Dubai. The ideal candidate will design and maintain ETL pipelines using PySpark on the Cloudera Data Platform while ensuring high data quality and availability. With a focus on big data ecosystems and cloud-native tools, this role offers the opportunity to implement best practices and drive impactful business outcomes.

Qualifications

  • At least 3 years of experience as a Data Engineer with a focus on PySpark and Cloudera Data Platform.
  • Experience with data ingestion, transformation, and optimization on the Cloudera Data Platform.
  • Strong Linux scripting skills.

Responsibilities

  • Design, develop, and maintain scalable ETL pipelines using PySpark.
  • Implement data ingestion processes from various sources to the data lake or data warehouse.
  • Ensure data accuracy and reliability through quality checks and validation routines.

Skills

PySpark
Cloudera Data Platform
Hadoop
Kafka
Linux Scripting

Education

Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field

Tools

Apache Oozie
Airflow

Job description

About the Role

We are seeking a highly skilled Data Engineer with deep expertise in PySpark and the Cloudera Data Platform (CDP) to join our data engineering team. As a Data Engineer, you will be responsible for designing, developing, and maintaining scalable data pipelines that ensure high data quality and availability across the organization.

This role requires a strong background in big data ecosystems, cloud-native tools, and advanced data processing. The ideal candidate has hands-on experience with data ingestion, transformation, and optimization on the Cloudera Data Platform, along with a proven track record of implementing data engineering best practices. You will work closely with other data engineers to build solutions that drive impactful business outcomes.

Pipeline Development

  • Design, develop, and maintain highly scalable and optimized ETL pipelines using PySpark on the Cloudera Data Platform, ensuring data integrity.

Ingestion

  • Implement and manage data ingestion processes from various sources (e.g., relational databases, APIs, file systems) to the data lake or data warehouse.

Transformation and Processing

  • Use PySpark to process, cleanse, and transform large datasets into formats that support analytical needs and business insights.

Optimization

  • Conduct performance tuning of PySpark code and Cloudera components to optimize resource utilization and reduce ETL runtime.

Quality and Validation

  • Implement data quality checks, monitoring, and validation routines to ensure data accuracy and reliability.

Orchestration

  • Automate data workflows within the Cloudera environment using orchestration tools such as Apache Oozie or Airflow.

Experience & Skills

  • Bachelor's or Master's degree in Computer Science, Data Engineering, Information Systems, or a related field.
  • At least 3 years of experience as a Data Engineer with a focus on PySpark and Cloudera Data Platform.

Skills:

  • PySpark: Advanced proficiency, including working with RDDs, DataFrames, and optimization.
  • Cloudera Data Platform (CDP): Strong experience with components like Cloudera Manager, Hive, Impala, HDFS, and Data Warehousing.
  • Big Data Technologies: Familiarity with Hadoop, Kafka, and distributed computing.
  • Scheduling & Automation: Experience with Apache Oozie, Airflow, or similar tools.
  • Scripting: Strong Linux scripting skills.

Employment Type: Full-Time

Experience: 3+ years

Vacancy: 1
