Job Description - Senior Software Engineer (CREQ242894)
- Kafka, Confluent
- Real-Time, Streaming Data Ingestion
- Producers, Consumers, Topics
- Data Lake Integration, Lakehouse Integration
Databricks Engineer | Positions: 2
Overview
Duration: 3 Months | Location: Dubai
The bank is building a scalable data ingestion and streaming platform that ingests change data capture (CDC) events from diverse source systems (databases and applications), processes them in real time and lands curated data in our analytics lake.
Responsibilities
- Design and develop streaming ingestion pipelines. Use Apache Spark (Structured Streaming) and Databricks Auto Loader to consume files from cloud storage or messages from Kafka/RabbitMQ/Confluent Cloud and ingest them into Delta Lake, supporting schema evolution and exactly‑once semantics.
- Implement CDC and deduplication logic. Capture change events from source databases using Debezium, the built‑in CDC features of SQL Server/Oracle or other connectors. Apply watermarking and duplicate‑dropping strategies based on primary keys and event timestamps.
- Ensure data quality and fault tolerance. Configure checkpointing, error handling and dead‑letter queues (DLQ) so that malformed or late data can be quarantined and replayed. Optimize file sizes, partitioning and clustering to maintain performance.
- Scale ingestion through configuration. Build a config‑driven framework (e.g., using Airflow, DBX Jobs or Delta Live Tables) that iterates over metadata tables to deploy/update ingestion pipelines for hundreds of tables/sources without code duplication.
- Collaborate on architecture and orchestration. Contribute to the overall data platform architecture (integrating data sources, message queues, processing engines and storage) and define orchestration patterns for backfill, replay and streaming jobs.
- Implement monitoring, observability and security. Capture streaming query metrics and publish them to monitoring platforms (Prometheus, Grafana). Set up dashboards for lag, files processed and processing duration. Enforce role‑based access control, encryption and data masking.
- Work with data consumers. Partner with analytics teams, data scientists and downstream application developers to ensure that ingested data meets their requirements. Provide documentation, metadata and lineage for all tables.
- Participate in DevOps processes. Use CI/CD pipelines (e.g., Jenkins, GitHub Actions) to automate deployment of jobs; manage infrastructure with Terraform or similar tools; follow best practices for version control and code reviews.
Required Skills & Experience
- 5–8 years of experience designing and building data pipelines using Apache Spark, Databricks or equivalent big‑data frameworks.
- Hands‑on expertise with streaming and messaging systems such as Apache Kafka (publish‑subscribe architecture), Confluent Cloud, RabbitMQ or Azure Event Hubs. Experience creating producers, consumers and topics and integrating them into downstream processing.
- Deep understanding of relational databases and CDC. Proficiency in SQL Server, Oracle or other RDBMSs; experience capturing change events using Debezium or native CDC tools and transforming them for downstream consumption.
- Proficiency in programming languages such as Python, Scala or Java and solid knowledge of SQL for data manipulation and transformation.
- Cloud platform expertise. Experience with Azure or AWS services for data storage, compute and orchestration (e.g., ADLS, S3, Azure Data Factory, AWS Glue, Airflow, DBX, DLT).
- Data modelling and warehousing. Knowledge of data lakehouse architectures, Delta Lake, partitioning strategies and performance optimisation.
- Version control and DevOps. Familiarity with Git and CI/CD pipelines; ability to automate deployment and manage infrastructure as code.
- Strong problem‑solving and communication skills. Ability to work with cross‑functional teams and articulate complex technical concepts to non‑technical stakeholders.
Preferred/Bonus Skills
- Experience with event‑driven architectures and microservices integration.
- Exposure to NiFi, Flume or other ingestion frameworks for connecting heterogeneous sources.
- Knowledge of graph processing or machine learning pipelines on Spark.