- Build and optimize ETL/ELT processes leveraging Databricks' native capabilities to handle large volumes of structured and unstructured data from various sources
- Implement data quality frameworks and monitoring solutions using Databricks data quality features to ensure data accuracy and reliability across all data products
- Establish best practices for data governance, security, and compliance within the Databricks ecosystem, and integrate them with enterprise systems
Operational Responsibilities:
- Monitor and maintain production data pipelines to ensure 99.9% uptime and optimal performance across all Databricks workloads and clusters
- Implement comprehensive logging, alerting, and monitoring systems using Databricks monitoring capabilities and integration with enterprise monitoring tools
- Perform regular health checks on Databricks cluster performance, job execution times, and resource utilization to identify and resolve bottlenecks proactively
- Manage incident response procedures for Databricks pipeline failures, including root cause analysis, resolution, and post-incident reviews
- Establish and maintain disaster recovery procedures and backup strategies for critical data assets within the Databricks environment
- Conduct regular performance tuning of Spark jobs and Databricks cluster configurations to optimize cost and execution efficiency
- Implement automated testing frameworks for Databricks-based data pipelines, including unit tests, integration tests, and data validation checks
- Maintain comprehensive documentation for all Databricks operational procedures, runbooks, and troubleshooting guides
- Coordinate scheduled maintenance windows and Databricks system upgrades with minimal business impact
- Manage user access controls, workspace configurations, and security policies within Databricks environments
- Monitor data lineage using Databricks Unity Catalog and maintain metadata management systems to support operational transparency and compliance requirements
- Establish capacity planning processes to forecast Databricks infrastructure needs and manage cloud costs effectively
- Provide technical guidance and mentorship to junior team members on Databricks best practices and data engineering principles
- Participate in the on-call rotation for critical production systems, with a focus on Databricks platform stability
- Lead operational reviews and contribute to continuous improvement initiatives for Databricks platform reliability and efficiency
- Coordinate with infrastructure teams on Databricks cluster provisioning, network configurations, and security implementations
Requirements / Qualifications:
Education & Experience:
- Degree in Computer Science or Computer Engineering
- Minimum of 8-10 years of working experience in system operations, compliance, and management
- Hands-on project experience with the Databricks platform (primary requirement)
- Project experience in cloud operations or cloud architecture
- Must hold an AWS cloud certification
Core Technical Skills:
- Expert-level proficiency in the Databricks platform, including workspace management, cluster configuration, and job orchestration
- Strong expertise in Apache Spark within the Databricks environment, including Spark SQL, DataFrames, and RDDs
- Extensive experience with Delta Lake, including data versioning, time travel, and ACID transactions
- Proficiency in Databricks Unity Catalog for data governance and metadata management
- In-depth understanding of data warehouse concepts, data profiling, data verification, and advanced analytics techniques
- Strong knowledge of monitoring, incident management, and cloud cost control
- Databricks (primary and most critical skill)
- AWS cloud services and architecture
- IDMC (Informatica Intelligent Data Management Cloud)
- MLOps practices within the Databricks environment (Good to have)
- Stata for statistical analysis (Good to have)
- Amazon SageMaker integration with Databricks (Good to have)
- DataRobot platform integration (Good to have)
Soft Skills & Stakeholder Management:
- Good interpersonal skills with the ability to work with different groups of stakeholders
- Strong problem-solving skills and ability to work independently in a fast-paced environment with minimal supervision
- Excellent communication skills for technical documentation and cross-team collaboration
- Databricks certification (Associate or Professional level) - highly preferred
- Exposure to hospital information/clinical systems is an added advantage
- Understanding of DevOps practices and CI/CD pipelines for Databricks-based data engineering projects
- Knowledge of ITIL frameworks and operational best practices