Site Reliability Engineer

Atribs Metscon

Abu Dhabi

On-site

AED 200,000 - 300,000

Full time

Today

Job summary

A financial technology company in Abu Dhabi is seeking a talented Site Reliability Engineer (SRE) to ensure the reliability and performance of its banking services. You will define SLIs, lead incident management, and automate operations while working closely with cross-functional teams to drive improvements. Ideal candidates will have 5+ years of experience in SRE or DevOps, strong AWS skills, and proficiency with CI/CD pipelines. This role promises significant involvement in a dynamic, innovative environment.

Qualifications

  • 5+ years of experience in SRE or DevOps roles.
  • Strong experience with performance troubleshooting.
  • Proven expertise in Infrastructure as Code (IaC).

Responsibilities

  • Define and implement SLIs / SLOs for digital banking services.
  • Lead incident management and root-cause analysis.
  • Automate operational processes and health checks.

Skills

Linux environments
Terraform
Kubernetes
AWS
Python
CI/CD pipelines

Education

Bachelor's degree in Computer Science

Tools

Dynatrace
Prometheus
Grafana
ELK stack

Job description

Site Reliability Engineer (SRE)

From designing fault-tolerant architectures to leading incident responses, you’ll have the freedom to shape how we deliver stable, secure, and high-performance banking services.

About the Role

We’re looking for a talented Site Reliability Engineer (SRE) to keep our systems running smoothly, reliably, and at scale. Through smart automation, deep observability, and a calm head in a crisis, you’ll help us balance speed, compliance, and stability, working alongside DevOps, Cloud, Quality Engineering, and Product teams to drive continuous improvements in performance, security, and resilience. You’ll play a key role in enhancing reliability, accelerating delivery, and ensuring seamless digital experiences for ADCB customers. This role reports directly to our Lead SRE / Tribe Executive Manager.

What You Will Be Doing

  • Define and implement SLIs / SLOs and error budgets for business‑critical digital banking services.
  • Build actionable observability (metrics, logs, traces, dashboards, and alerts) using Dynatrace, Prometheus, Grafana, and ELK, while reducing alert fatigue.
  • Leverage AI-driven insights and anomaly detection (Dynatrace Davis AI or equivalent AIOps platform) to proactively predict and resolve reliability issues before impact.
  • Lead incident management — from on‑call triage and root‑cause analysis to blameless post‑mortems with actionable follow‑ups.
  • Improve deployment safety with robust rollout / rollback strategies, canary and blue‑green deployments, and production readiness reviews.
  • Support and optimize microservices‑based architectures, ensuring service reliability, scalability, and inter‑service resilience.
  • Conduct capacity planning, performance tuning, and resilience testing, optimizing for both reliability and cost efficiency.
  • Automate operational toil — from runbooks and remediation scripts to proactive health checks and self‑healing workflows.
  • Collaborate with DevOps to embed reliability gates and validations into CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
  • Own and evolve the observability and AIOps stack, driving intelligent automation and predictive alerting capabilities.
  • Maintain high‑quality documentation, playbooks, and operational standards across environments.
  • Ensure operational compliance and security alignment with internal controls and regulatory standards.
  • Analyze system performance, availability, and cost data to continually optimize operations.
  • Provide reliability support and escalation guidance for critical production systems during major incidents.

Requirements – Experience and Qualifications

  • 5+ years of experience in SRE or DevOps roles, building and managing large‑scale, high‑availability systems across banking, fintech, e‑commerce, or other data‑intensive digital ecosystems.
  • Bachelor’s degree in Computer Science or equivalent technical experience.
  • Strong experience with Linux environments and performance troubleshooting.
  • Proven expertise in Terraform and Infrastructure as Code (IaC) methodologies.
  • Proficiency with Kubernetes and container orchestration in microservices environments.
  • Hands‑on experience with AWS (preferred); exposure to Azure or GCP is an advantage.
  • Deep knowledge of Dynatrace (AIOps, Davis AI), Prometheus, Grafana, and the ELK stack.
  • Experience implementing AI / ML‑driven reliability or automation solutions (AIOps, anomaly detection, predictive alerting).
  • Practical understanding of CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI/CD, or Azure DevOps).
  • Experience with messaging and caching systems (Kafka, RabbitMQ, Redis) and managed databases (Amazon Aurora, RDS).
  • Strong scripting or programming skills in Python, Bash, or Go.

The Ideal Candidate

  • Organized, structured, and meticulous in approach.
  • Experienced in cross‑functional collaboration and working with distributed teams.
  • Strong analytical mindset with excellent troubleshooting skills for complex production systems.
  • Calm and composed communicator under pressure, capable of leading during high‑impact incidents.
  • Proactive problem‑solver who anticipates issues and drives preventive improvements.
  • Passionate about AI‑driven automation, observability, and reliability engineering.
  • Continuously learning, keeping up‑to‑date with cloud‑native, microservices, and SRE best practices.
  • A collaborative and adaptable team player who thrives in a fast‑paced, regulated environment and is passionate about building reliable, scalable systems that empower digital banking innovation.