Lead Site Reliability Engineer (SRE)

U X E SECURITY SOLUTIONS L.L.C

Dubai

On-site

AED 200,000 - 300,000

Full time

9 days ago

Job summary

A security solutions provider in Dubai seeks a Lead Site Reliability Engineer to manage mission-critical SaaS/PaaS infrastructure. The role involves ensuring system resilience, scalability, and security while delivering 99.99% uptime. Ideal candidates will have significant experience with Kubernetes, distributed databases, and automation tools. This is a full-time position suitable for those looking to make an impact in government and enterprise sectors.

Qualifications

Minimum of 8 years of experience in SRE, DevOps, or Systems Engineering.
Expert-level mastery of self-hosted Kubernetes.
Proficiency in Linux administration and server hardening.

Responsibilities

Design and maintain Kubernetes-based infrastructure.
Automate infrastructure provisioning and application deployment.
Manage distributed database clusters, ensuring data integrity and performance.

Skills

Kubernetes

Linux administration

Scripting skills (Bash, Python, Go)

Distributed databases (Cassandra, PostgreSQL)

Disaster recovery planning

Education

Bachelor’s degree in Computer Science or related field

Tools

Kubernetes distributions (e.g., Rancher)

Terraform

Pulumi

Ansible

Overview

Lead Site Reliability Engineer (SRE) role at U X E SECURITY SOLUTIONS L.L.C. Emcode architects and operates the sovereign telematics backbone for the UAE's government and enterprise entities. The Lead SRE takes ownership of the deployment, maintenance, and resilience of our mission-critical SaaS / PaaS infrastructure to deliver 99.99% uptime and ensure data sovereignty.

The Lead SRE is the subject matter expert for the production environment, ensuring the scalability, security, and performance of platforms that process data at scale. The ideal candidate has expertise in Kubernetes distributions (e.g., Rancher, OpenShift) and experience managing large-scale distributed data systems (Cassandra, ScyllaDB, PostgreSQL). Responsibilities include automation, security hardening, and disaster recovery for national-scale telematics ecosystems, including SecurePath and Shahin.

Key Responsibilities

Design, deploy, and maintain the scalable, secure, and resilient Rancher Kubernetes-based infrastructure for all Emcode SaaS and PaaS offerings.
Automate infrastructure provisioning, configuration management, and application deployment pipelines to enhance velocity and reliability.
Manage and optimize high-throughput distributed database, search, and messaging clusters (Cassandra, ScyllaDB, PostgreSQL, MongoDB, Elasticsearch, Kafka and/or RabbitMQ), ensuring data integrity and performance.
Develop and maintain sophisticated monitoring, logging, and alerting systems to ensure proactive issue identification and resolution.

System Resilience & Security

Master and manage all aspects of our Linux-based environment (primarily Ubuntu Server), ensuring systems are hardened, patched, and configured according to industry best practices.
Architect, implement, and regularly test disaster recovery and business continuity plans to uphold the 99.99% uptime SLA.
Implement and enforce rigorous security protocols across the infrastructure, protecting sensitive telematics data and ensuring compliance with DESC and SIRA standards.
Conduct performance tuning, capacity planning, and cost optimization for sovereign self-hosted and cloud infrastructure.

Operational Excellence & Collaboration

Serve as the highest point of escalation for complex infrastructure-related incidents, leading troubleshooting and resolution efforts.
Collaborate with Software Engineering to refine CI/CD pipelines for Go-based microservices and other applications.
Create and maintain detailed documentation for infrastructure architecture, system configurations, and operational procedures.
Provide mentorship and technical guidance to other members of the technology team.

Required Qualifications & Experience

Bachelor’s degree in Computer Science, Systems Engineering, or related technical field.
Minimum of 8 years of hands-on experience in SRE, DevOps, or Systems Engineering in large-scale 24/7 production environments.
Expert-level mastery of self-hosted Kubernetes, including cluster design, deployment, scaling, and security.
Experience deploying and managing large-scale distributed NoSQL databases (Cassandra, ScyllaDB) and relational databases (PostgreSQL).
Proficiency in Linux administration (Ubuntu Server and/or SUSE), including server hardening and understanding of hard and soft limits.
Strong scripting and automation skills (Bash, Python, Go) and experience with IaC tools (Pulumi, Terraform, Ansible).
Deep proficiency with software load balancing (HAProxy) and hardware load balancing (FortiGate, Barracuda, Palo Alto), with prior experience in high-load deployments.
Experience with NAS, SAN storage, and self-hosted S3 solutions (e.g., MinIO).

Desired Skills & Competencies

Security acumen: network security, access control, vulnerability management.
Disaster recovery: experience designing and executing DR drills for mission-critical systems.
Database management: performance tuning, replication, backup/recovery strategies.
Go (Golang) and Python: familiarity with both languages and experience supporting Go-based applications.
Problem-solving: strong analytical and troubleshooting skills across the technology stack.
Ownership: proactive and accountable, thrives in high-stakes environments.
Collaboration: ability to work with software development teams to promote reliability and operational excellence.

Seniority level: Mid-Senior level

Employment type: Full-time

Job function: Engineering and Information Technology

Industries: Security Systems Services
