Overview
Lead Site Reliability Engineer (SRE) role at U X E SECURITY SOLUTIONS L.L.C. Emcode architects and operates the sovereign telematics backbone for the UAE's government and enterprise entities. The Lead SRE owns the deployment, maintenance, and resilience of our mission-critical SaaS/PaaS infrastructure, delivering 99.99% uptime and ensuring data sovereignty.
The Lead SRE is the subject matter expert for the production environment, ensuring the scalability, security, and performance of platforms that process data at scale. The ideal candidate has expertise in Kubernetes distributions (e.g., Rancher, OpenShift) and experience managing large-scale distributed data systems (Cassandra, ScyllaDB, PostgreSQL). Responsibilities include automation, security hardening, and disaster recovery for national-scale telematics ecosystems, including SecurePath and Shahin.
Key Responsibilities
- Design, deploy, and maintain the scalable, secure, and resilient Rancher Kubernetes-based infrastructure for all Emcode SaaS and PaaS offerings.
- Automate infrastructure provisioning, configuration management, and application deployment pipelines to enhance velocity and reliability.
- Manage and optimize high-throughput, distributed data stores and messaging systems (Cassandra, ScyllaDB, PostgreSQL, MongoDB, Elasticsearch, and Kafka and/or RabbitMQ), ensuring data integrity and performance.
- Develop and maintain sophisticated monitoring, logging, and alerting systems to ensure proactive issue identification and resolution.
System Resilience & Security
- Master and manage all aspects of our Linux-based environment (primarily Ubuntu Server), ensuring systems are hardened, patched, and configured according to industry best practices.
- Architect, implement, and regularly test disaster recovery and business continuity plans to uphold the 99.99% uptime SLA.
- Implement and enforce rigorous security protocols across the infrastructure, protecting sensitive telematics data and ensuring compliance with DESC and SIRA standards.
- Conduct performance tuning, capacity planning, and cost optimization for sovereign self-hosted and cloud infrastructure.
Operational Excellence & Collaboration
- Serve as the highest point of escalation for complex infrastructure-related incidents, leading troubleshooting and resolution efforts.
- Collaborate with Software Engineering to refine CI/CD pipelines for Go-based microservices and other applications.
- Create and maintain detailed documentation for infrastructure architecture, system configurations, and operational procedures.
- Provide mentorship and technical guidance to other members of the technology team.
Required Qualifications & Experience
- Bachelor’s degree in Computer Science, Systems Engineering, or a related technical field.
- Minimum of 8 years of hands-on experience in SRE, DevOps, or Systems Engineering in large-scale 24/7 production environments.
- Expert-level mastery of self-hosted Kubernetes, including cluster design, deployment, scaling, and security.
- Experience deploying and managing large-scale distributed NoSQL databases (Cassandra, ScyllaDB) and relational databases (PostgreSQL).
- Proficiency in Linux administration (Ubuntu Server and/or SUSE), including server hardening and an understanding of hard and soft resource limits.
- Strong scripting and automation skills (Bash, Python, Go) and experience with IaC tools (Pulumi, Terraform, Ansible).
- Deep proficiency with software load balancing (HAProxy) and hardware load balancing (FortiGate, Barracuda, Palo Alto), including prior high-load deployments.
- Experience with NAS, SAN storage, and self-hosted S3 solutions (e.g., MinIO).
Desired Skills & Competencies
- Security acumen: network security, access control, vulnerability management.
- Disaster recovery: experience designing and executing DR drills for mission-critical systems.
- Database management: performance tuning, replication, backup/recovery strategies.
- Go (Golang) and Python: familiarity with both languages and experience supporting Go-based applications.
- Problem-solving: strong analytical and troubleshooting skills across the technology stack.
- Ownership: proactive and accountable, thrives in high-stakes environments.
- Collaboration: ability to work with software development teams to promote reliability and operational excellence.
Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Security Systems Services