Senior Site Reliability Engineer (SRE)

Salla

Remote

SAR 93,000 - 150,000

Full time

30+ days ago

Job summary

A tech-driven company in Saudi Arabia is seeking a Senior Site Reliability Engineer to design, scale, and secure its platform infrastructure. The ideal candidate has 8+ years of experience in SRE, deep Kubernetes expertise, and strong experience with GitOps workflows, and will ensure availability and performance across critical systems. The role offers a dynamic work environment focused on automation and system reliability, with flexible work-from-home options.

Benefits

Comprehensive Training & Development programs

Performance-based Bonus incentives

Flexible Work From Home options

Qualifications

8+ years in SRE / DevOps / Infrastructure Engineering roles.
Deep Kubernetes expertise including multi-cluster and Helm chart development.
Strong monitoring/observability background with tools like Prometheus and Grafana.

Responsibilities

Design and maintain production workloads across Kubernetes clusters.
Implement backup and disaster recovery strategies for critical services.
Lead incident management and ensure system observability.

Skills

Kubernetes expertise

GitOps workflows

Scripting/automation skills in Python

CI/CD practices

Communication skills

Education

Bachelor's degree in Computer Science or Engineering

Tools

AWS

Terraform

Prometheus

Grafana

We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure. You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale. You'll be hands‑on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self‑healing environment. This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost‑efficiency in production.

Qualifications

Bachelor's degree in Computer Science, Engineering, or a related field — or equivalent work experience.

Platform & Infrastructure Reliability

Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS/AKS/GKE) clusters
Build self‑healing, auto‑scaling systems that minimize toil and manual intervention
Optimize networking, ingress/egress traffic control, and service mesh for secure & performant communication
Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments
Own backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets for critical data services
Optimize storage performance and cost through multi‑tier strategies, hot/cold data separation, and S3/offloading lifecycle policies
Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues)
Secure and scale object storage platforms (e.g., MinIO/S3‑compatible) and integrate with workloads for high‑throughput data pipelines
Work with block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost

Automation & Delivery

Champion GitOps and CI/CD best practices (ArgoCD, Flux, GitHub Actions)
Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators
Reduce release risk through progressive delivery strategies (blue/green, canary, spot instance rolling updates)

Observability & Incident Response

Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch)
Lead incident management and postmortems to prevent recurrence
Provide real-time visibility into system health, performance, and cost metrics

Security & Compliance

Implement least‑privilege IAM policies, secure service‑to‑service communication, and network ACLs/firewalls
Enforce Kubernetes RBAC, secret management, and a secure image supply chain
Participate in audit readiness and compliance efforts

Performance & Cost Optimization

Analyze and tune system performance under scale (CPU/memory/IO)
Partner with product and platform teams to right‑size clusters, databases, and storage tiers
Introduce cost visibility dashboards for engineering leadership

Preferred Qualifications

Experience managing mission‑critical systems at scale (high traffic, multi‑region)
Proven cost optimization in cloud/K8s environments
Familiarity with service mesh (Istio, Linkerd) or advanced networking/egress control
Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required
Strong communication and teamwork skills, with the ability to collaborate across engineering, DevOps, security, and product teams

Requirements

8+ years in SRE / DevOps / Infrastructure Engineering roles
Deep Kubernetes expertise (multi‑cluster, Helm chart development, advanced networking)
Strong GitOps workflows using ArgoCD/Flux
Expertise with AWS (preferred) or Azure/GCP, plus Infrastructure‑as‑Code (Terraform, Pulumi, CloudFormation)
Advanced knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis)
Scripting/automation skills in Python, Bash, or Go
Solid background in monitoring/observability (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics)
Experience with CI/CD at scale and managing production incidents
Experience with streaming/messaging (Kafka, RabbitMQ, or similar)
