Position Title : Chaos Engineering Expert (Resiliency & Performance Engineer)
Location : Riyadh / Hybrid (On-prem + Off-shore)
Employment Type : Full-time
Reports to : Head of Infrastructure / SRE / Platform Engineering
Role Overview
We are seeking an experienced Chaos Engineering Expert to help strengthen the resiliency, performance, and security posture of our hybrid infrastructure. In this role, you will design, execute, and analyze chaos experiments across our on-premises servers, databases, and application services, and work collaboratively with our team to embed resilience into our systems. You will also deliver a maturity report with insights and improvement recommendations for system architecture, operational processes, and incident response, enabling us to anticipate and mitigate failures before they impact customers.
Responsibilities
- Design, plan, and implement chaos engineering experiments across all layers of our infrastructure (physical / virtual servers, network, storage, databases, applications, and services).
- Develop hypotheses (failure scenarios), define metrics, and create success criteria for experiments.
- Execute fault-injection / chaos tests in pre-production, staging, or controlled production environments, ensuring minimal risk to business operations.
- Monitor and instrument system behavior during experiments using observability tools.
- Analyze experiment results to identify vulnerabilities, failure modes, and weak points, and derive actionable recommendations.
- Collaborate with DevOps, SRE, DBA, security, network, business continuity management (BCM), and operations teams to remediate issues uncovered by experiments and to meet each system's recovery time objective (RTO).
- Integrate chaos experiments into the CI / CD pipeline or into release / reliability practices.
- Build a chaos framework suitable for our hybrid environment.
- Document all experiments, including design, configuration, execution details (drills), results, lessons learned, and corrective actions.
- Develop and maintain runbooks, playbooks, and operational procedures for resilience testing.
- Participate in post-incident reviews, injecting learnings from chaos experiments into incident response and root cause analysis.
Technical Skills & Experience
- 6+ years of experience in site reliability engineering (SRE), performance engineering, and infrastructure engineering.
- Proven track record of designing and executing fault injection, resilience testing, and chaos experiments.
- Deep understanding of on-premises infrastructure : physical and virtual servers, hypervisors, networking, storage.
- Experience with database systems (e.g., SQL, NoSQL) and how they fail / recover.
- Familiarity with application stacks, microservices, and distributed architectures.
- Proficiency in one or more languages used for automation or scripting (e.g., Python, Go, Java, or similar).
- Hands-on experience with tools such as Chaos Monkey, Gremlin, Chaos Mesh, LitmusChaos, Toxiproxy, AWS Fault Injection Simulator (FIS), Azure Chaos Studio, or similar.
- Strong skills in monitoring, metrics, logging, and tracing (e.g., OpenText SiteScope, Datadog).
- Experience integrating chaos testing into CI / CD pipelines and infrastructure-as-code workflows.
- Good understanding of security vulnerabilities and how fault injection might surface security risks.
- Familiarity with risk assessment, threat modeling, or security hardening practices.
- Ability to work across teams (DevOps, DBAs, Ops, Security) and communicate complex findings in a clear manner.
- Strong documentation skills, with a proven ability to write detailed experiment designs, results, remediations, and technical playbooks.
Analytical Skills
- Strong analytical mindset, capable of interpreting results, identifying root causes, and recommending mitigations.
Certifications & Education
Education : Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- Certification or formal training in chaos engineering (or resilience engineering) is required.
- Chaos Engineering Fundamentals certificate is preferred.
- Certifications in SRE or DevOps (e.g., GitLab, Azure, Google Cloud, Kubernetes) are beneficial.
- Security certifications are a plus, as security vulnerability testing is part of chaos experiments.