Chaos Engineering Expert

AINS Group

Riyadh

Hybrid

SAR 150,000 - 200,000

Full time

Today

Job summary

A leading tech firm in Riyadh seeks a Chaos Engineering Expert to enhance the resiliency and performance of its hybrid infrastructure. In this role, you will design and execute chaos experiments, analyze results, and recommend improvements. The ideal candidate should have over 6 years of relevant experience and a Bachelor's in computer science or a related field. Certifications in chaos engineering and familiarity with tools like Chaos Monkey are essential. This full-time position offers a hybrid work model.

Qualifications

6+ years of experience in site reliability engineering, performance engineering, and infrastructure engineering.
Proven experience in designing and executing fault injection and chaos experiments.
A degree in computer science or a related technical field is required.

Responsibilities

Design and implement chaos engineering experiments across infrastructure layers.
Monitor system behavior during chaos tests and analyze results.
Collaborate with cross-functional teams to remediate issues identified during tests.

Skills

Site reliability engineering

Performance engineering

Infrastructure engineering

Fault injection

Resilience testing

Python

Monitoring

Documentation

Education

Bachelor’s degree in computer science

Chaos Engineering Fundamentals certificate

Tools

Chaos Monkey

Gremlin

Datadog

Position Title: Chaos Engineering Expert (Resiliency & Performance Engineer)

Location: Riyadh / Hybrid (On-prem + Off-shore)

Employment Type: Full-time

Reports to: Head of Infrastructure / SRE / Platform Engineering

Role Overview

We are seeking an experienced Chaos Engineering Expert to help strengthen the resiliency, performance, and security posture of our hybrid infrastructure. In this role, you will design, execute, and analyze chaos experiments across our on-premises servers, databases, and application services, and work collaboratively with our team to embed resilience into our systems. You will also deliver an insight maturity report with recommended improvements to system architecture, operational processes, and incident response, enabling us to anticipate and mitigate failures before they impact customers.

Responsibilities

Design, plan, and implement chaos engineering experiments across all layers of our infrastructure (physical / virtual servers, network, storage, databases, applications, and services).
Develop hypotheses (failure scenarios), define metrics, and create success criteria for experiments.
Execute fault-injection and chaos tests in pre-production, staging, or controlled production environments, ensuring minimal risk to business operations.
Monitor and instrument system behavior during experiments using observability tools.
Analyze experiment results to identify vulnerabilities, failure modes, and weak points, and derive actionable recommendations.
Collaborate with DevOps, SRE, DBA, security, network, business continuity management (BCM), and operations teams to remediate issues uncovered by experiments and meet system recovery time objectives (RTOs).
Integrate chaos experiments into the CI/CD pipeline or into release and reliability practices.
Build a chaos framework suitable for our hybrid environment.
Document all experiments, including design, configuration, execution details (drills), results, lessons learned, and corrective actions.
Develop and maintain runbooks, playbooks, and operational procedures for resilience testing.
Participate in post-incident reviews, injecting learnings from chaos experiments into incident response and root cause analysis.

Technical Skills & Experience

6+ years of experience in site reliability engineering (SRE), performance engineering, and infrastructure engineering.
Proven track record of designing and executing fault injection, resilience testing, and chaos experiments.
Deep understanding of on-premises infrastructure: physical and virtual servers, hypervisors, networking, and storage.
Experience with database systems (e.g., SQL, NoSQL) and how they fail and recover.
Familiarity with application stacks, microservices, and distributed architectures.
Proficiency in one or more languages used for automation or scripting (e.g., Python, Go, Java, or similar).
Hands-on experience with tools such as Chaos Monkey, Gremlin, Chaos Mesh, Litmus Chaos, Toxiproxy, AWS Fault Injection Simulator (FIS), Azure Chaos Studio, or similar.
Strong skills in monitoring, metrics, logging, and tracing (e.g., OpenText SiteScope, Datadog).
Experience integrating chaos testing into CI/CD pipelines and infrastructure-as-code workflows.
Good understanding of security vulnerabilities and how fault injection might surface security risks.
Familiarity with risk assessment, threat modeling, or security hardening practices.
Ability to work across teams (DevOps, DBAs, Ops, Security) and communicate complex findings in a clear manner.
Strong documentation skills, with a proven ability to write detailed experiment designs, results, remediations, and technical playbooks.

Analytical Skills

Strong analytical mindset, capable of interpreting results, identifying root causes, and recommending mitigations.

Certifications & Education

Education: Bachelor’s degree in computer science, engineering, or a related technical field.

Certifications or formal training in chaos engineering (or resilience engineering) are a must.
A Chaos Engineering Fundamentals certificate is preferred.
Certifications in SRE or DevOps (e.g., GitLab, Azure, Google Cloud, Kubernetes) are beneficial.
Security certifications are a plus, as security vulnerability testing is part of chaos experiments.
