Production Support Cloud Engineer

Khonology (Pty) Ltd

Rosebank

On-site

ZAR 500 000 - 700 000

Full time

10 days ago

Job summary

A digital services company in Rosebank is looking for two Production Support Cloud Engineers. The role involves providing L2/L3 support for AWS platform components, monitoring workloads, and ensuring operational continuity. The ideal candidates will have 3–5 years of experience in cloud-native environments, a solid understanding of AWS services, and experience with automation scripting. This role offers an opportunity to collaborate with various teams to enhance operational efficiency.

Qualifications

  • 3–5 years of production support or site reliability experience in cloud-native environments.
  • Solid understanding of AWS services including EKS, RDS, CloudWatch, IAM, S3, and Lambda.
  • Experience with Kubernetes, Helm, GitOps, and CI/CD pipelines.

Responsibilities

  • Provide L2/L3 production support for platform components running on AWS.
  • Monitor workloads and troubleshoot incidents, coordinating resolution with platform and development teams.
  • Maintain operational dashboards and alerts, ensuring uptime and SLO compliance.

Skills

Production support experience
Site reliability engineering
AWS EKS
AWS RDS
CloudWatch
S3
Kubernetes
CI/CD pipelines
TypeScript
Python
Bash scripting
Go
Grafana
Incident management practices

Tools

Argo CD
GitHub Actions
Helm
Prometheus
Loki

Job description

Khonology is a digital services company focused on software development, application support, data analytics, and engineering.

We are looking for two Production Support Cloud Engineers to ensure the stability, performance, and operational continuity of a platform and its hosted workloads during the November – January period, coinciding with the year-end change freeze and early-year restart window.

The platform is an internal developer platform designed to accelerate secure, compliant, and scalable application delivery on AWS. It provides teams with self-service onboarding, reusable golden paths, runtime patterns, and built-in FinOps and observability guardrails.

Key Responsibilities

Provide L2/L3 production support for platform components running on Amazon EKS, RDS, Lambda, S3, and Cloudflare.

Monitor workloads, troubleshoot incidents, and coordinate resolution with platform and development teams.

Manage and triage service requests, incident queues, and change controls within ITSM workflows.

Maintain operational dashboards and Grafana/CloudWatch alerts, ensuring uptime and SLO compliance.

Execute post-incident root cause analyses (RCAs) and document permanent fixes in runbooks.

Support deployment automation and GitOps processes (Argo CD, GitHub Actions, Helm).

Validate compliance of services with security, reliability, and cost optimisation standards.

Collaborate with platform engineers to automate recurring tasks and improve operational efficiency.

Ensure backup verification, log retention, and audit readiness for all managed components.

Required Skills & Experience

3–5 years of production support or site reliability experience in cloud-native environments.

Solid understanding of AWS EKS, RDS, CloudWatch, IAM, S3, and Lambda.

Experience with Kubernetes, Helm, GitOps, and CI/CD pipelines.

Competence in TypeScript, Python, Bash, or Go for automation scripting.

Familiarity with Grafana, Prometheus, Loki, and incident management practices (ITIL).

Strong communication skills and the ability to collaborate across platform, security, and development teams.
