Senior Site Reliability Engineer (Cloud Native & Observability)

Mission Consultancy Services Sdn. Bhd.

Putrajaya

On-site

MYR 120,000 - 150,000

Full time

4 days ago

Job summary

A leading consultancy firm in Malaysia seeks a Senior Site Reliability Engineer to create resilient, scalable multi-cloud systems. Responsibilities include managing Kubernetes clusters, designing CI/CD pipelines, and ensuring system observability. Candidates should have a strong DevOps background with experience in monitoring tools and scripting. This role offers a competitive salary and growth opportunities.

Qualifications

  • Minimum 8 years of experience in DevOps/SRE, with at least 5 of those years in SRE roles.
  • Familiarity with Terraform, Ansible, or similar automation tools.
  • Experience with AWS, Azure, or GCP.

Responsibilities

  • Architect and maintain multi-region Kubernetes clusters.
  • Implement full-stack observability using OpenTelemetry and Grafana.
  • Design and manage CI/CD pipelines with a GitOps approach.

Skills

Infrastructure as code
Kubernetes ecosystem stability
Monitoring tools (Prometheus, Grafana)
CI/CD pipelines
Scripting (Python, Bash, Go)

Education

Bachelor's in Computer Science or Engineering

Tools

Terraform
GitHub Actions
Helm
Prometheus
Grafana

Job description

Role Overview:

Responsible for highly resilient, scalable, and cost-optimized systems in multi-cloud environments. The role focuses on infrastructure as code, observability, chaos engineering, and Kubernetes ecosystem stability across distributed systems used by millions of users.

  • Design and implement scalable and reliable systems.
  • Monitor system health using tools like Prometheus, Grafana, or Datadog.
  • Manage CI/CD pipelines and infrastructure automation (e.g., Jenkins, GitHub Actions).
  • Troubleshoot incidents and ensure root cause analysis is completed.
  • Work with DevOps and development teams to improve system performance.
  • Build tooling, including infrastructure as code (IaC), to automate operations and reduce manual intervention.


Key Responsibilities:

  • Architect and maintain multi-region Kubernetes clusters (AKS/EKS/GKE) with Istio/Linkerd service mesh.
  • Implement full-stack observability using OpenTelemetry, Grafana Loki, and Jaeger.
  • Build self-healing infrastructure with tools like KEDA, Argo CD, and Crossplane.
  • Design and manage CI/CD pipelines with a GitOps approach (FluxCD/Argo CD).
  • Conduct chaos testing using Gremlin or LitmusChaos to validate system resilience.
  • Work with finance and ops teams on FinOps strategies for optimizing cloud usage (Spot instances, autoscaling policies).
  • Implement policy-as-code for security compliance via OPA/Gatekeeper.


Technology Stack:

  • Languages: Go, Python, Bash
  • Cloud: AWS, Azure, GCP
  • IaC Tools: Terraform, Helm, Pulumi
  • Observability: Prometheus, Grafana, ELK, New Relic
  • Certifications Preferred: CKA, CKAD, Terraform Associate, Google SRE, FinOps Certified Practitioner


Requirements:

  • Minimum 8 years of experience in DevOps/SRE, with at least 5 years specifically in Site Reliability Engineering roles.
  • Bachelor's in Computer Science, Engineering, or a related field.
  • Familiarity with Terraform, Ansible, or other automation tools.
  • Experience with public cloud (AWS, Azure, or GCP).
  • Strong scripting skills (Python, Bash, or Go).