Site Reliability Engineer (SRE)

Astra Tech

Abu Dhabi

On-site

AED 60,000 - 100,000

Full time

3 days ago

Job summary

An innovative tech company is seeking a Site Reliability Engineer to enhance infrastructure reliability and efficiency. This role focuses on automating operational tasks, managing Kubernetes clusters, and implementing CI/CD pipelines. The ideal candidate will have a strong background in Shell scripting, Python or Go, and experience with Prometheus for monitoring. Join a forward-thinking organization committed to revolutionizing technology solutions and improving digital experiences globally. If you are passionate about operational excellence and eager to make a significant impact, this opportunity is perfect for you.

Qualifications

  • 5+ years in Operations & Maintenance, with 2+ years in Kubernetes administration.
  • Proficiency in Shell scripting and experience with Python or Go.

Responsibilities

  • Automate operational tasks using Shell scripting for efficiency.
  • Administer and optimize Kubernetes clusters for performance.

Skills

Shell Scripting
Python
Go
Kubernetes Administration
Prometheus
CI/CD Processes
Infrastructure as Code (IaC)
Problem Solving

Tools

Terraform/OpenTofu

Job description

About Us

Established in 2022, Astra Tech has rapidly expanded its influence by strategically acquiring and developing key platforms such as PayBy, Rizek, Quantix, and Botim. These acquisitions have culminated in the creation of the world's first Ultra App, Botim, which seamlessly integrates fintech, e-commerce, AI-powered tech solutions, and communication services into one intuitive and user-friendly experience. This powerful combination allows users to manage their finances, shop, and stay connected, all within a single, cohesive platform.

With over 150 million users across 155 countries, Astra Tech is more than just a tech company; it is a movement committed to enhancing lives through innovation. As a visionary leader in tech development and investment, our mission is clear: to revolutionize technology solutions for consumers and businesses, harnessing the power of AI to elevate digital experiences to unprecedented heights globally.

Role Summary

As a Site Reliability Engineer, you will be responsible for enhancing the reliability, scalability, and efficiency of our infrastructure and operations. You will automate routine tasks, optimize middleware components, manage Kubernetes clusters, and maintain robust monitoring systems using Prometheus. The role also involves contributing to CI/CD pipeline development, managing cloud resources with a focus on cost optimization, and driving improvements in operational processes through automation and proactive incident resolution.

Key Responsibilities

  • Automate routine operational tasks using Shell scripting, ensuring efficiency in log analysis, batch management, and system optimization.
  • Maintain and optimize middleware components supporting infrastructure operations, ensuring stability and performance.
  • Administer and optimize Kubernetes clusters, ensuring scalability, security, and performance.
  • Maintain and optimize monitoring and alerting systems based on Prometheus, ensuring high availability of services.
  • Contribute to the development of CI/CD pipelines.
  • Manage cloud resources efficiently, implementing cost optimization strategies to reduce cloud expenditure.
  • Improve operational processes, develop automation tools, troubleshoot incidents, and enhance system stability and reliability.

Key Requirements

  • Proficiency in Shell scripting for automating operational workflows and system management tasks.
  • Experience in Python or Go, preferably for system automation, tooling, or backend services.
  • At least 5 years of experience in Operations & Maintenance roles, including at least 2 years of hands-on Kubernetes administration with expertise in CSI, CNI, and managing production clusters of 20+ nodes.
  • Experience with Prometheus for monitoring and alerting in an enterprise environment.
  • Familiarity with CI/CD deployment processes, with knowledge of GitOps principles. Hands-on experience with GitOps is a plus.
  • Experience managing cloud platforms using Infrastructure as Code (IaC) tools like Terraform/OpenTofu. Azure experience is a plus.
  • Strong problem-solving skills, a proactive approach to troubleshooting, and a commitment to improving operational efficiency and system reliability.