Senior Site Reliability Engineer

Unifonic, Inc.

Lahore

Remote

PKR 2,000,000 - 2,750,000

Full time

Today

Job summary

A dynamic SaaS startup is seeking a Senior Site Reliability Engineer to ensure system reliability and enhance cloud infrastructure performance. You will drive continuous improvements across the distributed messaging platforms. Ideal candidates will have over 8 years of experience, strong skills in AWS and Kubernetes, and a passion for technology. Competitive salary and benefits offered.

Benefits

Competitive salary and bonus
Unifonic share scheme
30 holiday days after the first anniversary
Your Birthday off!
Work from anywhere up to 25 days per year
Paid leave for new parents
LinkedIn learning license

Qualifications

  • 8+ years of hands-on production experience in SRE, DevOps, or cloud engineering roles.
  • Strong expertise in AWS, OCI, and OpenStack environments.
  • Deep understanding of Kubernetes ecosystems (EKS, OKE, Rancher RKE2).
  • Proven experience with distributed messaging and caching systems.

Responsibilities

  • Owning the reliability, uptime, and scalability of critical production services.
  • Participating in the on-call rotation to respond to incidents and troubleshoot live production issues.
  • Building robust operational playbooks and improving MTTD and MTTR.
  • Automating operational tasks to minimize human intervention.

Skills

AWS
Kubernetes
Kafka
RabbitMQ
Redis
MySQL
PostgreSQL
Automation skills

Education

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field

Tools

Terraform
Helm
Jenkins

Job description

Overview

Proudly voted a Great Place to Work, we are a dynamic startup in the SaaS space that is revolutionizing the way businesses communicate. Our team is made up of 500 energetic and passionate Unifones who are dedicated to delivering the best possible experience to 5000+ customer-centric companies.

We pride ourselves on our fun and collaborative work environment, where creativity and new ideas are constantly encouraged. As shareholders in the business, we’re so much more than a group of passionate communicators. We are Unifones. Join our team and be a part of something big!

Meet the team!

Our Engineering team is responsible for designing, developing, and maintaining the systems and technologies that drive Unifonic’s solutions. We work closely with other departments to ensure our products and services meet the needs of our customers. If you are passionate about technology and are excited about working on cutting-edge communication and engagement solutions, we want you on our team.

Role

As a Senior Site Reliability Engineer you will be responsible for enhancing system reliability, scalability, and resilience. As part of our elite SRE team, you’ll drive continuous improvement across our cloud infrastructure and ensure the consistent high performance of our distributed messaging platforms.

Responsibilities
  • Production Operations and Incident Management: Owning the reliability, uptime, and scalability of critical production services.
  • Participating in the on-call rotation to respond to incidents, troubleshoot live production issues, and lead post-incident analysis.
  • Building robust operational playbooks and escalation paths, and improving Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
  • Ensuring operational excellence by proactively detecting and addressing reliability risks through SLO monitoring, chaos testing, and capacity planning.
  • Automating operational tasks to minimize human intervention.
  • Cloud Architecture & Management: Architecting, implementing, and managing infrastructure across AWS, OCI, and OpenStack environments.
  • Optimizing cloud resources to balance performance, security, and cost-efficiency.
  • Kubernetes & Container Orchestration: Managing Kubernetes clusters (EKS, OKE, Rancher RKE2), ensuring scalability, availability, and robust performance.
  • Deploying advanced containerization strategies and troubleshooting containerized workloads.
  • Messaging, Caching & Queuing Systems: Managing and optimizing high-performance messaging and caching systems including Kafka, RabbitMQ, and Redis; ensuring efficient, reliable message and data delivery.
  • Database Reliability Engineering: Managing and optimizing production-grade MySQL and PostgreSQL databases; ensuring high availability, performance tuning, backups, and recovery processes.
  • Disaster Recovery & Business Continuity: Leading the planning and execution of comprehensive disaster recovery strategies; developing and maintaining robust business continuity plans.
  • Monitoring, Observability & Incident Management: Implementing advanced observability solutions (Prometheus, Grafana, CloudWatch); defining, measuring, and enforcing Service Level Objectives (SLOs) and Service Level Indicators (SLIs); proactively identifying issues, minimizing downtime, and enhancing system transparency.
  • Automation, CI/CD, and Infrastructure-as-Code: Driving automation initiatives using Terraform, Helm, Jenkins, Tekton or GitLab CI/CD; streamlining deployment pipelines and reducing manual intervention through automation.
  • Security & Compliance: Integrating security best practices into infrastructure and application layers; performing regular audits to ensure compliance and a robust security posture.
  • Team Collaboration & Technical Leadership: Collaborating with cross-functional teams to foster SRE culture; mentoring junior engineers, enhancing team capabilities and promoting knowledge sharing.

What you’ll bring
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 8+ years of hands-on production experience in SRE, DevOps, or cloud engineering roles.
  • Strong expertise in AWS, OCI, and OpenStack environments.
  • Deep understanding of Kubernetes ecosystems (EKS, OKE, Rancher RKE2).
  • Proven experience with Kafka, RabbitMQ, Redis, and distributed messaging and caching systems.
  • Solid experience managing MySQL and PostgreSQL in production environments.
  • Expert-level scripting and automation skills (Python, Bash, Go).
  • Advanced proficiency with Helm, Terraform, and modern CI/CD toolchains.
  • Demonstrable experience with Linux system administration and troubleshooting.

Benefits
  • Competitive salary and bonus
  • Unifonic share scheme (we are all owners!)
  • 30 holiday days after the first anniversary
  • Your Birthday off!
  • Spend up to 25 days per year working from anywhere in the world!
  • Paid leave and assistance for new parents
  • LinkedIn learning license