Senior Site Reliability Engineer - ELK Expert

iVedha Inc.

Ontario

On-site

CAD 100,000 - 140,000

Full time

Yesterday

Job summary

Join a leading tech company as a Senior Site Reliability Engineer, where you will oversee large-scale observability infrastructure using ELK and cloud technologies. This highly impactful role requires extensive experience in SRE, driving automation and ensuring performance at scale. Be part of a dynamic team, working on cutting-edge projects in a collaborative environment to enhance system reliability and efficiency.

Benefits

Career Growth Opportunities
Competitive Compensation and Benefits
Exciting and Impactful Work

Qualifications

  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Engineering.
  • 4+ years with ELK stack (Elasticsearch, Logstash, Kibana).
  • Proficient in automation using Python, Go, or Bash.

Responsibilities

  • Design and optimize cloud infrastructure on Microsoft Azure.
  • Manage and scale ELK clusters handling multi-TB log volumes.
  • Implement security best practices and enhance reliability.

Skills

Site Reliability Engineering
ELK Expertise
Automation
Cloud-native Architectures
Collaboration Skills

Education

Bachelor’s or Master’s degree in Computer Science

Tools

GitHub Actions
Terraform
Ansible
Prometheus
Grafana
Azure Monitor
Kubernetes
Docker

Job description

Role Summary:

Are you a Senior Site Reliability Engineer (SRE) with deep ELK expertise, ready to take ownership of large-scale observability infrastructure?

We're looking for an SRE with 7+ years of experience, including 4+ years specializing in the ELK stack (Elasticsearch, Logstash, Kibana), to join our Platform Engineering Practice. In this role, you’ll design, manage, and scale ELK clusters ingesting 2–3+ TB/day, enhance reliability across distributed systems, and drive automation within Azure cloud environments. This is a high-impact engineering opportunity focused on performance, observability, and operational excellence at scale.

Why Join Us

  • Career Growth: Work alongside industry experts on cutting-edge cloud technologies
  • Competitive Compensation and Benefits: We recognize and reward top talent
  • Exciting, Impactful Work: Design and build scalable, resilient cloud environments
  • Strategic Platform Role: Contribute to the foundation of next-gen observability and reliability infrastructure

What You Will Do

  • Design and Optimize Cloud Infrastructure: Architect scalable, fault-tolerant systems on Microsoft Azure
  • Automate Everything: Use Terraform, Ansible, and GitHub Actions to streamline deployment and configuration
  • Ensure Reliability and Performance: Proactively monitor, troubleshoot, and resolve production issues using Prometheus, Grafana, and Azure Monitor
  • Enhance Security and Compliance: Implement security best practices across DevOps workflows
  • Collaborate and Innovate: Work closely with engineering, security, and operations teams to drive automation and efficiency
  • Manage and scale large ELK clusters handling log volumes of 2–3+ TB/day, ensuring high availability and performance
  • Optimize ELK architecture: Implement efficient index lifecycle policies, shard strategies, and hot-warm-cold tiered storage
  • Build and tune log pipelines: Scale Logstash and Beats pipelines across distributed environments
  • Support Kibana observability layers: Create dashboards, visualizations, and custom alerting frameworks (e.g., Watcher, ElastAlert)

What You Bring

  • 7+ years of experience in Site Reliability Engineering, DevOps, or Cloud Engineering
  • 4+ years of dedicated, hands-on experience with ELK (Elasticsearch, Logstash, Kibana)
  • Strong experience managing large-scale ELK clusters in production with heavy ingestion (multi-TB/day)
  • Deep knowledge of index tuning, shard allocation, ILM policies, and scaling ELK components
  • Expertise in GitHub Actions, Terraform, Ansible, and Infrastructure as Code (IaC)
  • Proficiency in Python, Go, or Bash for automation and scripting
  • Deep understanding of Kubernetes, Docker, and cloud-native architectures
  • Experience with observability tools such as Prometheus, Grafana, and Azure Monitor
  • Ability to work in a fast-paced, collaborative environment and solve complex operational issues

Education

  • Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field

Certifications (Nice to Have)

  • Microsoft Azure certifications: AZ-104, AZ-400