Senior Site Reliability Engineer (Observability)

GuruLink

Richmond Hill

On-site

CAD 90,000 - 120,000

Full time

Today

Job summary

A technology recruitment firm is seeking a Site Reliability Engineer in Richmond Hill, Ontario. The ideal candidate will have over 5 years of experience with cloud infrastructure and strong skills in Python, Kubernetes, and AWS services. You will lead incident response and develop operational tools to improve system efficiency. This full-time role includes an on-call rotation and offers competitive compensation.

Qualifications

5+ years of SRE, DevOps, or Cloud Engineering experience.
Expert troubleshooting and analytical skills.
Deep knowledge of Linux fundamentals and networking (TCP/IP).

Responsibilities

Lead incident response and perform Root Cause Analysis.
Define and maintain robust observability solutions.
Design operational tooling to automate tasks.

Skills

Python scripting

Docker

Kubernetes

AWS services

Root Cause Analysis

Education

Bachelor’s degree in Computer Science or related field

Tools

Terraform

Ansible

Prometheus

Location: Richmond Hill, Ontario

About the Team

Our client’s platform engineering group operates with a Site Reliability Engineering (SRE) mindset, committed to delivering highly reliable, scalable, and performant systems across a public cloud infrastructure. The team specializes in enhancing system transparency, enabling deep diagnostics, and ensuring seamless collaboration between development and operations. Shared ownership, proactive problem‑solving, and continuous improvement are at the core of everything they do.

The Opportunity

As a Site Reliability Engineer, you will design, develop, deploy, manage, and support public cloud infrastructure. You should have experience designing highly available, fault‑tolerant, cloud‑native enterprise solutions, along with some development background and familiarity with Kubernetes. The role calls for experience working with development teams throughout the full development lifecycle to produce reliable, secure production infrastructure and to operate across multiple SDLC environments.

What You’ll Be Doing

Lead incident response and perform Root Cause Analysis (RCA) to prevent recurrence and improve system resilience.
Define, build, and maintain robust observability solutions (monitoring, metrics, logging, and alerting) for infrastructure and applications.
Design and develop operational tooling to automate repetitive tasks and improve system efficiency.
Develop and maintain Infrastructure as Code (IaC) for Kubernetes cluster management and AWS resource provisioning.
Maintain and evolve existing infrastructure and automation codebases (IaC).
Interface with and support development teams to migrate on‑premises solutions to the public cloud.

Must‑Haves

Bachelor’s degree in Computer Science or related field.
5+ years of SRE, DevOps, or Cloud Engineering experience.
Strong proficiency in Python for scripting and tooling, with additional experience in either Node.js or Java.
Expert troubleshooting and analytical skills with a proven ability to conduct Root Cause Analysis (RCA).
Hands‑on experience with containerization (Docker) and orchestration (Kubernetes).
Deep knowledge of Linux fundamentals, networking (TCP/IP), and core OS concepts.
Experience with Infrastructure as Code (IaC) tools such as Terraform, SaltStack, or Ansible.
Experience with AWS services (Compute, Storage, Networking).
Proven experience with metric‑based monitoring tools (e.g., Prometheus) and alerting systems.
Proficiency with web servers such as Nginx, with a solid understanding of how web servers work.
Ability to read, write, and debug production‑level code to trace complex application flows.

Nice‑to‑Haves

Experience with Elasticsearch and Application Performance Monitoring (APM) tools.
Experience with ArgoCD and advanced CI/CD pipelines.
Experience with large‑scale, multi‑region cloud projects.
AWS Associate Certification or higher.
Detailed knowledge of AWS services: EC2, S3, VPC, ELB/NLB/ALB, Lambda, and CloudWatch.
Experience with Cloudflare or equivalent Content Delivery Network (CDN) solutions.

This is a full‑time position. Regular hours are Monday through Friday, during normal business hours. The position also participates in an on‑call rotation of 2 weeks as primary and 2 weeks as secondary, providing 24/7 support for the platform during those weeks. Typically, this amounts to 4 out of every 8 weeks on call.
