Head of Site Reliability Engineering

Shakudo

Toronto

On-site

CAD 120,000 - 160,000

Full time

22 days ago

Job summary

Shakudo is seeking a Head of Site Reliability Engineering to enhance the reliability and performance of its innovative data and AI platform. The ideal candidate will lead the SRE function, ensuring high availability and resilience while mentoring a high-performance team. This role involves cross-functional collaboration to architect scalable infrastructure and enforce best practices in incident response and CI/CD.

Qualifications

  • 8+ years in infrastructure, DevOps, or SRE roles.
  • Proven experience scaling distributed systems.

Responsibilities

  • Build and lead the SRE function, setting goals and technical direction.
  • Own uptime, reliability, and incident response for the platform.

Skills

Leadership
Communication
Collaboration

Tools

Kubernetes
Terraform
Prometheus
Grafana
Datadog

Job description

About the Job & Shakudo

At Shakudo, we are building the world’s first operating system for data and AI. We use the term operating system in the truest sense of the word. Like iOS, Windows and Linux, Shakudo’s end-to-end OS offers ever-evolving, automatically operated, best-of-breed open-source components tailored to each business's unique needs.

The Role

We are hiring a Head of Site Reliability Engineering to lead the reliability, availability, and performance strategy of our platform. This role is ideal for someone who thrives on solving infrastructure challenges, scaling cloud-native systems, and building high-performance teams. You will work cross-functionally with engineering, product, and customer success to make Shakudo’s platform rock-solid and resilient for our customers around the world.


What You’ll Do

  • Build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture
  • Own uptime, reliability, and incident response for our platform
  • Architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks
  • Lead the design of observability, monitoring, and alerting systems to proactively detect and prevent issues
  • Create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs)
  • Partner closely with engineering and product to ensure new features are reliable and production-ready
  • Mentor engineers and help instill a culture of operational excellence

What We're Looking For

  • 8+ years of experience in infrastructure, DevOps, or SRE roles with increasing responsibility
  • Proven experience scaling distributed systems in a high-availability, production environment
  • Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred)
  • Strong knowledge of system design, networking, and reliability principles
  • Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and incident response practices
  • Strong leadership and communication skills, with a hands-on, collaborative approach

Nice to Have

  • Experience supporting data pipelines, ML workloads, or complex orchestration systems
  • Familiarity with the data/ML tooling ecosystem (e.g., Airflow, dbt, Spark, Dremio)
  • Previous experience in a startup or high-growth environment

Shakudo is an equal opportunity employer and encourages candidates of all backgrounds to apply. We foster diversity and inclusivity and welcome applications from a broad range of backgrounds and experiences.
