Senior Site Reliability Engineer - Remote

Kablamo Pty Ltd

Toronto

Hybrid

CAD 100,000 - 130,000

Full time

Yesterday

Job summary

A leading cloud digital product development company is seeking a Sr. Site Reliability Engineer in Toronto to support AWS infrastructure and ensure system reliability. The role involves proactive monitoring, incident response, and collaboration with development teams, allowing for professional growth within a dynamic and innovative environment. Join a diverse team committed to delivering cutting-edge solutions while promoting a culture of inclusivity and employee well-being.

Benefits

Remote first with downtown Toronto office
Work abroad for up to 3 weeks per year
Career growth opportunities
Paid birthday leave
Anniversary bonus
Referral bonus
Parental Leave top up
Employee Assistance Program
Swag

Qualifications

  • 5+ years' experience in an SRE or DevOps role.
  • Deep understanding of system architecture and design principles.
  • Experience with AWS and its services.

Responsibilities

  • Contribute to the design and maintenance of AWS infrastructure.
  • Actively respond to and resolve system incidents.
  • Develop automated solutions for operational tasks.

Skills

Critical thinking
Problem-solving
Troubleshooting
Proactivity
Cross-functional collaboration

Education

Bachelor’s degree in computer science or similar technical qualification

Tools

AWS CloudWatch
Datadog
Grafana
Prometheus
Jira Service Management

Job description

Kablamo is a fast-growing cloud digital product development company. Founded in 2017 in Australia, the business has grown quickly over the last several years, including the expansion of the team to Canada in 2021. We are proud to have assembled an amazing list of customers, including some of the best-known enterprise and government organizations in Australia and Canada. We’re looking to further accelerate our growth in both markets, and we’re seeking a Sr. Site Reliability Engineer to help us bring new products to market and support them.

Kablamo is proud to be an Advanced AWS Consulting partner, and we have recently been recognised as a global leader in designing and building cloud-based data and AI / ML solutions. At the 2021 AWS Global Public Sector conference, Kablamo won the award for “Most Innovative AI / ML Solution” for our work building bushfire prediction data platforms in Australia – we were selected from more than 1,800 AWS global partners.

The Role

As we expand our Product Care offering, we are looking for a Sr. Site Reliability Engineer (SRE) to help us build our capability and deliver insights from massive-scale data in real time. The Sr. SRE is responsible for developing automated solutions for operational aspects such as on-call monitoring, performance and capacity planning, and disaster response. The role will complement our ongoing development teams, with a focus on continuous delivery and infrastructure automation.

As the bridge between development and operations, you will be our primary escalation point across key customer accounts.

Key Responsibilities:

  • Contribute to the design, implementation, and maintenance of our AWS infrastructure
  • Be proactive in anticipating production issues. Assess risks and mitigate them, planning contingencies and countermeasures in advance
  • Restore systems to a steady state by quickly investigating and fixing performance, stability, and scalability issues, so that Kablamo meets its SLA and SLO requirements
  • Ensure that the underlying infrastructure runs smoothly and that systems and tools work as expected
  • Develop or implement visual tools for technical and business teams to observe system health, and support the Technical Account Manager in reporting on reliability metrics
  • Use automation tools to solve problems, writing code to automate processes such as analysing logs and testing production environments
  • Work with the engineering and/or development team to identify recurring problems that can be resolved through automation
  • Enhance the performance, efficiency, and monitoring of software development processes
  • Act on system incidents: as the SRE, you are a key contact in incident response and resolution, including active collaboration in any PIRs / post-mortems
  • Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. Work actively with the development team to define fields for logging and tracing
  • Be a voice that advocates for reliability against competing priorities
  • Help prepare activities for production release, including facilitating training and enablement of client technical teams and/or attending appropriate meetings (Technical Working Groups, Architecture Review Boards, Change Advisory Boards)

Required skills and experience:

  • 5+ years’ experience in an SRE or DevOps role
  • Deep understanding of system architecture and design principles
  • Ability to think critically, solve problems, and perform well under pressure
  • Troubleshooting experience, with the ability to communicate clearly with customers or the engineering team
  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
  • Experience with AWS and its services (Serverless, Deployment Tools, Networking, Containerization, Security, Cost Management)
  • Familiarity with tools such as AWS CloudWatch, Datadog, Grafana, Prometheus, Scalyr, PagerDuty, OpsGenie, Jira Service Management
  • Ability to work cross-functionally with support engineering, development teams, and/or client vendors to deliver sound outcomes and suggest system improvements
  • Understanding of security requirements and their implications, and the ability to conform to applicable security frameworks
  • An in-depth knowledge of version control
  • Experience with production rollback
  • Knowledge of fundamental network concepts and protocols
  • A good understanding of DevOps concepts and best practices including Infrastructure-as-Code

Bonus Points for:

  • Bachelor’s degree in computer science or other similar technical qualification
  • AWS Associate and / or Professional Level Certifications
  • Strong grasp of networking, security, and reliability fundamentals
  • Solid understanding of Agile methodologies and practices
  • Lead SRE experience

Hiring Process:

  • 30-min intro chat with our TA team
  • 1-hr Technical interview
  • 1-hr Final Interview
  • References
  • Offer!

Why Work at Kablamo?

Our Culture

We recognise that a diverse and inclusive workplace enables greater innovation and produces benefits including improved performance, improved employee happiness and wellbeing, and superior outcomes for our customers. We attribute our success to all our unique and charismatic Kablamites. Through our fortnightly back to base and our debate Thunderdomes, we enable our Kablamites to provide feedback, share ideas, challenge the status quo, and constructively challenge each other on technical matters.

The PERKS!!!

  • Remote first with a downtown Toronto office available
  • Work abroad for up to 3 weeks per year (some restrictions apply)
  • Career growth (we really do promote from within!)
  • Online rewards platform
  • Paid birthday leave
  • Anniversary bonus
  • Referral bonus
  • Parental Leave top up
  • Employee Assistance Program
  • Swag

Kablamo is a proud equal opportunity employer. We make our hiring decisions solely based on your skills and experience, as well as the perspectives and value you can bring to our team. Kablamo believes that diversity is vital to provide the best service to our clients and we are committed to fostering a varied and inclusive work environment. Every effort to accommodate candidates for accessibility will be made upon request. Information received related to accommodations will be addressed confidentially.

Kablamo would like to thank all candidates for their interest; however, only qualified applicants will be shortlisted.

Company Overview

Are you interested in joining one of Australia’s best cloud product development companies? Our team uses cutting-edge cloud technology to design and build digital products and data platforms that deliver transformational change. We’re helping our customers to build digital solutions to manage bushfire risk, perform genomics research on deadly diseases, launch new fintech, deliver millions of hours of media content to viewers, rethink welfare programs for disadvantaged communities, and much more. At the 2021 AWS Global Public Sector conference, Kablamo won the global award for “Most Innovative AI / ML Solution” – we were selected from more than 1,800 AWS global partners! The AWS award was for Kablamo’s work with Victoria’s Department of Environment, Land, Water & Planning to help them predict and manage bushfire risk for the State of Victoria.
