Senior Site Reliability Engineer - Remote

Kablamo Pty Ltd

Toronto

Hybrid

CAD 100,000 - 130,000

Full time

Yesterday

Job summary

A leading cloud digital product development company is seeking a Sr. Site Reliability Engineer in Toronto to support AWS infrastructure and ensure system reliability. The role involves proactive monitoring, incident response, and collaboration with development teams, allowing for professional growth within a dynamic and innovative environment. Join a diverse team committed to delivering cutting-edge solutions while promoting a culture of inclusivity and employee well-being.

Benefits

Remote first with downtown Toronto office

Work abroad for up to 3 weeks per year

Career growth opportunities

Paid birthday leave

Anniversary bonus

Referral bonus

Parental Leave top up

Employee Assistance Program

Swag

Qualifications

5+ years' experience in an SRE or DevOps role.
Deep understanding of system architecture and design principles.
Experience with AWS and its services.

Responsibilities

Contribute to the design and maintenance of AWS infrastructure.
Actively respond to and resolve system incidents.
Develop automated solutions for operational tasks.

Skills

Critical thinking

Problem-solving

Troubleshooting

Proactivity

Cross-functional collaboration

Education

Bachelor’s degree in computer science or similar technical qualification

Tools

AWS CloudWatch

Datadog

Grafana

Prometheus

Jira Service Management

Kablamo is a fast-growing cloud digital product development company. Founded in 2017 in Australia, the business has grown quickly over the last several years, including the expansion of the team to Canada in 2021. We are proud to have assembled an amazing list of customers, including some of the best-known enterprise and government organizations in Australia and Canada. We’re looking to further accelerate our growth in both markets, and we’re seeking a Sr. Site Reliability Engineer to help us bring new products to market.

Kablamo is proud to be an Advanced AWS Consulting partner, and we have recently been recognised as a global leader in designing and building cloud-based data and AI / ML solutions. At the 2021 AWS Global Public Sector conference, Kablamo won the award for “Most Innovative AI / ML Solution” for our work building bushfire prediction data platforms in Australia – we were selected from more than 1,800 AWS global partners.

The Role

As we expand the capability across our Product Care offering, we are looking for a Sr. Site Reliability Engineer (SRE) to help us build our capability and deliver insights from massive scale data in real time. The Sr. SRE role is responsible for developing automated solutions for operational aspects such as on-call monitoring, performance and capacity planning, and disaster response. The role will complement our ongoing development teams, looking at continuous delivery and infrastructure automation.

As the bridge between development and operations, you will be our primary escalation point across key customer accounts.

Key Responsibilities:

Contribute to the design, implementation, and maintenance of our AWS infrastructure
Proactively anticipate production issues: assess risks and mitigate them, planning contingencies and countermeasures in advance
Restore systems to a steady state by quickly investigating and fixing performance, stability, and scalability issues, so that Kablamo meets its SLA and SLO requirements
Ensure that the underlying infrastructure runs smoothly and that systems and tools work as expected
Develop or implement visual tools that let technical and business teams observe system health, and support the Technical Account Manager in reporting on reliability metrics
Use automation tools to solve problems, writing code to automate processes such as analysing logs and testing production environments
Work with the engineering and / or development team to identify recurring problems that can be resolved through automation
Enhance the performance, efficiency, and monitoring of software development processes
Act on system incidents: as the SRE, you are a key contact in incident response and resolution, including active collaboration in any PIRs / post-mortems
Collaborate closely with product developers to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability, and to define fields for logging and tracing
Advocate for reliability against competing priorities
Help prepare activities for production release, including facilitating training and enablement of client technical teams and / or attending appropriate meetings (Technical Working Groups, Architecture Review Boards, Change Advisory Boards)

Required skills and experience:

5+ years’ experience in an SRE or DevOps role
Deep understanding of system architecture and design principles
Ability to think critically, solve problems, and perform well under pressure
Troubleshooting experience, with the ability to communicate clearly with customers or the engineering team
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
Experience with AWS and its services (Serverless, Deployment Tools, Networking, Containerization, Security, Cost Management)
Familiarity with tools such as AWS CloudWatch, Datadog, Grafana, Prometheus, Scalyr, PagerDuty, OpsGenie, Jira Service Management
Ability to work cross-functionally with support engineering, development teams and / or client vendors to deliver sound outcomes and suggest system improvements
Understanding of security requirements and their implications, and the ability to conform to applicable security frameworks
An in-depth knowledge of version control
Experience with production rollback
Knowledge of fundamental network concepts and protocols
A good understanding of DevOps concepts and best practices including Infrastructure-as-Code

Bonus Points for:

Bachelor’s degree in computer science or other similar technical qualification
AWS Associate and / or Professional Level Certifications
Strong grasp of networking, security, and reliability fundamentals
Solid understanding of Agile methodologies and practices
Experience as a Lead SRE

Hiring Process:

30-min intro chat with our TA team
1-hr Technical interview
1-hr Final Interview
References
Offer!

Why Work at Kablamo?

Our Culture

We recognise that a diverse and inclusive workplace enables greater innovation and produces benefits including improved performance, improved employee happiness and wellbeing, and superior outcomes for our customers. We attribute our success to all our unique and charismatic Kablamites. Through our fortnightly back to base and our debate Thunderdomes, we enable our Kablamites to provide feedback, share ideas, challenge the status quo, and challenge each other technically and constructively.

The PERKS!!!

Remote first with a downtown Toronto office available
Work abroad for up to 3 weeks per year (some restrictions apply)
Career growth (we really do promote from within!)
Online rewards platform
Paid birthday leave
Anniversary bonus
Referral bonus
Parental Leave top up
Employee Assistance Program
Swag

Kablamo is a proud equal opportunity employer. We make our hiring decisions solely based on your skills and experience, as well as the perspectives and value you can bring to our team. Kablamo believes that diversity is vital to provide the best service to our clients and we are committed to fostering a varied and inclusive work environment. Every effort to accommodate candidates for accessibility will be made upon request. Information received related to accommodations will be addressed confidentially.

Kablamo would like to thank all candidates for their interest; however, only qualified applicants will be shortlisted.

Company Overview

Are you interested in joining one of Australia’s best cloud product development companies? Our team uses cutting-edge cloud technology to design and build digital products and data platforms that deliver transformational change. We’re helping our customers to build digital solutions to manage bushfire risk, perform genomics research on deadly diseases, launch new fintech, deliver millions of hours of media content to viewers, rethink welfare programs for disadvantaged communities, and much more. At the 2021 AWS Global Public Sector conference, Kablamo won the global award for “Most Innovative AI / ML Solution” – we were selected from more than 1,800 AWS global partners! The AWS award was for Kablamo’s work with Victoria’s Department of Environment, Land, Water & Planning to help them predict and manage bushfire risk for the State of Victoria.
