Site Reliability Engineer

Upsun

Canada

Remote

CAD 80,000 - 120,000

Full time

4 days ago

Job summary

A leading tech company is seeking a Site Reliability Engineer to enhance the reliability and scalability of their cloud-based systems. This role involves transitioning traditional operations to a more automated SRE approach, focusing on improving infrastructure and streamlining processes. The ideal candidate will have a solid understanding of DevOps principles and hands-on experience with cloud platforms and automation tools. Participate in a vibrant remote team dedicated to innovation and collaboration.

Benefits

Flexible PTO

Company stock options

Professional development budget

Internet reimbursement

Remote work travel program

Qualifications

Experience with DevOps, Cloud Operations, or SRE.
Proficient in scripting languages like Python or Bash.
Hands-on experience with cloud platforms like AWS, GCP, or Azure.

Responsibilities

Refine monitoring with tools like Prometheus and Grafana.
Automate deployments using IaC tools like Terraform.
Collaborate with cross-functional teams on reliability practices.

Skills

DevOps principles

Cloud Operations

Scripting skills

Problem-solving

Education

Degree in Computer Science or related field

Tools

Docker

Kubernetes

AWS

Grafana

Terraform

Platform.sh is a Platform-as-a-Service (PaaS) that removes the complexities of cloud infrastructure management and optimizes development-to-production workflows, reducing the time it takes to build and deploy applications. It delivers efficiency, reliability, and security, giving development teams both control and peace of mind. Built for developers, by developers.

Adopted and loved by 16,000+ developers and 7,000 customers, Platform.sh has for nearly a decade provided innovative capabilities that serve as the launchpad for creative development teams’ out-of-the-box thinking.

We provide 24x7 support, managed cloud infrastructure, and automated security and compliance with an all-in-one PaaS. We give our customers complete control over their data by keeping applications secure and available around the clock.

Platformers are a remote, global workforce, and we thrive in a multicultural team. We are committed to open source and an open, welcoming environment. Our team spans the globe and the experience spectrum. What's our commonality, our cultural fabric? A curious spirit and a thirst for knowledge; an eagerness for innovative ideas and cultures. We believe we can build anything together in an environment that frees you to do your best work.

Bring your expertise and enthusiasm to our growing, global organization. Your contributions, collaboration, and unique point of view are recognized and valued here.

Impact of a Site Reliability Engineer

As a Site Reliability Engineer, you are a key part of our team’s transition to the Site Reliability Engineering (SRE) model, moving from traditional Cloud Operations to an automation-driven approach. This shift enhances system reliability, scalability, and efficiency, positioning SRE as a core function within the company.

Moreover, in this role, you focus on improving infrastructure, automating operational tasks, and streamlining processes. You work closely with developers, engineers, and product teams to ensure reliability is embedded throughout the application lifecycle.

As part of this transition, you also help optimize cloud-based systems, reduce manual work, and drive continuous improvements, playing a vital role in the organization’s overall success and long-term stability.

What to expect

Refine Monitoring and Observability: Enhance system monitoring with tools like Prometheus, Grafana, and ELK Stack, ensuring visibility and alignment with business objectives.
Automate Deployments and Workflows: Transition manual processes to automated solutions using IaC tools (e.g., Terraform, Ansible) to streamline deployments and improve operational efficiency.
Optimize CI/CD Pipelines: Improve pipeline architecture for fast, reliable releases, ensuring scalability and resilience to handle high volumes of changes.
Cloud Infrastructure Management: Help scale cloud-based systems on platforms like AWS, GCP, and Azure while minimizing technical debt and operational complexity.
Incident Response and Post-Mortem: Support incident management and lead post-mortem analysis, ensuring continuous improvement and knowledge sharing.
Collaborate with Cross-Functional Teams: Work closely with engineering and product teams to integrate reliability practices into the development lifecycle and prioritize reliability efforts.
Drive Technical Innovation: Introduce and champion new tools, technologies, and practices that improve system reliability, performance, and scalability.

What you bring

DevOps, Cloud Operations, or SRE Expertise: A solid understanding of DevOps, Cloud Operations, or SRE principles, with a focus on reliability and scalability.
Advanced Linux Internals Expertise: Hands-on experience with Linux systems, including performance tuning, kernel configurations, and troubleshooting.
Programming Languages: Proficiency in programming languages such as Go (preferred) or Python, with a focus on building tools and automating processes.
Scripting Skills: Strong skills in scripting languages like Python, Bash, or Go to automate workflows, streamline tasks, and manage infrastructure.
Cloud Infrastructure Knowledge: Extensive experience with cloud platforms like AWS, GCP, and Azure, along with expertise in monitoring/logging frameworks and CI/CD pipelines.
Containerization and Orchestration: Hands-on experience with Docker, Kubernetes, and other containerization technologies for building and deploying scalable applications is a nice-to-have.
Problem-Solving and Collaboration: Strong problem-solving skills, system design experience, and the ability to collaborate effectively across teams.

Where we hire

At Platform.sh, remote work isn't just a trend - it's our foundation. The freedom of remote work with the support of a diverse, global team has been our successful model for nearly a decade. Our culture celebrates flexibility and collaboration, and while we have team members in over 30 countries around the globe, we are currently focused on hiring for this role in the following location:

Canada - Please note that this position requires participation in an on-call rotation aligned with the Pacific Time Zone (PT). Although we’re unable to provide visa sponsorship at this time, we welcome applications from all qualified candidates who are legally authorized to work in Canada.

How we hire

We know that a great hire won’t meet every requirement that we’ve outlined. If you can see yourself elevating the team, we want to hear your story. Few of us would be here had we not taken a chance.

You can expect 4 interviews on Google Meet, in the order below. Should you successfully move through the entire process, you will have the opportunity to meet with a variety of Platformers. Our goal is to ensure you can make the most informed decision on whether this role and our culture align with what you’re looking for in your future working environment.

45 Minutes with Talent Acquisition

60 Minutes with Hiring Manager (Director, Site Reliability Engineering)

60 Minutes with Team (Site Reliability Engineer, Director, Site Reliability Engineering)

60 Minutes with Executive (Senior Director, Site Reliability Engineering)

All roles require background checks.

What we offer

A product you can believe in - Join us in transforming how businesses build and manage web applications, driven by making a positive impact as a proud B Corp.

An Award-Winning Workplace - We’ve been recognized by Forbes’ Top 30 Companies for Remote Jobs and France’s Best Workplaces for Women.

A culture that values your voice - Join a flexible, open, and inclusive work environment where your voice is encouraged, and your ideas shape our growth and evolution.

A global team - Collaborate with colleagues from diverse backgrounds across the world, embracing different perspectives.

Benefits and perks - Make the most of what matters to you

Flexible PTO

Company stock options

Professional development budget

Office equipment budget

Wellness budget

Annual team gatherings

Internet reimbursement

Inclusive parental leave

Remote work travel program

You belong here

At Platform.sh, we celebrate diversity in all its forms and are committed to fostering an inclusive, equitable, and supportive workplace where everyone can thrive. We embrace and value different perspectives, backgrounds, and experiences, because they make us stronger as a team. Whoever you are, wherever you're from, and whatever path you've taken, you are welcome here. We encourage you to bring your whole self to work, connect with others, and share your passion.

If you need accommodations at any stage of our hiring process, please let us know. We're here to ensure an accessible and comfortable experience for you.

Apply for this job

First Name

Last Name

Phone

Resume/CV

Where in the world are you currently living (and legally ready to work)? Select your country from the list!

Platform as a Service or Cloud Application Platforms aren’t your typical SaaS. Please share your LinkedIn profile so we can dive deeper into your cloud journey or related experience!

Please share your total annual compensation expectations and the currency you are referring to.
