System Reliability Engineering Lead

CGI

Toronto

On-site

CAD 100,000 - 130,000

Full time

3 days ago

Job summary

As a System Reliability Engineer at a leading IT services firm, you’ll design and support developer productivity tools, utilizing a range of technologies such as Terraform and Ansible. The role demands strong software engineering skills and a collaborative spirit to enhance operations and user experience in a dynamic environment.

Qualifications

10 years of experience in software engineering, development, or system operations.
Strong in Linux internals and performance tuning.
Experience automating deployments using Terraform, Ansible, and Python.

Responsibilities

Keep the developer productivity platform up and running.
Automate infrastructure deployment and debugging.
Lead team meetings and root cause analysis sessions.

Skills

Linux

Shell scripting

DevOps engineering

Automation

Networking principles

Monitoring tools

Cloud platforms

Kubernetes

Tools

Terraform

Ansible

GitLab

SonarQube

Artifactory

We are Canada's largest independent information technology services firm, and after 40 years, we're still growing! Innovation, technology, and service delivery are our focus. Our goal is to ensure our clients remain ahead of the competition. We provide a full spectrum of managed services from IT and business process outsourcing to systems integration and consulting that are transforming our clients’ operations and helping them to succeed.

Do you enjoy working with a highly motivated and talented team to deliver mission-critical developer tooling? We are currently expanding our System Reliability Engineering team, which helps one of our key clients deploy, manage, troubleshoot, and enhance their developer tooling platform, servicing over developers.

Your future duties and responsibilities:

As a System Reliability Engineer, you will be responsible for designing, implementing, and supporting a variety of developer productivity tools, including Ansible Tower, GitLab, Artifactory, and SonarQube. The technology stack used to manage the platform includes Ansible, Terraform, Python, Prometheus, Splunk, and ELK.

You will build automation solutions to provision and validate infrastructure and help debug and resolve problems. You will help to improve operational performance by focusing on user experience, effectively assessing and managing risk, and minimizing the impact of failures.

Required qualifications to be successful in this role:

Responsibilities

Keeping all components of the developer productivity platform up and running
Working closely with internal partners and platform users to ensure that all services meet security, SLA, and performance requirements
Writing, updating, and using documentation, including runbooks and playbooks
Automating infrastructure deployment, testing, application failover, failure mitigation, user self-service functions, and more
Debugging complex problems across the entire stack
Participating in meetings with the Operations and Delivery teams
Leading daily and weekly meetings to discuss the overall health of the systems
Leading Root Cause Analysis calls
Proposing and implementing monitoring improvements, optimization, and automation opportunities
Taking part in PI (Program Increment) Planning sessions

Key Skills and Attributes

10 years of experience with software engineering, software development, or system operations
Experience working with Linux, including writing shell scripts and understanding Linux internals and performance tuning
Strong understanding of networking principles
Experience in building, implementing, and supporting highly available production systems
Experience automating infrastructure and deployments using Terraform, Ansible, and Python or equivalent technologies
Understanding of DevOps engineering, CI/CD, and software deployment
Working knowledge of developer tooling such as Artifactory, GitLab, SonarQube, and Ansible Tower
Experience with various monitoring and observability tools
Experience deploying and managing workloads on one of the major public cloud platforms or on private clouds such as OpenStack
Experience deploying and managing workloads on one of the major container management platforms like Kubernetes, OpenShift, PCF or Rancher
A curiosity about how complex socio-technical systems operate and what happens during failure

It’s not expected that any single candidate would have experience across all these areas – we are looking for someone who is strong in a few areas and has interest and curiosity in others.
