Role
Site Reliability Engineer
Location
Toronto, ON
Duration
Contract
Responsibilities
- Site Reliability Engineering (SRE): Provide hands-on SRE support, including incident management, problem management, root cause analysis (RCA), monitoring, alerting, and infrastructure maintenance.
- Track, audit, monitor, and implement technical work streams.
- Act as Portfolio SME (Subject Matter Expert) to document common components, core functionalities, and infrastructure of supported applications.
- Serve as an escalation point in on-call rotation; support maintenance, scheduled work, and release deployment requirements.
- Lead incident and problem management for applications in scope and ensure RCA action items are fulfilled.
- Drive continuous improvement, technical standards, and automation opportunities in monitoring, tooling, and productivity.
- Manage technology currency, including server patching, certificate renewal, and compliance.
- Research and implement best-in-class technical solutions relevant to the RBC environment and its needs.
- Collaborate with unit, department, and enterprise teams to develop cross-enterprise solutions.
- Engineering: Develop SRE solutions such as monitoring and alerting systems, machine learning anomaly detection, self-healing, and reliability testing.
- Apply design-thinking and agile practices alongside SREs, Scrum Masters, and Incident Leads.
- Contribute to and leverage SRE best practices.
- Simplify development by building repeatable solutions to manual tasks.
- Promote adoption of automation solutions for applications in scope.
- Production Support: Perform production support roles, including off-hours support and rotational on-call duties.
- Assist in incident and problem management for applications in scope.
- Evaluate and improve processes to prevent future issues.
- Ensure availability and uptime of applications as per Service Level Objectives (SLOs).
- Ensure compliance, including segregation of duties.
- Technical Consultation: Provide guidance for initiatives beyond the application or squad level.
- Consult on product builds for other teams within RBPT and enterprise-wide.
- Innovation and Learning: Stay updated on technology changes through formal training and self-learning.
- Demonstrate new technology findings via team demos.
Must-Have Qualifications
- Bachelor's degree in Computer Science, Engineering, Mathematics, Physics, or equivalent practical experience.
- years of experience in SRE or related fields.
- Advanced knowledge and hands-on experience with:
  - Programming & Scripting: Python, YAML, Shell scripting
  - Cloud & OS: Azure, Linux
  - Monitoring & Observability: Dynatrace, Prometheus, PagerDuty, Moog, Splunk, Elastic, Azure Monitor
  - Reliability Practices: Chaos Engineering
  - Messaging Systems: MQ, Kafka
  - Automation Tools: Ansible, Azure Automation, Catchpoint
  - Production support, including off-hours and on-call rotations
Additional Experience (Less Than Year)
- Dynatrace
- Kafka
- Network programming (Perl, Python, Java, etc.)
- Microsoft Azure