Site Reliability Engineer / Platform Operations Engineer

Targeted Talent

Vancouver, Winnipeg, Montreal

Remote

CAD 80,000 - 110,000

Full time

Today

Job summary

A leading global tech firm is seeking an experienced Site Reliability Engineer to lead projects and strengthen operational response. You will design Wargames, troubleshoot production issues, and mentor team members while operating AWS platforms. Ideal candidates will have strong troubleshooting, AWS, and Java expertise. This role offers a competitive salary and great perks. It is remote initially, with later relocation to Calgary or Winnipeg.

Benefits

Competitive salary
Great perks

Responsibilities

  • Own development projects and deliver against the engineering roadmap.
  • Design and implement Wargames to test operational responses.
  • Act as technical escalation for SOC engineers during major incidents.
  • Troubleshoot and mitigate issues in production environments.
  • Mentor team members.
  • Operate global AWS Platforms at scale.

Skills

Troubleshooting
Problem-solving
Investigative skills
Experience of AWS
Java development
Incident management
Distributed web applications
Automating tasks
Data structures understanding
Mentoring
Identifying improvements

Tools

Ansible
Terraform
Python
ELK
Prometheus
Grafana

Job description

We are looking for an experienced Site Reliability Engineer or Platform Operations Engineer for our client. This is a permanent position that is remote to start, with later relocation to Calgary or Winnipeg. Our client is a global enterprise company with a product that you've likely used.

You Will:
  • Own development projects, provide technical guidance, and deliver against the Platform & Service Operations Engineering roadmap.
  • Design and implement Wargames to test our operational response and identify areas of weakness in our platforms.
  • Act as a technical and management escalation point for Service Operations Centre (SOC) engineers and during major incidents.
  • Troubleshoot, reproduce, and mitigate issues in our production environments.
  • Mentor other team members.
  • Operate global AWS Platforms at scale.
You Have:
  • Evidence of strong troubleshooting, problem-solving, and investigative skills
  • Experience of AWS or other cloud providers
  • Experience developing in Java
  • Experience of major incident management while operating production platforms at scale
  • Experience working with distributed web applications
  • Experience automating operational tasks and processes using other languages
  • Understanding of relational and/or NoSQL data structures
  • Experience mentoring and influencing peers
  • Ability to identify improvements, weigh risks against benefits, and translate them into technical requirements
Bonus:
  • Experience working with Ansible, Terraform, Python
  • Experience working with Serverless / Containers
  • Experience of ELK and/or Graphite / Prometheus / Grafana
  • Experience using tracing tools in production
  • Experience in Chaos Engineering / Failure Injection Testing
  • Experience of working in an Agile environment
  • Experience working in a similar site reliability role
This role offers great perks and a competitive salary. Please apply to the job posting if it matches your career path!