Site Reliability Engineering Specialist (Hybrid)

PowerToFly

Montreal (administrative region)

Hybrid

CAD 80,000 - 110,000

Full time

Yesterday

Job summary

A financial services firm in Montreal is seeking a Site Reliability Engineering (SRE) Specialist to enhance performance and reliability within the Data Protection team. Responsibilities include optimizing system reliability and managing operational issues, while the ideal candidate should possess strong Linux and database troubleshooting skills as well as programming expertise in Python. A commitment to understanding complex systems and delivering self-healing solutions is essential. This position offers a hybrid work environment, blending remote work with office attendance.

Qualifications

  • Strong Linux troubleshooting skills.
  • Strong experience of database administration or troubleshooting.
  • Development skills in Python or similar for task automation.

Responsibilities

  • Maximize system availability and performance.
  • Identify and prioritize technical debt impacting reliability.
  • Complex troubleshooting in a Linux environment.
  • Deliver improved observability with performance metrics.

Skills

Linux troubleshooting
Database administration
Task automation (Python)
Excellent communication
Relationship building

Tools

Talend
Kafka
MQ

Job description

We're seeking someone to join our Data Protection Fleet as a Site Reliability Engineering (SRE) Specialist in Cyber to help drive performance, reliability, enhanced observability and efficiency for the department’s Data Obfuscation system. The Data Obfuscation team is responsible for implementing the systems, tooling and processes used by developers and engineers within the Firm for transferring large datasets across environments, whilst anonymizing specific fields, in order to provide an additional layer of protection against data leakage. The overall aim is to enable Morgan Stanley to increase release velocity whilst delivering at scale without compromising security or reliability.

Reporting directly to the Data Obfuscation Product Owner, this role requires delivering a range of SRE practices within a growing global squad of approximately 20 engineers. This means partnering with colleagues to deliver reliable, resilient systems without wasteful operational effort. SRE practices include task optimization and automation, prioritizing technical debt, observability and monitoring dashboards, problem elimination and incident response.

This is an operationally focused DevOps role requiring participation in an on-call rotation. The successful candidate might be a developer today looking to evolve site reliability as a practice, or an infrastructure specialist, or a strong system admin with some task automation experience. Linux and general database troubleshooting skills, along with hands‑on experience using Python for task automation, are therefore essential to the role, and an aptitude to learn or grow these skills is required. The role offers the opportunity to work on every aspect of our internally built systems, which include connectivity to multiple database platforms (including Postgres, Snowflake, DB2, Sybase, MSSQL and MongoDB), Talend ETL, Java, Angular, Python, Shell, Web infra and a range of other technologies.

Prior experience in the financial industry is not required; candidates from software companies and other industries are welcome. We support and nurture our employees to grow in their roles, so an aptitude and willingness to learn new skills or grow existing ones is encouraged. You will gain proficiency in the role’s responsibilities over time, with the option to become a leader or role model in a technology or SRE practice area in the future.

What you’ll do in the role:
  • Committing to understand the range of products in our ecosystem, with a view to specializing in at least one and optimizing the end‑to‑end workflow.
  • Maximizing the availability and performance of supported systems through optimized and automated plant management, enhanced observability, ongoing problem management and architecture reviews with peers.
  • Identifying and prioritizing technical debt that is impacting system reliability, performance or squad efficiency, through the elimination of operational issues, optimization and automation of tasks, development of operational tools and driving client self‑service to minimize human dependency for support or maintenance.
  • Troubleshooting complex issues in a Linux environment, with a focus on collaborating with others to identify the underlying cause of issues and agreeing on lasting improvements that can be made.
  • Exploring and delivering improved observability including performance metrics, actionable logging, tracing and meaningful alerting that can define and measure the target reliability of a product.
  • Being sensitive to clients’ needs (i.e. the Firm’s community of internal developers) to help maximize their productivity, including troubleshooting their issues and developing “self‑healing” solutions.
  • Minimizing the issue escalation rate to ensure the squad has the greatest possible flow of feature delivery.
  • Being dependable and operationally responsive during agreed hours, including sharing on‑call rotation with the rest of the global team (with a time‑off in lieu system).

What you’ll bring to the role:
  • Strong Linux troubleshooting skills.
  • Strong experience of database administration, engineering or troubleshooting (ideally including performance optimization).
  • Development skills in any programming language (ideally Python) for task automation.
  • Excellent oral and written communication.
  • Ability to establish effective relationships with colleagues and clients to collaborate on successful delivery and/or troubleshooting.
  • Ability to respond appropriately during occasional technical emergencies, such as outages.

Desirable skills:
  • Experience with data transfer technologies such as ETL (e.g. Talend, Informatica), Kafka, MQ, etc. – ideally Talend.
  • Software engineering or data engineering experience.
  • Experience of being an operational point of escalation.

All our positions are located in Montreal, Quebec. We offer a hybrid work environment, combining remote work and attendance in the office.

Knowledge of French and English is required.

Morgan Stanley is an equal opportunities employer. We work to provide a supportive and inclusive environment where all individuals can maximize their full potential.
