Site Reliability Engineer

ALLTECH CONSULTING SVC INC

Quebec

On-site

CAD 90,000 - 130,000

Full time

17 days ago

Job summary

A leading consulting services company is seeking a Site Reliability Engineer to strengthen operational support and reliability engineering for critical products. The role aims to maximize developer productivity by implementing effective systems and processes across a broad development environment. Candidates transitioning from software development are encouraged to apply; strong operational skills combined with automation experience, such as Python, will be valuable.

Qualifications

  • Strong Linux troubleshooting skills required.
  • Experience in automation, preferably with Python.
  • Familiarity with CI/CD and deployment tools essential.

Responsibilities

  • Maximize system availability and performance through automation.
  • Collaborate with other SREs and troubleshoot complex issues.
  • Participate in an on-call rotation for operational support.

Skills

Linux troubleshooting
Automation
Monitoring tools
Collaboration
Communication

Tools

Prometheus
Grafana
Docker
Kubernetes

Job description

Technology/Role/Department at our Company: Enterprise Technology & Services (ETS) delivers shared technology services for the Firm, supporting all business applications and end users. ETS provides capabilities for all stages of the Firm’s software development lifecycle, enabling productive coding, functional and integration testing, application releases, and ongoing monitoring and support for over 3,000 production applications. ETS also delivers all workplace technologies (desktop, mobile, voice, video, productivity, intranet/internet) in integrated configurations that boost the personal productivity of our employees.

Application and end-user services are delivered on a scalable, secure, and reliable infrastructure composed of seamlessly integrated datacenter, network, compute, cloud, storage, and database services. Application Infrastructure (AI) strives to maximize business application developers’ productivity by centrally providing the core development lifecycle tools, reusable software libraries, and middleware, thus minimizing duplicated effort across silos. We also focus on the path into production and provide tooling to monitor systems, applications, hosts, logs, and infrastructure inventory.

Our goal is to provide infrastructure that is broadly reusable, scalable, reliable and highly performant to meet the demanding needs of our applications.

Role Overview: The Company’s Development Environment department (MSDE) is seeking a Site Reliability Engineer to drive reliability engineering, operational support, and customer consultation services for key products. MSDE is part of the Application Infrastructure organization and is responsible for shaping the SDLC within the Company by implementing the tools, systems, and processes used by 17,000+ developers for software development and deployment.

Reporting to the SRE Lead for MSDE’s engineered products, this role involves growing SRE capabilities to deliver reliable systems efficiently and developing a thorough understanding of MSDE’s products to maximize developer productivity across the Firm.

This is a production-side, operational role that requires participation in an on-call rotation and strong influencing skills with technical stakeholders. Much of the day-to-day operational work can be delegated to the team's ops staff.

The successful candidate may be a Python developer aiming to evolve into reliability engineering or a strong operational lead with Python experience. Prior experience in finance is not required; candidates from software or other industries are welcome.

Job Responsibilities:
• Building and maintaining comprehensive knowledge of the Company’s development environment
• Maximizing system availability and performance through automation, problem management, and architecture reviews
• Reducing support costs via operational issue elimination, automation, operational tool development, and client self-service
• Identifying and prioritizing technical debt impacting productivity, reliability, or support efficiency
• Collaborating with other SREs to share solutions
• Troubleshooting complex environment issues
• Enhancing Ops team knowledge and support capabilities to reduce escalations
• Consulting with development teams to improve productivity and troubleshoot issues
• Experimenting with new tools and techniques
• Sharing on-call responsibilities within the global team

Required Qualifications / Skills:
• Strong Linux troubleshooting skills
• Automation experience in any language, preferably Python
• Experience with monitoring/observability tools like Prometheus and Grafana
• Familiarity with version control, issue tracking, CI/CD, automated testing, and deployment automation tools
• Excellent communication and collaboration skills

Desired Skills:
• Knowledge of SRE practices such as SLOs, error budgets, blameless postmortems, and toil reduction
• Experience with Docker/Kubernetes