
IT Infrastructure Manager – HPC Environment

Groom & Associates

Montreal

Hybrid

CAD 80,000 - 100,000

Full time


Job summary

A renowned research institute in Montreal is seeking an IT Infrastructure Manager to lead its HPC environment. This role requires strategic planning, deep expertise in HPC operations, and the ability to manage a team. The ideal candidate will have over 10 years of IT infrastructure experience, including significant leadership experience. This opportunity includes competitive compensation and a comprehensive benefits package.

Benefits

4 weeks vacation
Comprehensive benefits package
RRSP matching program

Qualifications

  • 10+ years of IT infrastructure experience, with 5+ in leadership roles.
  • Deep expertise in HPC cluster design and operations.
  • Proven experience managing data centers, networks, and storage.

Responsibilities

  • Develop and execute an infrastructure roadmap.
  • Oversee design and optimization of HPC clusters.
  • Lead procurement processes for infrastructure components.

Skills

HPC cluster management
Leadership
Budget management
Virtualization
Bilingual (English & French)

Education

Bachelor’s or Master’s in Computer Science, Engineering, or related field

Tools

Slurm
InfiniBand
Docker
Ansible

Job description

Overview

IT Infrastructure Manager – HPC Environment

Location: Montreal, Quebec (Hybrid: 2 days of remote work allowed)

Type: Full-time, permanent

Compensation: Competitive and aligned with the market. Includes 4 weeks vacation, a comprehensive benefits package, and an RRSP matching program (with options to exchange for additional vacation or income).

The Organization

Our client is a globally recognized research institute dedicated to advancing artificial intelligence and machine learning. Known for pioneering contributions in areas such as deep learning, reinforcement learning, natural language processing, and generative models, the organization brings together leading researchers, students, and partners. Its mission is to serve as a global hub for scientific progress, fostering innovation in AI for the benefit of society.

The Role

The organization is seeking a highly experienced and visionary IT Infrastructure Manager to lead and evolve its mission-critical computing environment. This leader will be responsible for the strategic planning, design, implementation, and operation of advanced high-performance computing (HPC / AI) clusters, data centers, and network infrastructure. The successful candidate will play a pivotal role in ensuring that researchers and students have access to cutting-edge computing resources to push the boundaries of AI innovation.

Responsibilities

  • Strategic Leadership: Develop and execute an infrastructure roadmap aligned with research goals and emerging technologies.
  • HPC Cluster Management: Oversee the design, deployment, maintenance, and optimization of HPC clusters, ensuring availability, performance, and scalability.
  • Vendor & Procurement: Lead procurement processes (RFPs) for HPC clusters and infrastructure components, balancing cost-effectiveness with technical requirements.
  • Team Leadership: Mentor and manage a team of skilled engineers and administrators.
  • Operations & Reliability: Define best practices for monitoring, troubleshooting, and incident response to guarantee reliability.
  • Budget Oversight: Manage infrastructure budgets.
  • Security & Compliance: Implement strong security and compliance protocols across all infrastructure components.
  • Collaboration: Partner with researchers, faculty, and departments to deliver tailored computing solutions.
  • Innovation: Stay at the forefront of infrastructure and AI hardware trends to propose and deploy innovative solutions.

Qualifications

  • Bachelor’s or Master’s in Computer Science, Engineering, or a related field.
  • 10+ years of IT infrastructure experience, with 5+ in leadership roles.
  • Deep expertise in HPC cluster design and operations (Slurm, InfiniBand, Lustre, BeeGFS).
  • Proven experience managing data centers, networks, and storage systems.
  • Strong knowledge of virtualization and containerization (Proxmox, Docker, Podman).
  • Experience with Infrastructure as Code (Ansible, Terraform) and automation.
  • Excellent leadership, communication, and interpersonal skills.
  • Ability to manage projects and priorities in a dynamic research environment.
  • Strong bilingual communication (English & French).

Desirable Skills

  • Experience with GPU-accelerated computing and deep learning frameworks.
  • Familiarity with research computing environments and academic workflows.
  • Contributions to open-source communities.

Votre partenaire en recrutement – Your recruitment partner
