Senior Operations Manager - AI Infrastructure

Hypertec Group

Toronto

On-site

CAD 95,000 - 120,000

Full time

Yesterday

Job summary

A leading technology firm in Toronto is seeking a Senior Operations Manager to oversee AI infrastructure operations. The role entails managing a cross-functional team, ensuring the performance and scalability of GPU clusters, and driving DevOps practices. The ideal candidate has over 7 years of experience in IT operations and strong leadership skills. You will play a key role in shaping next-generation computing environments for AI research.

Qualifications

  • 7+ years in infrastructure or IT operations, with 3+ in leadership.
  • Experience managing high-performance computing environments.
  • Strong expertise in Linux systems and automation tooling.

Responsibilities

  • Manage a cross-functional team to ensure operational efficiency.
  • Oversee scaling and lifecycle management of AI clusters.
  • Drive adoption of Infrastructure-as-Code practices.

Skills

Team leadership
Infrastructure operations
DevOps practices
System reliability
Strategic planning
Collaboration with Networking teams

Tools

Terraform
Ansible
Kubernetes
Docker

Job description

Overview

Hypertec Group is recruiting a Senior Operations Manager - AI Infrastructure on behalf of 5C, which is seeking a strategic, hands-on leader to deploy and operate large-scale AI infrastructure.

Mission

You will manage a cross-functional team of System Administrators, DevOps Engineers, and Support Specialists, ensuring reliability, performance, and scalability across high-performance compute clusters. The ideal candidate has direct experience managing GPU / TPU-based AI clusters and will collaborate with Networking teams and System Architects to align day-to-day operations with system design, network performance, and long-term infrastructure strategy.

What You’ll Be Contributing

  • Team Leadership & Development: Lead, mentor, and grow a high-performing technical team. Set clear goals, track performance, and foster a culture of accountability and continuous improvement.
  • Infrastructure Operations: Oversee deployment, scaling, and lifecycle management of GPU-based AI clusters across on-premises and cloud environments. Ensure infrastructure performance, resilience, and cost efficiency for compute-intensive AI workloads. Partner with Networking teams to optimize high-bandwidth, low-latency connectivity. Work closely with System Architects to deliver scalable, maintainable infrastructure aligned with long-term goals.
  • DevOps & Automation: Champion Infrastructure-as-Code (IaC) practices for automated provisioning, configuration, and monitoring. Drive adoption of CI / CD pipelines for reliable infrastructure and model deployments.
  • System Reliability & Support: Maintain system performance, security, and availability through proactive monitoring, patching, and support. Lead incident response, root cause analysis, and continuous improvement initiatives.
  • Strategic Planning & Budgeting: Support capacity planning and roadmap development to meet future compute demands. Manage budgets for hardware procurement, cloud services, and licensing.
  • Compliance & Security: Partner with Security and Networking teams to implement access controls, monitoring, and compliance standards. Ensure adherence to regulatory and internal security policies.

What Sets You Apart

Required

  • 7+ years in infrastructure or IT operations, with 3+ years in a leadership role.
  • Proven experience managing high-performance computing environments (GPU / TPU clusters).
  • Strong expertise in Linux systems, distributed systems, and automation tooling.
  • Track record of collaboration with Networking teams for performance and security.
  • Experience aligning with System Architects to deliver scalable infrastructure.
  • Proficiency with DevOps tools (Terraform, Ansible, Kubernetes, Docker, CI / CD pipelines).
  • Familiarity with cloud platforms (AWS, GCP, Azure) and hybrid infrastructures.
  • Excellent leadership, communication, and organizational skills.

Preferred

  • Direct experience managing AI clusters for deep learning training / inference.
  • Background in AI / ML, data science, or high-throughput data processing.
  • Experience with HPC and workload schedulers (Slurm, Kubernetes).
  • Relevant certifications (cloud, networking, DevOps).
  • Knowledge of observability tools (Prometheus, Grafana, ELK stack).

Why Join Us

At the forefront of AI infrastructure innovation, you’ll play a pivotal role in scaling next-generation compute environments that power cutting-edge AI research and applications. This is an opportunity to lead high-impact operations in a rapidly growing, collaborative environment.

Note to Applicants

This recruitment is being managed by Hypertec on behalf of our partner organization, 5C. If selected, you will be hired directly by the partner company and will join their team as Senior Operations Manager - AI Infrastructure. We are supporting them in identifying top talent during a period of exciting growth.

About 5C Group

5C Group is a next-generation AI Digital Infrastructure provider established from the acquisition of 5C Data Centers by Hypertec Cloud. With over 2 gigawatts (GW) of roadmap capacity and the ability to power hundreds of thousands of GPUs, 5C Group delivers secure, reliable, and sustainable data center and AI infrastructure solutions at scale to the largest and most demanding AI users. For more information, please visit www.5c.ai.

Seniority level: Mid-Senior level

Employment type: Full-time

Job function: Management and Manufacturing

Industries: Computer Hardware Manufacturing
