High Performance Computing Engineer

Boson AI

Toronto

On-site

CAD 150,000 - 250,000

Full time

6 days ago

Job summary

Boson AI is seeking a Senior High Performance Computing Engineer to manage GPU clusters and related infrastructure in Toronto. Responsibilities include configuring systems and managing on-premises data center operations. Candidates should have a strong background in high-performance computing, programming skills (particularly in Python), and experience deploying and maintaining production-grade machine learning systems.

Qualifications

  • Strong background in high-performance computing.
  • Experience managing large hardware clusters.
  • Proficient in at least one programming language (e.g., Python).

Responsibilities

  • Manage large, private, high-end GPU clusters.
  • Handle the full lifecycle of physical systems, including deployment, operations, triage, and troubleshooting.
  • Automate configuration of on-premises Linux-based systems using infrastructure-as-code practices.

Skills

Problem-solving
High-performance computing
Programming (Python)
Data Center operations
GPU optimization

Job description

1 month ago

This range is provided by Boson AI. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

CA$150,000.00/yr - CA$250,000.00/yr

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of deep learning, optimization, NLP, AutoML, and statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior High Performance Computing Engineer to help operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:
  • Manage large, private, high-end GPU clusters
  • Handle the full lifecycle of physical systems, including deployment, operations, triage, and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm, and Kubernetes
  • Automate configuration of on-premises Linux-based systems using infrastructure-as-code practices
  • Configure and maintain the network, e.g., Layer 3 networking
  • Learn about new tools and deploy them

You might be a great fit if you have:
  • Strong background in high-performance computing
  • Experience with on-premises data center operations and technologies
  • Experience managing large hardware clusters
  • Proficiency in at least one programming language (e.g., Python) and ability to write clean, maintainable code
  • Experience designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates for hardware, e.g., Supermicro

The ability to solve problems and learn new techniques is key.

Seniority level

Not Applicable

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Hospitality, Food and Beverage Services, and Retail
