High Performance Computing Engineer

Boson AI

Toronto

On-site

CAD 150,000 - 250,000

Full time

8 days ago

Job summary

Boson AI is seeking a Senior High Performance Computing Engineer to support and operate advanced GPU and network systems essential for their cutting-edge AI technologies. This full-time role in Toronto involves managing high-performance computing clusters and ensuring optimal operation of infrastructure technologies, making it an ideal position for those passionate about optimizing machine learning systems.

Qualifications

  • Experience managing a large hardware cluster.
  • Ability to configure and maintain network switches.
  • Familiarity with GPU utilization for machine learning workloads.

Responsibilities

  • Manage large, private, high-end GPU clusters.
  • Take responsibility for the full lifecycle of physical systems.
  • Automate on-premises Linux-based systems at scale.

Skills

Problem Solving
Strong background in high performance computing
Proficiency in Python
Experience with Data Center operations

Tools

Slurm
MAAS
Ceph
NVIDIA DeepOps
InfiniBand

Job description

This range is provided by Boson AI. Your actual pay will be based on your skills and experience — talk with your recruiter to learn more.

Base pay range

CA$150,000.00/yr - CA$250,000.00/yr

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola and Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and the ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20PB of storage, Terabit networking and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

  • Manage large, private, high-end GPU clusters
  • Take responsibility for the full lifecycle of physical systems, including deployment of new hardware, operations, triage and troubleshooting
  • Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)
  • Configure and maintain MAAS, Ceph, Slurm and Kubernetes
  • Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices
  • Configure and maintain the network, e.g. Layer 3 networking
  • Learn about new tools and deploy them


You might be a great fit if you have:

  • Strong background in high performance computing
  • Experience with on-premises Data Center operations and technologies
  • Experience in managing a large hardware cluster
  • Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code
  • Experience in designing, deploying, and maintaining production-grade machine learning systems at scale
  • Familiarity with GPU utilization for machine learning workloads and optimization techniques
  • Experience managing firmware and system updates, e.g. on SuperMicro hardware


The ability to solve problems and to learn new techniques is key.

  • Seniority level: Not Applicable
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industries: Research Services
