Machine Learning Software Engineer - GPU/NPU

Huawei Technologies Canada Co., Ltd.

Burnaby

On-site

CAD 78,000 - 168,000

Full time

2 days ago

Job summary

A leading technology company in Vancouver is looking for a Machine Learning Software Engineer. The role involves profiling and optimizing ML workloads and collaborating with compiler and runtime teams to improve performance. Ideal candidates hold a Master's or PhD degree in a relevant field, with solid experience in ML systems and fluency in Python and C++. The salary ranges from CAD 78,000 to CAD 168,000, depending on qualifications and experience.

Qualifications

Solid experience in ML systems or performance engineering.
Ability to turn ambiguous perf problems into measurable, repeatable experiments.
Contributions to relevant OSS are a plus.

Responsibilities

Profile and optimize end-to-end ML workloads.
Identify compute, memory, and bandwidth bottlenecks.
Collaborate with compiler/runtime teams for improvements.

Skills

Fluency in Python

Fluency in C++

Hands-on with CUDA/ROCm

Practical knowledge of PyTorch

Profiling skills (Nsight, perf)

Education

Master's or PhD in Computer Science or a related field

Tools

CUDA

OpenCL

TensorFlow

Triton

Overview

Huawei Canada has an immediate 12-month contract opening for a Machine Learning Software Engineer.

About the team

The Software-Hardware System Optimization Lab continuously improves the power efficiency and performance of smartphone products through software-hardware system optimization and architecture innovation. We track trends in cutting-edge technologies and build the competitive strength of mobile AI, graphics, multimedia, and software architecture for mobile phone products.

About the job

Profile and optimize end-to-end ML workloads and kernels to improve latency, throughput, and efficiency across GPU/NPU/CPU.
Identify bottlenecks (compute, memory, bandwidth) and land fixes: tiling, fusion, vectorization, quantization, mixed precision, layout changes.
Build/extend tooling for benchmarking, tracing, and automated regression/perf testing.
Collaborate with compiler/runtime teams to land graph- and kernel-level improvements.
Apply ML/RL-based techniques (e.g., cost models, schedulers, autotuners) to search for better execution plans.
Translate promising research/prototypes into reliable, scalable production features and services.

The target annual compensation (based on 2080 hours per year) ranges from CAD 78,000 to CAD 168,000, depending on education, experience, and demonstrated expertise.

About the ideal candidate

Master's or PhD degree in Computer Science or a related field.
Solid experience in ML systems or performance engineering (industry, OSS, or research).
Fluency in Python and C++.
Hands-on experience with at least one compute stack: CUDA/ROCm, OpenCL, Metal/Vulkan compute, Triton, or vendor or open-source NPUs.
Practical knowledge of PyTorch or TensorFlow/JAX and inference/training performance basics (mixed precision, graph optimizations, quantization).
Ability to turn ambiguous perf problems into measurable, repeatable experiments.
AI compiler exposure: TVM, IREE, XLA/MLIR, TensorRT, or similar.
Profiling skills (Nsight, perf, VTune, CUPTI/ROCm tools) and comfort reading roofline/memory-hierarchy signals.
Experience with kernel scheduling/auto-tuning (RL, Bayesian/EA search) and hardware counters.
Background with custom accelerators/NPUs, DMA/tiling/SRAM management, or quantization (INT8/FP8).
Contributions to relevant OSS (links welcome).