Overview
Huawei Canada has an immediate 12-month contract opening for a Machine Learning Software Engineer.
About the team
The Software-Hardware System Optimization Lab continuously improves the power efficiency and performance of smartphone products through software-hardware system optimization and architecture innovation. We track trends in cutting-edge technologies and build competitive strength in mobile AI, graphics, multimedia, and software architecture for mobile phone products.
About the job
- Profile and optimize end-to-end ML workloads and kernels to improve latency, throughput, and efficiency across GPU/NPU/CPU.
- Identify bottlenecks (compute, memory, bandwidth) and land fixes: tiling, fusion, vectorization, quantization, mixed precision, layout changes.
- Build/extend tooling for benchmarking, tracing, and automated regression/perf testing.
- Collaborate with compiler/runtime teams to land graph- and kernel-level improvements.
- Apply ML/RL-based techniques (e.g., cost models, schedulers, autotuners) to search for better execution plans.
- Translate promising research/prototypes into reliable, scalable production features and services.
The target annual compensation (based on 2080 hours per year) ranges from $78,000 to $168,000 depending on education, experience and demonstrated expertise.
About the ideal candidate
- Master's or PhD degree in Computer Science or a related field. Solid experience in ML systems or performance engineering (industry, OSS, or research). Fluency in Python and C++.
- Hands-on experience with at least one compute stack: CUDA/ROCm, OpenCL, Metal/Vulkan compute, Triton, or vendor or open-source NPUs.
- Practical knowledge of PyTorch or TensorFlow/JAX and inference/training performance basics (mixed precision, graph optimizations, quantization).
- Ability to turn ambiguous perf problems into measurable, repeatable experiments.
- AI compiler exposure: TVM, IREE, XLA/MLIR, TensorRT, or similar. Profiling skills (Nsight, perf, VTune, CUPTI/ROCm tools) and comfort reading roofline/memory-hierarchy signals.
- Experience with kernel scheduling/auto-tuning (RL, Bayesian or evolutionary-algorithm search) and hardware counters.
- Background with custom accelerators/NPUs, DMA/tiling/SRAM management, or quantization (INT8/FP8).
- Contributions to relevant OSS (links welcome).