Technical Specialist - Infrastructure & Systems for Large Models

Huawei Canada

Edmonton

On-site

CAD 60,000 - 80,000

Full time

6 days ago

Job summary

An innovative technology firm in Edmonton is looking for a Technical Specialist - Infrastructure & Systems for Large Models. The role involves maintaining infrastructure for AI training, developing tooling, and supporting training optimization efforts. Ideal candidates have software engineering experience, proficiency in Python, and a strong desire to learn about large-scale machine learning systems. This is an entry-level contract position.

Qualifications

  • 1–2 years of software engineering experience.
  • Basic experience in backend or infrastructure development.
  • Familiarity with ML frameworks such as PyTorch or TensorFlow.
  • Some exposure to distributed systems, training jobs, or cloud computing is a plus.
  • Strong communication and collaboration skills.

Responsibilities

  • Maintain core infrastructure for large-scale AI training.
  • Contribute to data loading, training workflows, and checkpointing systems.
  • Develop monitoring and logging tools that keep long-running jobs reliable and observable.

Skills

Python
Software engineering
Machine Learning frameworks
Linux
Docker
Command-line tools
Job description
Position Overview

Huawei Canada has an immediate 12-month contract opening for a Technical Specialist.

About the Team

Founded in 2012, Noah’s Ark Lab is a prominent research organization focused on advancing artificial intelligence and related fields to benefit society and the company. The lab works on impactful projects involving LLMs, RL, NLP, computer vision, AI theory, and autonomous driving, and integrates its innovations into products and services.

Job Responsibilities

  • Maintain core infrastructure for large-scale AI training.
  • Contribute to data loading, training workflows, and checkpointing systems for distributed training.
  • Improve tools for managing training jobs across compute clusters (GPUs, TPUs, multi-node setups).
  • Develop monitoring and logging tools that keep long-running jobs reliable and observable.
  • Support optimization efforts such as mixed-precision training and sharding to improve training efficiency.
  • Collaborate with ML engineers and researchers on new training methods.
  • Scale systems, debug workloads, and ensure reproducibility of training pipelines.
  • Bridge research and infrastructure to accelerate AI development.

Candidate Requirements

  • 1–2 years of software engineering experience.
  • Proficiency in Python; basic experience in backend or infrastructure development.
  • Familiarity with ML frameworks such as PyTorch or TensorFlow.
  • Some exposure to distributed systems, training jobs, or cloud computing is a plus.
  • Comfortable with Linux, Docker, and command-line tools.
  • Understanding of software engineering best practices, including testing and version control.
  • Eager to learn about large-scale ML systems and infrastructure design.
  • Strong communication and collaboration skills; enjoys working in cross-functional teams.

Additional Details

  • Seniority Level: Entry level
  • Employment Type: Contract
  • Job Function: Information Technology
  • Industry: Telecommunications