Technical Specialist - Infrastructure & Systems for Large Models

Huawei Canada

Edmonton

On-site

CAD 60,000 - 80,000

Full time

2 days ago

Job summary

A leading telecommunications company is seeking a Technical Specialist - Infrastructure & Systems to maintain the core infrastructure for large-scale AI training. This entry-level contract position in Edmonton requires 1-2 years of software engineering experience, proficiency in Python, and familiarity with ML frameworks like PyTorch or TensorFlow. Ideal candidates will have strong communication skills and a desire to learn about large-scale ML systems.

Qualifications

  • 1-2 years of software engineering experience.
  • Proficiency in Python; experience in backend or infrastructure development.
  • Familiarity with ML frameworks like PyTorch or TensorFlow.
  • Some exposure to distributed systems or cloud computing is a plus.
  • Understanding of software engineering best practices.

Responsibilities

  • Maintain core infrastructure for large-scale AI training.
  • Improve tools for managing training jobs across compute clusters.
  • Develop monitoring and logging tools for long-running jobs.
  • Collaborate with ML engineers on new training methods.

Skills

Software engineering
Python
ML frameworks (e.g., PyTorch, TensorFlow)
Linux
Docker
Communication skills
Collaboration

Tools

Docker
Linux

Job description

Position Overview :

Huawei Canada has an immediate 12-month contract opening for a Technical Specialist.

About the Team :

Founded in 2012, Noah’s Ark Lab is a prominent research organization focused on advancing artificial intelligence and related fields to benefit society and the company. The lab works on projects in large language models (LLMs), reinforcement learning (RL), natural language processing (NLP), computer vision, AI theory, and autonomous driving, and integrates the resulting innovations into products and services.

Job Responsibilities :

  • Maintain core infrastructure for large-scale AI training.
  • Contribute to data loading, training workflows, and checkpointing systems for distributed training.
  • Improve tools for managing training jobs across compute clusters (GPUs, TPUs, multi-node setups).
  • Develop monitoring and logging tools for reliable and observable long-running jobs.
  • Support optimization efforts like mixed precision and sharding to enhance training efficiency.
  • Collaborate with ML engineers and researchers on new training methods.
  • Scale systems, debug workloads, and ensure reproducibility of training pipelines.
  • Bridge research and infrastructure to accelerate AI development.

Candidate Requirements :

  • 1–2 years of software engineering experience.
  • Proficiency in Python; basic experience in backend or infrastructure development.
  • Familiarity with ML frameworks such as PyTorch or TensorFlow.
  • Some exposure to distributed systems, training jobs, or cloud computing is a plus.
  • Comfortable with Linux, Docker, and command-line tools.
  • Understanding of software engineering best practices, including testing and version control.
  • Eager to learn about large-scale ML systems and infrastructure design.
  • Strong communication and collaboration skills, and enjoyment of working in cross-functional teams.

Additional Details :

  • Seniority Level : Entry level
  • Employment Type : Contract
  • Job Function : Information Technology
  • Industry : Telecommunications
