Co-op Software Engineer - AI System & Infrastructure

Huawei Canada

Vancouver

On-site

CAD 60,000 - 100,000

Full time

3 days ago

Job summary

Huawei Canada is seeking a Co-op Engineer to join its Intelligent Cloud Infrastructure Lab. The role focuses on improving AI infrastructure by tackling scalability and performance challenges in current systems while developing new technologies. The ideal candidate holds a Master's or PhD in Computer Science or Computer Engineering, has a strong background in distributed systems, and is proficient in languages such as C/C++, Go, Java, Rust, Python, or C#.

Qualifications

Experience with container virtualization technologies.
Knowledge of distributed system design, including serverless technologies.

Responsibilities

Identify scalability issues in AI system infrastructure.
Collaborate with teams to improve system performance.

Skills

Building large-scale distributed systems

High-performance system design

Interpersonal and communication skills

Programming in C/C++, Go, Java, Rust, Python, C#

Education

Master's or PhD in Computer Science

Tools

Nvidia TensorRT

Triton servers

Kubernetes

PyTorch framework

CUDA libraries

GPU technologies

Huawei Canada has an immediate Co-op opening for an Engineer.

About the team:

The Intelligent Cloud Infrastructure Lab aims to innovate technologies, algorithms, systems, and platforms for next-generation cloud infrastructure. The lab addresses scalability, performance, and resource utilization challenges in existing cloud services while preparing for future challenges with appropriate technologies and architectures. Additionally, the lab aims to understand industry dynamics and technology trends to create a robust ecosystem.

About the job:

Understand the AI system and infrastructure technology landscape, and identify scalability and performance challenges in current LLM and multi-modal LLM systems.
Initiate and charter innovation projects to build or re-architect the AI infrastructure platform, and plan milestones accordingly.
Design, or contribute to the design or redesign of, a scalable and high-performance infrastructure architecture optimized for AI training and inference, including but not limited to cluster management and scheduling, LLM deployment, elastic LLM, and AI container cold/warm start-up optimization.
Collaborate with internal and external teams to deliver projects or project features that improve system scalability and performance.

The target annual compensation (based on 2080 hours per year) ranges from CAD 60,000 to CAD 100,000, depending on education, experience, and demonstrated expertise.

About the ideal candidate:

Master's or PhD degree in Computer Science or Computer Engineering.
Experience in building large-scale and high-performance distributed systems.
Experience with Nvidia TensorRT and/or Triton servers, and container virtualization technologies.
Knowledge and experience in distributed system design and development, including serverless technologies.
Work experience with one or more of the following technologies: vLLM, Ray, SGLang, Kubernetes, TensorRT-LLM, PyTorch framework, CUDA libraries, GPU technologies.
Proficiency in programming languages such as C/C++, Go, Java, Rust, Python, C#.
Excellent interpersonal and communication skills to collaborate effectively with multiple teams and build strong partnerships.
Proven success in working on software engineering problems that span multiple products.
