Software Engineer - AI System & Infrastructure

Huawei Canada

Vancouver

On-site

CAD 110,000 - 210,000

Full time

3 days ago

Job summary

An established industry player is seeking an innovative Engineer to join their Intelligent Cloud Infrastructure Lab. This role focuses on addressing scalability and performance challenges in AI systems, driving projects to enhance infrastructure platforms. Ideal candidates will possess advanced degrees and experience with large-scale distributed systems, as well as familiarity with cutting-edge technologies like Nvidia TensorRT and Kubernetes. Join a dynamic team dedicated to shaping the future of cloud infrastructure and AI technologies, where your contributions will directly impact the evolution of next-generation solutions.

Qualifications

Master's or PhD in Computer Science or Engineering required.
Experience in building large-scale distributed systems.

Responsibilities

Identify scalability issues in AI systems and initiate innovation projects.
Design scalable architecture optimized for AI training and inferencing.

Skills

Distributed Systems

AI Infrastructure

Interpersonal Skills

Communication Skills

Education

Master's Degree in Computer Science

PhD in Computer Engineering

Tools

Nvidia TensorRT

Triton Servers

Kubernetes

PyTorch

CUDA Libraries

Huawei Canada has an immediate permanent opening for an Engineer.

About the team:

The Intelligent Cloud Infrastructure Lab aims to innovate technologies, algorithms, systems, and platforms for next-generation cloud infrastructure. The lab addresses scalability, performance, and resource utilization challenges in existing cloud services while preparing for future challenges with appropriate technologies and architectures. Additionally, the lab aims to understand industry dynamics and technology trends to create a robust ecosystem.

About the job:

Understand the AI system and infrastructure technology landscape, and identify scalability and performance challenges in current LLM and multi-modal LLM systems.
Initiate and charter innovation projects to build or re-architect the AI infrastructure platform, and plan milestones accordingly.
Provide or contribute to a scalable, high-performance architecture design or redesign for an infrastructure system optimized for AI training and inferencing, including but not limited to cluster management and scheduling, LLM model deployment, elastic LLM, and AI container cold/warm start-up optimization.
Collaborate with internal and external teams to deliver projects or project features that improve overall system scalability and performance.

The base salary for this position ranges from CAD $110,000 to $210,000, depending on education, experience, and demonstrated expertise.

About the ideal candidate:

Master's or PhD degree in Computer Science or Computer Engineering.
Experience in building large-scale, high-performance distributed systems.
Experience with Nvidia TensorRT and/or Triton servers, and with container virtualization technologies.
Knowledge of and experience in distributed system design and development, including serverless technologies.
Work experience in one or more of the following technologies: vLLM, Ray, SGLang, Kubernetes, TensorRT-LLM, PyTorch, CUDA libraries, GPU technologies.
Work experience in one or more of the following programming languages: C/C++, Go, Java, Rust, Python, C#.
Excellent interpersonal and communication skills to collaborate with multiple teams and build strong partnerships.
Demonstrated success working on software engineering problems that span multiple products.
