Principal Architect - Large Model and Training System Performance Optimization

Huawei

Vancouver

On-site

USD 121,000 - 230,000

Full time

Yesterday

Job summary

Huawei Canada is seeking a Principal Architect for its Computing Data Application Acceleration Lab. The role involves leading architectural design for AI training products, analyzing industry requirements, and collaborating with global teams to foster innovation. Ideal candidates will possess a Master’s or PhD and extensive experience in AI systems and integration.

Qualifications

  • 5+ years of experience in architecting large-scale AI training systems.
  • Hands-on experience with veRL or Ray for large-scale model training.
  • Solid understanding of deep learning fundamentals.

Responsibilities

  • Lead the architecture design of Ascend training products.
  • Analyze mainstream scenario requirements and industry technology trends.
  • Spearhead project planning and define technology roadmap.

Skills

Documentation skills
Communication skills
Proactive attitude
Problem-solving
Adaptability
Programming in C / C++
Programming in Python
Performance optimization

Education

Master’s or PhD in Computer Science, Math / Statistics

Tools

PyTorch
Nsight Systems
Nsight Compute
DLProf
Megatron
DeepSpeed

Job description

Huawei Canada has an immediate permanent opening for a Principal Architect.

About the team:

The Computing Data Application Acceleration Lab aims to create a leading global data analytics platform. The lab is organized into three specialized teams that use innovative programming technologies. This team focuses on full-stack innovations, including software-hardware co-design and optimizing data efficiency at both the storage and runtime layers. This team also develops next-generation GPU architecture for gaming, cloud rendering, VR / AR, and Metaverse applications.

One of the goals of this lab is to enhance algorithm performance and training efficiency across industries, fostering long-term competitiveness.

About the job:

  • Lead the architecture design of Ascend training products, driving the continuous evolution of architectural competitiveness.
  • Analyze mainstream scenario requirements and industry technology trends for Ascend, introducing innovative technologies to ensure sustained leadership in architectural competitiveness.
  • Identify requirements for MindX, AI frameworks, acceleration libraries, and chip hardware, building a robust software-hardware architecture for Ascend training to achieve ongoing commercial success.
  • Collaborate with other departments / teams from Huawei’s global research centers to align on strategic goals.
  • Spearhead project planning and define the technology / product development roadmap to guide long-term innovation.

The base salary for this position ranges from $121,000 to $230,000 depending on education, experience and demonstrated expertise.

About the ideal candidate:

  • Master’s or PhD in Computer Science, Math / Statistics, with a focus on AI & Deep Learning.
  • 5+ years of experience in architecting large-scale AI training systems or similar complex software-hardware integrated solutions.
  • Excellent documentation skills for writing internal reports and / or publishing research papers.
  • Effective communication skills for presentations to internal and external audiences.
  • A proactive attitude with a strong ability to tackle challenges and adapt to evolving requirements and a dynamic work environment.
  • Working knowledge of AI accelerators or full-stack AI acceleration systems and Deep Reinforcement Learning.
  • Hands-on experience with veRL or Ray for large-scale model training.
  • Familiarity with processor architectures and relevant work experience, with hands-on expertise in designing and developing complex system software architectures, and experience in performance optimization on GPU / NPU or similar hardware platforms.
  • Solid understanding of deep learning fundamentals, proficiency with the PyTorch framework, and practical experience in performance optimization using upper-layer distributed frameworks such as Megatron or DeepSpeed.
  • Strong programming skills with proficiency in C / C++ and Python.
  • Experience using performance analysis tools such as Nsight Systems, Nsight Compute, and DLProf.
