GPU Communications Architect

Germany
EUR 80,000 - 120,000
Posted 2 days ago
Job description

Project description

The ROCm Communication Collectives Library (RCCL) is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs. It uses PCIe and xGMI high-speed interconnects.

Responsibilities

  • Provide deep technical leadership and guidance for GPU communication technologies, and define the technical vision and direction for the GPU communication software stack.
  • Engage with executives and key stakeholders to provide insight into industry trends and recommend strategic initiatives. Influence the future direction of the company's technical portfolio.
  • Represent AMD in leadership positions at industry organizations and standards bodies.
  • Engage with clients and industry partners to understand their technical needs in depth, and deliver tailored solutions that draw on your experience with strategic customer engagements and architectural wins.
  • Collaborate with hardware and software architects, system engineers and business teams in identifying requirements and building roadmaps for future products.
  • Mentor engineers and technical leaders, fostering a culture of innovation and excellence. Help develop the next generation of leaders through coaching, training, and feedback.

Skills

Must have

  • Experience architecting and developing communication software solutions for accelerators using RDMA and accelerator-to-accelerator fabrics (e.g., Infinity Fabric, UALink), from low-level device drivers and OS internals up through applications and AI/ML frameworks
  • Experience with communication stacks and low-level GPU drivers
  • Experience developing or modifying compiler toolchains or debugging tools
  • Deep understanding of graphics pipeline internals and GPU architecture
  • Deep expertise with distributed programming models (MPI, SHMEM), and the implementation and optimization of collective communication algorithms
  • Deep expertise with RoCE, RDMA, and network topologies
  • Experience with system software development in C/C++, GPU software development, and parallel programming
  • Analytical and performance analysis skills
  • Effective communication and problem-solving skills
  • Proven history of thought leadership in communication software, backed by patents, publications, and participation in industry standards bodies

Nice to have

  • An advanced degree, such as a Master's or Ph.D., is preferred