Principal Solutions Engineer, Infrastructure (SLURM & AI Focus)

AMD

Santa Clara (CA)

On-site

USD 163,000 - 235,000

Full time

5 days ago

Job summary

AMD is seeking a Principal Solutions Engineer with an AI focus to design and optimize high-performance computing infrastructure. The ideal candidate will have a deep understanding of SLURM, RDMA networking, and storage solutions. Join AMD, where your contributions will shape the future of data center technology.

Qualifications

Extensive SLURM experience in production HPC environments.
Expert knowledge of RDMA technologies.
Hands-on GPU computing and Linux system administration skills.

Responsibilities

Build and design large GPU-accelerated clusters for AI/ML workloads.
Integrate SLURM with Kubernetes for hybrid workload management.
Architect parallel file systems like Lustre for AI data needs.

Skills

RDMA networking

Collective communications

Container orchestration

AI/ML workload optimization

Education

Bachelor’s degree in Computer Science, Engineering, or related field

Advanced degree preferred

Tools

SLURM

Kubernetes

Python

Bash

This range is provided by AMD. Your actual pay will be based on your skills and experience; talk with your recruiter to learn more.

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

The Role

The AMD Datacenter GPU team is seeking an experienced Solutions Engineer specializing in high-performance computing infrastructure for AI workloads. This role focuses on designing, deploying, and optimizing GPU-accelerated computing environments for AI use-cases using SLURM as the primary workload manager.

The Person

The ideal candidate will have deep expertise in multi-tenant schedulers for large-scale AI clusters, RDMA networking, collective communications, container orchestration, and storage solutions optimized for AI/ML workloads.

Key Responsibilities

AI Infrastructure Design

Build and design large GPU-accelerated clusters for AI/ML workloads
Develop reference architectures for SLURM-based HPC environments
Integrate SLURM with Kubernetes for hybrid workload management
Design storage systems to support high-speed AI training pipelines

SLURM Optimization & Management

Configure and optimize SLURM for efficient AI/ML scheduling and resource use
Use advanced SLURM features such as GPU-aware scheduling, MPI integration, container runtime support, and fair-share policies
Develop SLURM plugins and customizations for AI workloads

Networking & Interconnect

Design RDMA network setups (InfiniBand, RoCE) for fast data transfer
Optimize collective communications for distributed training (e.g., all-reduce)
Configure GPU Direct RDMA and topology-aware job scheduling

Storage Solutions

Architect parallel file systems such as Lustre, GPFS, and BeeGFS for AI data needs
Implement high-performance scratch storage and tiered data management
Optimize I/O patterns and manage data lifecycle for training datasets

Container Orchestration & Integration

Collaborate on Kubernetes operators for SLURM integration
Develop strategies for seamless containerized AI workload management
Build CI/CD pipelines and enable hybrid cloud deployments

Collaboration & Support

Work with research teams and customers to meet AI computing needs
Provide technical guidance and training
Create documentation and best practices
Partner with vendors on hardware and software selection

Preferred Experience

Technical Skills

Extensive SLURM experience in production HPC environments
Expert knowledge of RDMA technologies and collective communications
Hands-on GPU computing and Linux system administration skills
Experience with parallel file systems and scripting (Python, Bash, Go)

Container & Orchestration

Production Kubernetes experience in HPC settings
Familiarity with Kubernetes SLURM plugin and container runtimes (Singularity, Docker)
Experience with Helm and Kubernetes operators

AI/ML Infrastructure

Understanding of AI frameworks (PyTorch, TensorFlow, JAX) and distributed training
Knowledge of AI workload optimization and MLOps practices

Education

Bachelor’s degree in Computer Science, Engineering, or related field
Advanced degree preferred

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Seniority level
Mid-Senior level

Employment type
Full-time

Job function

Industries
Semiconductor Manufacturing
