Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)

WNTD

Greater London

Hybrid

GBP 70,000 - 90,000

Full time

Today

Job summary

A leading tech company is seeking a highly skilled Solution Architect in London, specializing in NVIDIA GPU clusters. This role involves designing and validating high-performance infrastructure for AI and ML workloads. Candidates should have deep experience in GPU architecture and cluster orchestration. The position offers a hybrid working model, requiring one day a week onsite in London, along with opportunities to work with cutting-edge technology.

Benefits

Hybrid working model
Opportunity to work on cutting-edge AI projects
Exposure to next-generation GPU infrastructure

Qualifications

  • Proven experience architecting NVIDIA GPU clusters at scale.
  • Strong hands-on understanding of GPU interconnects and architectures.
  • Deep knowledge of high-performance networking architectures.

Responsibilities

  • Lead the architecture of NVIDIA GPU clusters.
  • Validate hardware and platform selections against customer requirements.
  • Collaborate with DevOps teams to validate cluster orchestration.

Skills

NVIDIA GPU clusters experience
GPU interconnects (NVLink/NVSwitch) knowledge
InfiniBand and high-performance networking
Cluster orchestration (Kubernetes, Slurm)
AI/ML workload requirements familiarity
Linux systems engineering

Job description

Job Specification: Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)

Location: London (1 day per week onsite)
Travel: Occasional travel to datacenter sites outside the UK
Engagement: Contract Inside IR35
Department: Engineering/Advanced Compute

Role Overview

We are seeking a highly skilled Solution Architect with deep experience in designing, validating, and delivering end-to-end NVIDIA GPU clusters in enterprise and hyperscale environments. This individual will own the full life cycle of architectural design, from requirements gathering through implementation oversight and performance validation. They will work closely with engineering, networking, DevOps, security, and datacenter operations teams to ensure high-performance, scalable, and resilient GPU infrastructure for AI, HPC, and ML workloads.

The role is hybrid, with one day per week onsite in London and occasional international travel to support datacenter design reviews, deployment validation, or site acceptance testing.

Key Responsibilities
Architecture & Design
  • Lead the architecture of NVIDIA GPU clusters leveraging technologies such as H100/H200, NVLink, NVSwitch, DGX, HGX, or SuperPod-class designs.
  • Produce high-level and low-level designs (HLD/LLD), including compute, network, storage, and power/cooling considerations.
  • Validate hardware and platform selections, ensuring architectural alignment with customer requirements and scalability goals.
  • Design fabric architectures including InfiniBand (200/400 Gb/s), RoCE, and high-performance east-west traffic patterns.
  • Ensure designs adhere to NVIDIA reference architectures (NVIDIA AI Enterprise (NVAIE), Base Command, DGX SuperPod specs, etc.).
Cluster Integration & Validation
  • Define and execute validation test plans for GPU cluster performance, resilience, networking throughput, and workload behaviour.
  • Oversee integration of GPU nodes, networking, and storage systems into the existing datacenter environment.
  • Collaborate with DevOps/Platform teams to validate cluster orchestration (Kubernetes, Slurm, Bright Cluster Manager, or equivalents).
  • Validate firmware, drivers, NCCL, CUDA libraries, and container environments for production readiness.
Deployment & Delivery Oversight
  • Provide technical leadership across the full deployment life cycle.
  • Partner with datacenter operations to ensure correct rack layouts, cabling, airflow and power design.
  • Support delivery teams during build-out phases, ensuring the design is executed correctly.
  • Participate in factory acceptance tests (FAT), site acceptance tests (SAT), and operational readiness reviews.
Stakeholder Collaboration
  • Work closely with internal and external teams including network engineering, platform engineering, procurement, and vendors such as NVIDIA, Mellanox, Supermicro, Dell, or HPE.
  • Provide technical guidance to customers, partners, and cross-functional engineering teams.
  • Communicate complex architectural concepts clearly to both technical and non-technical audiences.
Documentation & Governance
  • Produce detailed architecture documents, diagrams, acceptance criteria, and operational runbooks.
  • Ensure security, compliance, and governance standards are built into the design.
  • Provide knowledge transfer (KT) and training sessions to internal teams where required.
Required Skills & Experience
Technical Expertise
  • Proven experience architecting and delivering NVIDIA GPU clusters at scale (AI/ML/HPC environments).
  • Strong hands‑on understanding of GPU interconnects (NVLink/NVSwitch) and DGX/HGX/SuperPod architectures.
  • Deep knowledge of InfiniBand and high-performance networking architectures.
  • Experience with cluster orchestration: Kubernetes, Slurm, PBS, or similar.
  • Familiarity with AI/ML workload requirements, CUDA, Docker/OCI containers, and NVIDIA software stacks (NCCL, CUDA Toolkit).
  • Comfort with Linux systems engineering, hardware validation, and troubleshooting across compute/network layers.
Soft Skills
  • Strong communication skills, with the ability to bridge engineering and business discussions.
  • Comfortable owning architecture decisions and delivering executive-ready documentation.
  • Ability to work autonomously while coordinating with multi-disciplinary teams.
  • Problem‑solver with strong critical‑thinking abilities and a delivery‑focused mindset.
Desirable Experience
  • Experience with hyperscaler-class deployments or multi‑megawatt datacenter environments.
  • Experience working with NVIDIA Base Command Manager or similar cluster management tooling.
  • Exposure to data pipelines, storage systems (Lustre, GPUDirect Storage, Ceph), or AI workflow platforms.
  • Certifications such as NVIDIA Certified Associate/Expert, Kubernetes certifications (CKA/CKS), or related vendor accreditations.
What We Offer
  • Hybrid working: 1 day per week in London
  • Opportunity to design next-generation high-performance GPU infrastructure
  • Exposure to cutting-edge AI compute at scale