Role Overview
We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI/ML pipelines, cluster scheduling, and orchestration.
Key Responsibilities
- Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
- Manage and operate GPU orchestration tools and platforms such as:
  - Nvidia Base Command Manager (critical)
  - Nvidia AI Enterprise Suite
  - Nvidia GPU and Network Operators
  - Nvidia NIMs and Blueprints
- Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
  - Slurm (critical)
  - Vanilla Kubernetes
- Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
- Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI/ML workloads.
- Work with data scientists, ML engineers, and development teams to define infrastructure requirements, resource allocation, and deployment workflows.
- Develop automation scripts, CI/CD pipelines, and best practices for infrastructure provisioning and management.
- Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.
Required Skills & Experience
- Proven experience managing GPU-based AI/ML infrastructure and compute clusters.
- Hands-on experience with:
  - Nvidia Base Command Manager
  - Nvidia AI Enterprise Suite
  - Nvidia GPU/Network Operators, NIMs, Blueprints
- Strong experience with Slurm and/or Kubernetes orchestration.
- Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
- Strong scripting/automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
- Excellent troubleshooting and performance-tuning skills.
- Experience collaborating with ML/data science teams and integrating infrastructure with their workflows.
- Strong understanding of networking, security, resource allocation, and cluster management best practices.
Preferred Qualifications
- Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
- Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
- Experience with CI/CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
- Familiarity with workload scheduling, job queuing, resource quotas, and shared-GPU environments.