Senior AI Infrastructure & Platform Engineer - Riyadh, KSA

DeepSource Technologies

Riyad Al Khabra

On-site

SAR 200,000 - 300,000

Full time

Today

Job summary

A leading technology firm in Al-Qassim Province is seeking a highly skilled Senior AI Infrastructure & Platform Engineer to manage and optimize GPU-based AI infrastructure. Responsibilities include deploying GPU clusters, managing orchestration tools, and collaborating with data science teams to ensure optimal performance. The ideal candidate is experienced in Nvidia tools and scripting for automation, with strong Linux administration skills. This role is pivotal for supporting high-performance workloads in AI applications.

Qualifications

Proven experience managing GPU-based AI / ML infrastructure and compute clusters.
Hands-on experience with Nvidia tools and orchestration.
Strong scripting and automation ability for deployment and maintenance.

Responsibilities

Deploy, maintain, and optimize GPU-based compute clusters.
Manage GPU orchestration tools and platforms.
Work with data scientists to define infrastructure requirements.

Skills

GPU-based AI / ML infrastructure management

Nvidia Base Command Manager

Nvidia AI Enterprise Suite

Slurm

Kubernetes orchestration

Linux system administration

Scripting (Bash, Python)

Performance tuning

Tools

Nvidia GPU / Network Operators

Canonical Ubuntu

Terraform

Ansible

Role Overview

We are seeking a highly skilled Senior AI Infrastructure & Platform Engineer to join our client’s team in Riyadh. In this role, you’ll be responsible for building, managing, and optimizing scalable AI infrastructure and compute environments that support high-performance workloads, including GPU-accelerated AI / ML pipelines, cluster scheduling, and orchestration.

Key Responsibilities

Deploy, maintain, and optimize GPU-based compute clusters and infrastructure.
Manage and operate GPU orchestration tools and platforms such as:
- Nvidia Base Command Manager (critical)
- Nvidia AI Enterprise Suite
- Nvidia GPU and Network Operators
- Nvidia NIMs and Blueprints
Configure, deploy, and maintain compute workloads using scheduling and orchestration tools including:
- Slurm (critical)
- Vanilla Kubernetes
Install, configure, and maintain the underlying OS (e.g. Canonical Ubuntu) and supporting system software.
Monitor and troubleshoot infrastructure performance, availability, and reliability; ensure high uptime for AI / ML workloads.
Work with data scientists, ML engineers, and dev teams to define infrastructure requirements, resource allocation, and deployment workflows.
Develop automation scripts, CI / CD pipelines, and best practices for infrastructure provisioning and management.
Document architecture, configurations, and operational procedures; enforce security, compliance, and backup policies.

Requirements

Required Skills & Experience

Proven experience managing GPU-based AI / ML infrastructure and compute clusters.
Hands-on experience with:
- Nvidia Base Command Manager
- Nvidia AI Enterprise Suite
- Nvidia GPU / Network Operators, NIMs, Blueprints
Strong experience with Slurm and/or Kubernetes orchestration.
Solid Linux system administration skills, preferably on Ubuntu or similar distributions.
Strong scripting and automation ability (e.g. Bash, Python, or relevant tooling) for provisioning, deployment, and maintenance.
Excellent troubleshooting and performance-tuning skills.
Experience collaborating with ML / data science teams and integrating infrastructure with their workflows.
Strong understanding of networking, security, resource allocation, and cluster management best practices.

Preferred Qualifications

Previous experience working in a high-performance computing (HPC) or AI-focused infrastructure team.
Knowledge of containerization, container orchestration, and GPUs in cloud or on-prem environments.
Experience with CI / CD, infrastructure-as-code (e.g. Terraform, Ansible), monitoring tools, and logging setups.
Familiarity with workload scheduling, job queuing, resource quotas, and GPU-shared environments.
