Infrastructure/GPU Engineer

Cognizant

Denver (CO)

Remote

USD 99,000 - 116,000

Full time

Yesterday

Job summary

A leading technology company is seeking a hands-on Infrastructure Engineer to design and deploy AI-optimized environments leveraging NVIDIA DGX systems. The ideal candidate will possess deep expertise in infrastructure deployment, workload orchestration, and performance optimization. This remote role offers a salary range of $99,000 to $116,000, depending on experience. Applicants are encouraged to apply before 10/21/2025.

Qualifications

  • Deep understanding of NVIDIA DGX architecture and GPU compute.
  • Strong Linux system administration skills and shell scripting expertise.
  • Experience with Slurm, parallel filesystems, and high-speed networking.

Responsibilities

  • Architect and deploy NVIDIA DGX systems and GPU-based compute clusters.
  • Configure and manage Slurm Workload Manager for job scheduling.
  • Implement system health checks and diagnostics across compute, storage, and network layers.

Skills

NVIDIA DGX architecture
Linux system administration
Slurm
High-speed networking (InfiniBand/RDMA/RoCE)
Shell scripting
Containerization (Docker)
Orchestration (Kubernetes)
Automation tools (Ansible, Redfish)

Tools

Terraform
PXE boot
Run:ai
ClearML

Job description
Overview

Cognizant is seeking a highly skilled, hands-on Infrastructure Engineer with proven experience in the physical and technical deployment of environments optimized for AI and machine learning workloads. This role focuses on NVIDIA DGX or similar systems, GPU-accelerated compute clusters, high-speed networking, and scalable storage solutions. The ideal candidate will have deep expertise in infrastructure design, deployment, workload orchestration, and performance optimization in enterprise environments.

This is a remote role in the US. The salary range for this role is $99,000 to $116,000, depending on the candidate's skills and qualifications. Applications will be accepted until 10/21/2025.

Key Responsibilities
System Design & Deployment
  • Help right-size GPU investment.
  • Architect and deploy NVIDIA DGX systems and GPU-based compute clusters.
  • Design and implement scalable parallel filesystems (e.g., Lustre, BeeGFS, GPFS).
  • Integrate high-speed RDMA interconnects using InfiniBand and RoCE.
  • Collaborate on rack planning and airflow optimization.
Cluster & Infrastructure Management
  • Configure and manage Slurm Workload Manager for job scheduling.
  • Deploy and maintain cluster orchestration tools.
  • Automate provisioning using PXE boot, Terraform, Redfish, and Kubernetes.
  • Perform firmware updates, BIOS/IPMI/BMC configuration, and OS provisioning.
  • Knowledge of Run:ai, ClearML, or a similar platform.
Networking & Performance Optimization
  • Design and validate network topologies including IPMI, internal/external networks, and InfiniBand fabrics.
  • Optimize RDMA and RoCE configurations for low-latency, high-throughput data transfers.
  • Conduct performance benchmarking and validation using gpu-burn, NCCL tests, and NVSM.
Monitoring & Troubleshooting
  • Implement system health checks and diagnostics across compute, storage, and network layers.
  • Troubleshoot hardware/software issues and ensure reliable infrastructure operation.
Required Skills & Qualifications
Technical Expertise
  • Deep understanding of NVIDIA DGX architecture, CUDA, and GPU compute.
  • Strong Linux system administration and shell scripting skills.
  • Experience with Slurm, parallel filesystems, and high-speed networking (InfiniBand/RDMA/RoCE).
  • Familiarity with containerization (Docker), orchestration (Kubernetes), and automation tools (Ansible, Redfish).
Preferred Qualifications
  • Experience with BBCM and DGX BasePOD/SuperPOD configuration.
  • Certifications from NVIDIA or an equivalent OEM.
