Cloud Ops Engineer

Indotronix UK

Minnesota

Remote

USD 90,000 - 130,000

Full time

Today

Job summary

A leading tech company is seeking a skilled DevOps and AI Cloud Infrastructure Engineer to manage a GPU-based compute environment. The ideal candidate will have expertise in Linux system administration, cloud platforms, and GPU hardware management, with a focus on AI/ML workloads. Responsibilities include maintaining high availability, performance optimization, and troubleshooting issues in cloud infrastructure. Join us to work closely with architects and AI engineers on cutting-edge projects.

Qualifications

  • 3+ years of experience in DevOps or cloud infrastructure management.
  • At least 1 year working with GPU-based compute environments in the cloud.

Responsibilities

  • Provision, deploy, and maintain cloud infrastructure for AI workloads.
  • Administer Linux-based servers optimized for GPU workloads.
  • Diagnose and resolve issues related to GPU compute nodes.
  • Develop Infrastructure as Code (IaC) to automate resource management.
  • Build CI/CD pipelines using tools like GitHub Actions.
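The CI/CD responsibility above could be sketched roughly as follows. This is a hypothetical illustration, not a pipeline from the posting; the self-hosted GPU runner label and the test script path are placeholders.

```yaml
# Hypothetical GitHub Actions workflow for testing GPU-based servers.
# Runner labels and the smoke-test script are illustrative placeholders.
name: gpu-server-tests
on:
  push:
    branches: [main]
jobs:
  smoke-test:
    runs-on: [self-hosted, gpu]      # assumes a self-hosted runner with GPU access
    steps:
      - uses: actions/checkout@v4
      - name: Check GPU visibility
        run: nvidia-smi              # fails fast if the driver or GPU is absent
      - name: Run infrastructure smoke tests
        run: ./scripts/smoke-test.sh # placeholder test script
```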

Skills

Linux system administration
Cloud platforms
Containerization
Cluster computing
GPU hardware management
AI/ML workloads
High-performance computing (HPC)

Tools

Terraform
Ansible
GitHub Actions
Prometheus
Grafana
NVIDIA GPUs
CUDA
Slurm
PBS Pro

Job description

Overview

Location: US Remote

We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI/ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI/ML pipelines and GPU-based compute environments.

Responsibilities
  • Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean to support AI workloads.
  • Documentation: Maintain clear documentation for infrastructure setups and processes.
  • System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
  • GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.
  • Troubleshooting: Diagnose and resolve hardware and software issues related to GPU compute nodes and performance issues in GPU clusters.
  • High-Speed Interconnects: Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
  • Automation: Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.
  • CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments, using tools like GitHub Actions.
  • Containerization & Orchestration: Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
  • Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
  • Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.
  • Cluster Support: Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.
  • Scalability: Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.
  • Collaboration: Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.
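The IaC responsibility above might look something like this minimal Terraform sketch, assuming the DigitalOcean provider mentioned in the posting. The droplet size slug, region, and variable name are illustrative placeholders, not details from the role.

```hcl
# Hypothetical IaC sketch: a single GPU node on DigitalOcean.
# All concrete values below are placeholders.
terraform {
  required_providers {
    digitalocean = {
      source = "digitalocean/digitalocean"
    }
  }
}

variable "ssh_key_fingerprint" {
  type        = string
  description = "Fingerprint of an SSH key already uploaded to the account"
}

# One GPU droplet for AI workloads; the size slug is a placeholder.
resource "digitalocean_droplet" "gpu_node" {
  name     = "gpu-node-01"
  region   = "nyc2"               # placeholder region
  image    = "ubuntu-22-04-x64"
  size     = "gpu-h100x1-80gb"    # placeholder GPU size slug
  ssh_keys = [var.ssh_key_fingerprint]
}
```

In practice such a module would be run through the CI/CD pipeline (e.g. `terraform plan` on pull requests, `terraform apply` on merge) rather than applied by hand.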

Qualifications
  • Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.