We are seeking a highly capable Infrastructure as Code (IaC) Engineer to lead the design, implementation, and management of automated infrastructure provisioning for high-performance AI data centers.
This role is central to orchestrating compute, network, storage, and virtualization layers using modern IaC tools across on-premises and hybrid cloud environments.
The ideal candidate will play a strategic role in enabling scalable and repeatable deployment pipelines that support GPU clusters, AI model training environments, and containerized platforms such as Kubernetes.

Responsibilities:
- Design and implement IaC frameworks to automate the provisioning and configuration of data center infrastructure for AI workloads.
- Orchestrate and manage multi-layer automation across compute (GPU/CPU), networking (VXLAN, EVPN, BGP), storage (NVMe, object, parallel file systems), and virtualization platforms (KVM, VMware, OpenShift).
- Develop reusable Terraform modules, Ansible playbooks, and YAML templates to define infrastructure in version-controlled environments.
- Automate deployment of Kubernetes clusters and integrate with GPU operators for training and inference pipelines.
- Build and maintain CI/CD pipelines to deploy, test, and manage infrastructure changes using tools like GitLab CI/CD, Jenkins, or ArgoCD.
- Integrate with monitoring and observability stacks (Prometheus, Grafana, DCGM) for automated infrastructure validation and health monitoring.
- Work closely with AI/ML platform teams to align infrastructure deployment with model training, data pipelines, and security policies.
- Ensure compliance with security and operational standards through policy-as-code and drift detection.

Skills & Experience:
- 5+ years of experience in infrastructure automation or SRE roles with hands-on IaC deployment.
- Proficiency in Terraform and Ansible, in scripting languages such as Python and Bash, and in YAML for configuration.
- Experience automating infrastructure in GPU-intensive environments supporting AI/ML workloads.
- Strong understanding of networking (VXLAN, EVPN, BGP, RoCE) and virtualization platforms (OpenShift, VMware, KVM).
- Familiarity with Kubernetes, Helm, Operators, and container orchestration frameworks.
- Exposure to storage automation for AI data lakes (e.g., Ceph, BeeGFS, Lustre, or S3-compatible storage).
- Experience with CI/CD tools (GitLab CI/CD, Jenkins, ArgoCD, Flux) in IaC.

Certifications:
- HashiCorp Certified: Terraform Associate
- Red Hat Certified Specialist in Ansible Automation
- CKA (Certified Kubernetes Administrator) or equivalent
- Cloud certifications (AWS, Azure, or GCP preferred for hybrid orchestration)