Overview
IXL Cloud enables businesses, start-ups, researchers, and developers to train, deploy, and scale their AI systems with unmatched performance and flexibility.
We accelerate their AI journey by delivering leading GPU infrastructure, seamless scalability, and AI-first operational support, helping them bring advanced AI applications to fruition without the complexity of managing the underlying compute architecture.
Responsibilities
As a Cloud Infrastructure Engineer, you will:
- Design, deploy, and maintain scalable cloud infrastructure for GPU workloads using tools like Terraform, Ansible, and Kubernetes.
- Automate provisioning of compute resources across bare-metal and cloud environments.
- Manage container platforms (Docker) and orchestration (Kubernetes) for multi-tenant GPU cluster environments.
- Monitor infrastructure performance, uptime, and system health using observability tools (Prometheus, Grafana, ELK, etc.).
- Maintain and optimize storage, networking, and load balancing layers for high-throughput AI workloads.
- Implement CI/CD pipelines for both infrastructure and application-level changes.
- Collaborate with software engineers, platform teams, and AI researchers to understand workload needs and optimize system performance accordingly.
- Ensure infrastructure security, including secrets management, RBAC, and adherence to security best practices.
- Troubleshoot and resolve infrastructure incidents, scaling issues, and performance bottlenecks.
- Support hardware provisioning, firmware updates, and GPU driver/CUDA installations.
Qualifications
- 3–7 years of experience in DevOps, Site Reliability, or Infrastructure Engineering roles.
- Deep experience managing Linux systems in production environments.
- Experience deploying and managing Kubernetes clusters at scale (bare metal or cloud-native).
- Familiarity with NVIDIA GPU drivers, CUDA, and GPU workload optimization is a plus.
- Proficiency in scripting languages (Bash, Python, Go, etc.).
- Strong understanding of networking, firewalls, and storage systems in distributed compute environments.
- Experience with CI/CD tools such as GitLab CI, ArgoCD, Jenkins, or Flux.
- Excellent communication and documentation skills.