About the Role
Raydian Cloud is seeking a forward-thinking DevOps Engineer to help build and scale the infrastructure that powers cutting-edge AI workloads. You’ll work at the intersection of cloud-native technologies and AI operations (AIOps), enabling high-performance, secure, and automated environments for AI development and deployment. Your expertise in Infrastructure as Code and Kubernetes will be critical in supporting scalable AI pipelines and platform services.
Responsibilities
- Design and manage cloud infrastructure optimized for AI/ML workloads using Infrastructure as Code (Terraform, Pulumi, etc.)
- Deploy and maintain Kubernetes clusters tailored for GPU scheduling, distributed training, and inference workloads
- Build CI/CD pipelines for AI model training, validation, and deployment across environments
- Collaborate with data scientists and ML engineers to streamline model lifecycle management
- Implement observability and monitoring for AI services (e.g., Prometheus, Grafana, OpenTelemetry)
- Ensure infrastructure security, compliance, and cost-efficiency in multi-tenant AI environments
- Automate provisioning of AI-specific resources (e.g., GPU nodes, storage volumes, feature stores)
- Document infrastructure patterns, DevOps workflows, and platform architecture
Required Skills & Qualifications
- Strong experience with Kubernetes, including GPU scheduling and Helm
- Proficiency in Infrastructure as Code tools (Terraform, Pulumi, etc.)
- Familiarity with cloud platforms (AWS, Azure, GCP) and AI services (e.g., SageMaker, Vertex AI)
- Experience with CI/CD tools (GitHub Actions, GitLab CI, Argo Workflows)
- Scripting skills in Python, Bash, or Go
- Understanding of ML model lifecycle and data pipeline orchestration
- Excellent communication and collaboration skills across technical and business teams
Nice to Have
- Experience with Kubeflow, MLflow, or similar MLOps frameworks
- Knowledge of containerized model serving for AI workloads (e.g., TensorFlow Serving, Triton Inference Server)
- Familiarity with service mesh technologies (Istio, Linkerd) for AI microservices
- Certifications in Kubernetes or cloud platforms (e.g., CKA, AWS Certified DevOps Engineer - Professional)
Why Join Raydian Cloud?
- Shape the future of AI infrastructure and platform services
- Work with a visionary team blending deep tech and strategic execution
- Influence architecture decisions in a fast-moving AI startup environment
- Competitive compensation, flexible work culture, and growth opportunities