Brief description of the vacancy
We are seeking a Senior MLOps Engineer with proven experience in deploying and managing large-scale ML infrastructure for LLMs, TTS, STT, Stable Diffusion, and other GPU-intensive models in production. You will lead the design and operation of cost-efficient, high-availability, and high-performance serving stacks in a Kubernetes-based AWS environment.
About the company
Company: Identity AI Labs
A fast-growing and well-funded AI startup in the UAE. The company's mission is to redefine how humans interact with AI through emotionally intelligent, relationship-focused technology.
Responsibilities
- You will architect, deploy, and maintain scalable ML infrastructure on AWS EKS using Terraform and Helm.
- You will own end-to-end model deployment pipelines for LLMs, latent diffusion models (LDMs, e.g., Stable Diffusion), and other generative/AI models requiring high GPU throughput.
- You will design cost-effective, auto-scaling serving systems using tools like Triton Inference Server, vLLM, Ray Serve, or similar model-serving frameworks.
- You will build and maintain CI/CD pipelines integrating the ML model lifecycle (training → validation → packaging → deployment).
- You will optimize GPU resource utilization and implement job orchestration with frameworks like KServe, Kubeflow, or custom workloads on EKS.
- You will deploy and manage FluxCD (or ArgoCD) for GitOps-based deployment and environment promotion.
- You will implement robust monitoring, logging, and alerting for model health and infrastructure performance (Prometheus, Grafana, Loki).
- You will collaborate closely with ML Engineers and Software Engineers to ensure smooth integration, observability, and feedback loops.
Requirements
- 2–3 years of experience with model serving frameworks such as Triton, vLLM, Ray Serve, TorchServe, or similar.
- 2–3 years of experience deploying and optimizing LLMs and LDMs (e.g., Stable Diffusion) under high load with GPU-aware scaling.
- 3–4 years of experience with Kubernetes (EKS) and infrastructure-as-code (Terraform, Helm).
- 4–5 years of hands-on software engineering experience in Python, with production-grade experience across the ML model lifecycle.
- Nice to have: familiarity with Go or Rust for backend or performance-critical systems.
Working conditions
Full-time position in the Dubai office, with official employment and a full relocation package.