Senior MLOps Engineer

Open Data Science

Dubai

On-site

AED 200,000 - 300,000

Full time

9 days ago

Job summary

A progressive AI startup in Dubai is seeking a Senior MLOps Engineer to design and operate ML infrastructure for large-scale GPU models. The ideal candidate will have hands-on experience in model serving, Kubernetes, and Python, and will contribute to high-performance deployment pipelines and scalable systems in a dynamic environment.

Benefits

Full relocation package

Qualifications

  • 2–3 years of experience with model serving frameworks like Triton or Ray Serve.
  • 3–4 years of experience with Kubernetes and infrastructure-as-code tools.
  • 4–5 years of software engineering experience in Python.

Responsibilities

  • Architect and maintain scalable ML infrastructure on AWS EKS.
  • Own end-to-end model deployment pipelines for various AI models.
  • Design cost-effective auto-scaling serving systems.

Skills

Model serving frameworks expertise
Kubernetes experience
Python programming

Tools

Terraform
Helm

Job description

Brief description of the vacancy

We are seeking a Senior MLOps Engineer with proven experience in deploying and managing large-scale ML infrastructure for LLMs, TTS, STT, Stable Diffusion, and other GPU-intensive models in production. You will lead the design and operation of cost-efficient, high-availability, and high-performance serving stacks in a Kubernetes-based AWS environment.

About the company

Company: Identity AI Labs

A fast-growing, well-funded AI startup in the UAE. The company's mission is to redefine how humans interact with AI through emotionally intelligent, relationship-focused technology.

Responsibilities

  • You will architect, deploy, and maintain scalable ML infrastructure on AWS EKS using Terraform and Helm.
  • You will own end-to-end model deployment pipelines for LLMs, diffusion models (LDM/Stable Diffusion), and other generative/AI models requiring high GPU throughput.
  • You will design cost-effective, auto-scaling serving systems using tools like Triton Inference Server, vLLM, Ray Serve, or similar model-serving frameworks.
  • You will build and maintain CI/CD pipelines integrating the ML model lifecycle (training → validation → packaging → deployment).
  • You will optimize GPU resource utilization and implement job orchestration with frameworks like KServe, Kubeflow, or custom workloads on EKS.
  • You will deploy and manage FluxCD (or ArgoCD) for GitOps-based deployment and environment promotion.
  • You will implement robust monitoring, logging, and alerting for model health and infrastructure performance (Prometheus, Grafana, Loki).
  • You will collaborate closely with ML Engineers and Software Engineers to ensure smooth integration, observability, and feedback loops.

Requirements

  • 2–3 years of experience with model serving frameworks such as Triton, vLLM, Ray Serve, TorchServe, or similar.
  • 2–3 years of experience deploying and optimizing LLMs and LDMs (e.g., Stable Diffusion) under high load with GPU-aware scaling.
  • 3–4 years of experience with Kubernetes (EKS) and infrastructure-as-code (Terraform, Helm).
  • 4–5 years of hands-on software engineering experience in Python, with production-grade experience across the ML model lifecycle.
  • Nice to have: familiarity with Go or Rust for backend or performance-critical systems.

Working conditions

Full-time position in the Dubai office, with official employment and a full relocation package.
