
DevOps / IT Infrastructure Engineer

MBR Partners

Dubai

On-site

AED 120,000 - 200,000

Full time

Today

Job summary

A leading tech company in Dubai seeks a skilled DevOps Engineer to manage on-prem infrastructure for AI and enterprise workloads. You will design and operate Kubernetes clusters, automate processes, and collaborate with interdisciplinary teams to optimize the infrastructure for AI velocity. The role requires strong DevOps skills and a solid understanding of CI/CD processes. Competitive salary and work visa support available.

Benefits

Flexible salary based on profile
Work visa support

Qualifications

  • 5+ years in DevOps/SRE/Platform Engineering with hands-on ownership of on-prem environments.
  • Proven experience operating Kubernetes in production.
  • Strong Linux administration, scripting (Bash/Python), and troubleshooting across the stack.

Responsibilities

  • Design and operate on-prem infrastructure as code.
  • Build and run Kubernetes for AI.
  • Administer servers, networks, and core services.

Skills

DevOps/SRE/Platform Engineering
Kubernetes
IaC and automation (Terraform, Ansible, Helm)
Linux administration
CI/CD expertise
Networking fundamentals
Observability implementation

Tools

GitLab CI
Prometheus

Job description

Company Overview

Our client is a young high-tech company incorporated in the heart of one of the world's fastest-growing tech hubs: Dubai, UAE. As the exclusive software partner to one of the world's largest ODMs in the networking equipment space, they develop the Network Operating Systems that power critical data centre and telecom routing and switching infrastructure. Building on this foundation, they have recently launched an AI division focused on designing their own chips to accelerate inference and training workloads.

What sets them apart is their unique position at the centre of a historic development: their ODM partner is establishing the first networking equipment factory of its kind in the GCC region, and they are the software engine driving this groundbreaking initiative. They are not just building technology; they are building a true networking vendor that serves regional interests while meeting the growing demand for networking equipment across the MENA region and beyond.

Their long-term vision extends beyond products to people: creating a thriving ecosystem for embedded systems and ASIC design talent that will produce generations of world-class professionals, establishing the region as a global centre of excellence for Enterprise Compute innovation. As a rapidly growing company at the forefront of AI hardware innovation, they are constantly seeking talented and motivated individuals to join their team. They offer a dynamic and challenging work environment, with opportunities to make a significant impact on the future of AI technology.

Your Mission

Own the end-to-end design and operation of our on-premises infrastructure for AI and enterprise workloads, built as code: automated, observable, and secure. You will architect and run Kubernetes clusters for training and inference; manage servers, networks, and core services; and enable developers with reliable CI/CD and platform tooling. This is a role where minutes of downtime, time-to-recovery, and cost-per-job directly impact AI velocity at scale.

Responsibilities

  • Design and operate on‑prem infrastructure as code: author reusable Terraform/Ansible/Helm modules; build GitOps workflows (e.g., Argo CD) for repeatable, audited changes across environments.
  • Build and run Kubernetes for AI: configure multi‑tenant GPU clusters (MIG/GPUDirect RDMA, NVIDIA device plugins/DCGM), scheduling/quotas, HPA/Cluster Autoscaler (where applicable), and workload isolation.
  • Administer servers, networks, and core services: OS lifecycle (Linux), identity/SSO (Keycloak/LDAP), secrets (Vault), DNS/DHCP/NTP, artifact registries, and internal package mirrors.
  • Provide storage for AI pipelines: integrate and operate high‑bandwidth/low‑latency storage, tune for dataset staging and checkpointing patterns.
  • Enable CI/CD: partner with developers to design fast, reproducible pipelines (GitLab CI/GitHub Actions), caching and runners on GPU/CPU nodes, artifact provenance (SBOM, SLSA).
  • Collaborate with Platform and ML engineers running training/inference at scale, silicon and systems teams integrating hardware in the lab, security engineers safeguarding credentials and supply chain, application developers delivering services via CI/CD, and site ops supporting data centre deployments — together, we turn infrastructure into a product that accelerates the business.
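
The GitOps workflow mentioned above (e.g., Argo CD) can be sketched with a minimal Application manifest. This is an illustrative example only; the application name, repository URL, chart path, and namespaces are placeholders, not details from this posting:

```yaml
# Hypothetical Argo CD Application: continuously syncs a Helm chart
# from Git into a target namespace, so every infrastructure change is
# reviewed in a merge request and auditable from Git history.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-training-platform          # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/platform.git  # placeholder repo
    targetRevision: main
    path: charts/gpu-training
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-training
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert out-of-band changes made outside Git
    syncOptions:
      - CreateNamespace=true
```

With `automated.selfHeal` enabled, manual drift in the cluster is reverted to the Git-declared state, which is what makes changes "repeatable and audited" in practice.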

Minimum Qualifications

  • 5+ years in DevOps/SRE/Platform Engineering with hands‑on ownership of on‑prem environments.
  • Proven experience operating Kubernetes in production (multi‑tenant RBAC, networking/CNI, storage, ingress, monitoring).
  • Proficiency with IaC and automation (Terraform, Ansible, Helm; GitOps with Argo CD/Flux).
  • Strong Linux administration, scripting (Bash/Python), and troubleshooting across the stack (compute, network, storage).
  • CI/CD expertise (GitLab CI/GitHub Actions), container build security (SBOM, image signing), and artifact management.
  • Solid networking fundamentals (L2/L3, routing, BGP, VLANs, EVPN/VXLAN, load balancing, TLS/mTLS).
  • Experience implementing observability (Prometheus/Grafana, logs, tracing) and running incident response.
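
Observability and incident response of the kind described above typically rest on alert rules like the following minimal Prometheus sketch. The job label, duration, and severity are illustrative assumptions, not the client's actual configuration:

```yaml
# Hypothetical Prometheus alerting rule: fire when a node's DCGM
# exporter (GPU metrics) stops reporting, a common on-prem failure
# mode that silently hides GPU health issues.
groups:
  - name: gpu-node-health
    rules:
      - alert: GPUMetricsExporterDown
        expr: up{job="nvidia-dcgm-exporter"} == 0   # scrape target unreachable
        for: 5m                                     # tolerate brief restarts
        labels:
          severity: critical
        annotations:
          summary: "DCGM exporter down on {{ $labels.instance }}"
```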

Preferred (Nice‑to‑Haves)

  • GPU cluster operations for AI (NVIDIA drivers/operator, DCGM, MIG, GPUDirect RDMA, Slurm integration).
  • Storage for data‑intensive workloads (Ceph, parallel filesystems, NVMe‑oF) and performance tuning.
  • Secrets/identity platforms (Vault, Keycloak/LDAP/SSO), policy‑as‑code (OPA/Gatekeeper, Kyverno).
  • Security/compliance practices (CIS benchmarks, SLSA, supply‑chain scanning) and zero‑trust networking.
  • Data centre experience (rack/stack, power/cooling basics) and remote site rollout automation.
  • Familiarity with configuration management for network devices and API‑driven switches/routers.

Key Outcomes & Impact

  • Reproducible environments by default: any engineer can spin up an identical dev/test stack (K8s namespace, storage, secrets, runners) from Git in ≤30 minutes, with audit trails for every change.
  • Solid CI/CD for AI workflows: model/build/test pipelines are deterministic and cache‑efficient; median pipeline time down 30–50%, with artifact provenance (SBOM, signatures) and traceable datasets/checkpoints.
  • Predictable GPU orchestration: fair‑share scheduling, quotas, and isolation (MIG/namespace policies) keep queues short; cluster utilization increases >20% without starving latency‑sensitive jobs.
  • Lab‑to‑cluster continuity: hardware bring‑up images, drivers, and firmware are versioned and promoted through the same pipelines; new boards/nodes join clusters with push‑button automation.
  • Actionable observability: dashboards and alerts reflect SLOs meaningful to researchers (throughput, time‑to‑first‑token, I/O wait, GPU mem pressure); MTTR <30 minutes for priority services.
  • Cost & toil reduction: infra tasks automated to eliminate recurring manual work; fewer “custom one‑offs,” more reusable modules; quarterly infra spend per GPU hour trends down.
  • Clear docs & self‑service: engineers rely on concise runbooks and service catalogues; >80% of routine requests resolved via self‑service workflows rather than ad‑hoc ops support.
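
The quota-and-isolation pattern behind "predictable GPU orchestration" can be sketched with a per-team Kubernetes ResourceQuota. The namespace and limits here are hypothetical, chosen only to illustrate the mechanism:

```yaml
# Hypothetical per-tenant quota: caps the GPUs a team can request in
# its namespace, so one tenant cannot starve others in a shared cluster.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-gpu-quota
  namespace: team-a           # placeholder tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested concurrently
    limits.nvidia.com/gpu: "8"
```

Pods in `team-a` that would exceed the quota are rejected at admission time, which keeps queues bounded without manual policing.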

The client can arrange work visas for Dubai. Salary is flexible and depends on the candidate's profile.
