MLOps Engineer
Onsite in Abu Dhabi, full relocation provided
Key Responsibilities
- Model Deployment: Oversee the deployment and scaling of large language models (LLMs) and other deep learning systems using modern inference engines such as vLLM, Triton, or TGI, with a focus on reliability and performance.
- Pipeline Engineering: Build and manage automated pipelines for model fine-tuning, evaluation, versioning, and continuous delivery using platforms like MLflow, Kubeflow, or comparable tooling.
- Infrastructure Management: Design and maintain cloud-native infrastructure for ML workloads, using services from major cloud providers (e.g., compute instances such as EC2, managed Kubernetes, serverless functions, managed ML services).
- Performance Optimization: Implement robust monitoring and logging to keep systems low-latency and highly available, and to confirm they meet production-grade performance targets.
- Cross-Functional Collaboration: Partner with data scientists, ML researchers, and software engineers to support experimentation workflows and ensure research-to-production continuity.
- DevOps & Automation: Create infrastructure-as-code (IaC) solutions and CI/CD pipelines for repeatable, secure deployments of ML systems.
- Model Optimization: Apply techniques such as quantization, pruning, and distributed inference to maximize performance while minimizing computational costs.
Qualifications
- Experience: 5+ years of hands-on experience in MLOps, ML infrastructure, or related engineering roles, with a strong track record of managing the full ML lifecycle.
- Deployment Expertise: Demonstrated experience deploying large-scale ML models with advanced inference and optimization practices.
- Cloud Infrastructure: Deep understanding of cloud platforms (preferably AWS, or an equivalent provider), including scalable architecture design and cost-efficient compute management.
- Programming: Proficient in Python, with experience in C/C++ for performance-critical applications.
- Tooling: Well-versed in MLOps tools such as MLflow, Kubeflow, or SageMaker Pipelines; strong working knowledge of Docker, Kubernetes, and distributed systems.
- Optimization: Familiarity with tools and frameworks for distributed training and inference, such as DeepSpeed, FSDP, or Accelerate.
- Education: Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Engineering, or a related discipline.