Client: AI71
Location: London, United Kingdom
Job Category: Other
EU work permit required: Yes
Posted: 12.05.2025
Expiry Date: 26.06.2025
Job Description:
Location: Abu Dhabi, UAE (Full Relocation Provided)
Company: AI71
About Us
AI71 is an applied research team committed to building responsible and impactful AI agents that empower knowledge workers. In partnership with the Technology Innovation Institute (TII), we drive innovation through cutting-edge AI research and development. Our mission is to translate breakthroughs in machine learning into transformative products that reshape industries.
AI71 is seeking a Senior MLOps Engineer to lead the development and management of the infrastructure used to train, deploy, and maintain ML models. The role is critical to operationalizing state-of-the-art systems and ensuring high-performance delivery across research and production environments.
The successful candidate will be responsible for designing and implementing infrastructure to support efficient model deployment, inference, monitoring, and retraining. This includes close collaboration with cross-functional teams to integrate machine learning models into scalable and secure production pipelines, enabling the delivery of real-time, data-driven solutions across various domains.
Key Responsibilities
- Model Deployment: Lead the deployment and scaling of LLMs and other deep learning models using inference engines such as vLLM, Triton, or TGI, ensuring optimal performance and reliability.
- Pipeline Engineering: Design and maintain automated pipelines for model fine-tuning, evaluation, versioning, and continuous delivery using tools such as MLflow, SageMaker Pipelines, or Kubeflow.
- Infrastructure Management: Architect and manage cloud-native, cost-effective infrastructure for machine learning workloads using AWS (SageMaker, EC2, EKS, Lambda) or equivalent platforms.
- Performance Optimization: Implement monitoring, logging, and optimization strategies to meet latency, throughput, and availability requirements across ML services.
- Collaboration: Work closely with ML researchers, data scientists, and engineers to support experimentation workflows, streamline deployment, and translate research prototypes into production-ready solutions.
- Automation & DevOps: Develop infrastructure-as-code (IaC) solutions to support repeatable, secure deployments and CI/CD for ML systems.
- Model Efficiency: Apply model optimization techniques such as quantization, pruning, and multi-GPU/distributed inference to enhance system performance.
Qualifications
- Professional Experience: Minimum 5 years of experience in MLOps, ML infrastructure, or machine learning engineering, with a strong record of managing end-to-end ML model lifecycles.
- Deployment Expertise: Proven experience in deploying large-scale models in production environments with advanced inference techniques.
- Cloud Proficiency: In-depth expertise in cloud services (preferably AWS), including infrastructure management, scaling, and cost optimization for ML workloads.
- Programming Skills: Strong proficiency in Python, with experience in C/C++ for performance-sensitive applications.
- Tooling Knowledge: Proficiency in MLOps frameworks such as MLflow, Kubeflow, or SageMaker Pipelines; familiarity with Docker and Kubernetes.
- Optimization Techniques: Hands-on experience with model performance optimization techniques and distributed training frameworks (e.g., DeepSpeed, FSDP, Accelerate).
- Educational Background: Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Engineering, or a related technical field.
Why Join AI71?
- Advanced Technology Stack: Work with some of the most capable large language models and cutting-edge ML infrastructure.
- High-Impact Work: Contribute directly to the deployment of AI solutions that deliver measurable business value across industries.
- Collaboration-Driven Environment: Engage with a high-performing, interdisciplinary team focused on continuous innovation.
- Robust Infrastructure: Access high-performance compute resources to support experimentation and scalable deployment.
- Relocation Package: Full support for relocation to Abu Dhabi, with a competitive compensation package and lifestyle benefits.