Overview
This 6-month contract role as an AI Infrastructure & MLOps Engineer at Müller’s Solutions is primarily operations-focused (90%), with hands-on involvement in the implementation, configuration, and setup of AI infrastructure and MLOps workflows. You will play a key role in managing, operating, and guiding the deployment of a strategic AI environment, working closely with the customer as both a technical advisor and a hands-on engineer.
Responsibilities
- Operate and maintain AI infrastructure and MLOps platforms in a production environment.
- Monitor, manage, and troubleshoot Kubernetes-based AI workloads.
- Plan and execute acceptance testing for AI infrastructure and platforms.
- Ensure stability, performance, and availability of AI systems.
- Support day-to-day operational tasks across compute, storage, and networking layers.
- Install and configure NVIDIA Enterprise AI Stack (NVAI).
- Configure and manage MLOps platforms such as Kubeflow and MLflow.
- Assist in setting up end-to-end AI workflows, including data pipelines.
- Support the initial implementation phase of the AI environment.
- Act as a technical guide and advisor to the customer during the early stages of their AI adoption.
Qualifications
Technical Requirements
- AI / MLOps Stack: Proficiency with the NVIDIA Enterprise AI Stack.
- Familiarity with Ubuntu Linux.
- Experience with Kubernetes.
- Knowledge of Kubeflow / MLflow.
- Experience with QFLOW (an open-source AI data pipeline management tool).
- Programming & Automation: 4–6 years of practical experience with Python and Jupyter Notebook / JupyterLab; competence in writing, testing, and maintaining operational scripts and AI workflows.
- Infrastructure Experience: Practical experience with enterprise infrastructure, including Dell PowerScale (5 nodes), XE Server (1 node), Dell R570 Servers (5 nodes), Dell Network Switches (2 switches), and GPU-based AI servers in a small-scale environment.
- Environment Overview: Initial AI implementation in a compact configuration (1 GPU server, 1 PowerScale, 5 control plane servers), with the opportunity to shape best practices from the ground up.
- Nice-to-have: Familiarity with data processing frameworks such as Apache Spark or Hadoop.
- Nice-to-have: Understanding of ML model monitoring and logging practices to ensure system reliability.
- Nice-to-have: Experience with security best practices in AI systems.