The DevOps Engineer will play a mission‑critical role, owning the deployment, scalability, security, and reliability of AI systems and digital platforms. The role focuses strongly on LLM deployments, AI workloads, and cloud‑native infrastructure, ensuring that all AI and software systems operate with enterprise‑grade availability, performance, and compliance.
Key Responsibilities
CI/CD & Automation Engineering
- Design, build, and maintain CI/CD pipelines for AI models, LLM services, and software applications.
- Automate build, test, deployment, and environment configuration workflows to enable rapid and reliable releases.
AI & LLM Deployment Operations
- Deploy, operate, and scale AI systems, LLM APIs, inference workloads, and cloud‑based AI services.
- Ensure high availability, horizontal scalability, and low‑latency inference across all production environments.
Infrastructure, Reliability & Cost Optimization
- Monitor infrastructure performance, system health, and AI workloads using observability and monitoring tools.
- Optimize infrastructure for reliability, performance, and cloud cost efficiency.
Security, Compliance & Governance
- Implement and enforce security best practices, access controls, secrets management, and environment isolation.
- Ensure infrastructure and deployment processes align with national data governance, compliance, and cybersecurity standards.
Cross‑Functional Enablement
- Collaborate closely with AI Engineers, Full‑Stack Engineers, and Product teams to enable seamless, scalable deployments.
- Act as the primary technical owner for production reliability during mission‑critical deployments.
Documentation & Architecture Standards
- Maintain comprehensive documentation for DevOps workflows, system architecture, environments, and deployment standards.
- Ensure operational readiness, auditability, and knowledge transfer across teams.
Required Qualifications
- Minimum 5 years of hands‑on DevOps engineering experience in production environments.
- Mandatory: Proven experience deploying and operating AI systems and LLM‑based workloads in production.
- Strong hands‑on expertise with Docker, Kubernetes, CI/CD platforms, and cloud services.
- Experience with monitoring, observability, logging, and infrastructure‑as‑code (e.g., Terraform or similar tools).
- Strong understanding of networking, security, and cloud‑native architecture principles.
- Excellent troubleshooting and incident response capabilities in high‑availability systems.
Preferred Qualifications
- Experience with MLOps platforms such as MLflow, SageMaker, Vertex AI, or similar.
- Proven experience scaling AI and LLM applications in high‑traffic production environments.
- Exposure to AI model lifecycle management, retraining pipelines, and operational governance.
- Experience in government, regulated, or national‑scale enterprise environments.
KPIs & Deliverables
- Uptime, reliability, and stability of AI platforms and production systems.
- Deployment speed, automation maturity, and release reliability.
- Infrastructure performance, scalability, and cloud cost efficiency.
- Security posture and compliance readiness across all environments.
- Quality, completeness, and audit readiness of DevOps documentation and workflows.