Principal AI/ML Infra and Ops Engineering - Remote
Responsibilities:
- Automation & DevOps: Implement automation across the infrastructure lifecycle, leveraging Infrastructure as Code (IaC) and DevOps principles to streamline deployment and management processes.
- Systems Monitoring & Performance Tuning: Develop and implement monitoring frameworks for infrastructure, identify areas for performance improvement, optimize systems, and ensure high availability.
- Continuous Support: Provide SRE support to geographically distributed users of the UAIS platform, respond to and triage support tickets, and liaise with customers.
- Disaster Recovery & Business Continuity: Design, test, and implement disaster recovery and business continuity plans to ensure minimal downtime and data integrity.
- Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
- Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML infrastructure.