You will be part of a team designing and building a GenAI virtual agent to support customers and employees across multiple channels. You will build and run LLM-powered agentic experiences, owning their design, orchestration, MLOps, and continuous improvement.
Responsibilities
- Design & build client-specific GenAI/LLM virtual agents
- Enable the orchestration, management, and execution of AI-powered interactions through purpose-built AI agents
- Design, build, and maintain robust LLM-powered processing workflows
- Develop cutting-edge test suites that evaluate LLMs against bespoke performance metrics
- Implement CI/CD pipelines for ML/LLM workloads that automate the build, training, validation, and deployment of chatbots and agent services
- Use Infrastructure as Code (Terraform/CloudFormation) to provision scalable cloud environments for training and real-time inference
- Implement observability practices: monitoring, drift detection, hallucination mitigation, SLOs, and alerting for model and service health
- Serve models at scale in containerized, auto-scaling environments (e.g., Kubernetes) with low-latency inference
- Manage data & model versioning; maintain a central model registry with lineage and rollback capabilities
- Deliver a live performance dashboard (intent accuracy, latency, error rates) and establish a retraining strategy
- Collaborate closely with product, engineering, and client stakeholders to foster innovation around frameworks and models
Qualifications / Experience
- Relevant primary (bachelor's) degree; MSc or PhD preferred
- Proven expertise in mathematics and classical ML algorithms, and deep knowledge of LLMs (prompting, fine-tuning, RAG/tool use, evaluation)
- Hands-on experience with AWS and Azure data/ML services (e.g., Bedrock, SageMaker, Azure OpenAI, Azure ML)
- Strong engineering skills: Python, APIs, containers, Git; CI/CD (GitHub Actions, Azure DevOps); IaC (Terraform, CloudFormation)
- Experience with scalable serving infrastructure: containerized, auto-scaling environments (e.g., Kubernetes) for low-latency model serving
- Workflow automation across the machine learning lifecycle: data ingestion, preprocessing, model retraining, deployment
- Development of live performance dashboards displaying key metrics such as intent accuracy, response latency, and error rates
- Management of a centralized model registry with versioning, lineage, and rollback capabilities
- Automated retraining workflows and documentation for model updates
- Experience with Kubernetes, inference optimization, caching, vector stores, and model registries
- Strong communication skills, stakeholder management, and ability to produce clear technical documentation and runbooks
Personal Attributes
- Integrity; stakeholder and project management skills; familiarity with Agile methodologies; automation skills; and data visualization and analysis capabilities