Principal Artificial Intelligence Engineer

BMO Financial Group

Toronto

On-site

CAD 143,000 - 268,000

Full time

Today

Job summary

A leading financial services provider in Toronto seeks a Principal Cloud AI Engineer to design and build cloud-native AI solutions. The role involves mentoring engineers, driving technical standards, and collaborating across teams to enhance enterprise AI capabilities. Ideal candidates have extensive experience with cloud systems and a strong foundation in software engineering and MLOps. A competitive salary and dynamic work environment are offered.

Benefits

Performance-based incentives
Discretionary bonuses
Health benefits

Qualifications

  • 7+ years building large-scale distributed cloud systems.
  • 5+ years hands-on with cloud, preferably Azure.
  • Strong background in AI/ML systems implementation.

Responsibilities

  • Design and implement robust CI/CD pipelines for AI/ML workloads.
  • Build and operate AI/ML cloud-native systems.
  • Lead complex discovery and solution design with stakeholders.

Skills

Cloud-native AI solutions
Software engineering in Python
ML/LLMOps
Designing and operating production ML/GenAI systems
Excellent communication

Education

Bachelor’s, Master’s, or PhD in Computer Science or related field

Tools

Azure
Kubernetes
Terraform
GitHub Actions

Job description

The Team

We accelerate BMO’s AI journey by building enterprise-grade, cloud-native AI solutions. Our team combines engineering excellence with cutting-edge AI to deliver scalable, secure, and responsible solutions that power business innovation across the bank.

The Impact

As a Principal Cloud AI Engineer, you are a hands-on technical developer who designs, builds, and scales cloud-native AI solutions and products. You help set engineering standards, establish patterns, mentor senior engineers, and partner with multiple teams to deliver resilient, governed, and cost-efficient AI at enterprise scale.

You will advance BMO’s Digital First strategy by:

  • Defining reference and production-grade solutions for AI/GenAI on cloud (Azure preferred; multi-cloud awareness a bonus).
  • Building reusable, secure, and observable components (APIs, SDKs, microservices, pipelines).
  • Operationalizing LLMs and RAG with strong controls and Responsible AI guardrails.
  • Driving platform roadmaps that enable faster delivery, lower risk, and measurable business outcomes.

What’s In It for You

  • Influence the technical direction of enterprise AI and the platform primitives others build on.
  • Ship high-impact systems used across many business lines and products.
  • Work across the full stack: cloud infra, data/feature pipelines, model serving, LLMOps, and DevSecOps.
  • Partner with a leadership team invested in your growth and thought leadership.

Responsibilities

Product Builder

  • Build and operate AI/ML cloud-native systems: frontend, backend, integration to other systems, feature stores, training/serving infra, vector databases, model registries, CI/CD, canary/blue-green, and GitOps for AI.
  • Technical cloud-native implementation of ML/LLM observability (latency, cost, drift, hallucination/guardrails, quality & safety metrics), logging/tracing (OpenTelemetry), and SLOs/SLIs for production AI systems.
  • Design and implement robust CI/CD pipelines for AI/ML workloads using GitHub Actions and Azure DevOps, including automated testing, model validation, security scanning, model versioning, and blue/green or canary deployments to ensure safe, repeatable, and auditable releases.
  • Drive FinOps for AI/GPU workloads (rightsizing, autoscaling, spot, caching, inference optimization).
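
The model-validation gate mentioned in the CI/CD bullet above can be sketched as a small release check: compare a candidate model's offline metrics against the production baseline and block promotion on regression. This is a minimal illustration with hypothetical metric names and thresholds, not BMO's actual pipeline.

```python
# Hypothetical CI release gate: fail the pipeline if a candidate model
# regresses against the current production baseline. Metric names and the
# regression tolerance are illustrative only.

def validate_candidate(candidate: dict, baseline: dict,
                       max_regression: float = 0.01) -> tuple[bool, list[str]]:
    """Return (ok, failures). Higher metric values are assumed better."""
    failures = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            failures.append(f"missing metric: {metric}")
        elif cand_value < base_value - max_regression:
            failures.append(
                f"{metric} regressed: {cand_value:.3f} < {base_value:.3f}")
    return (not failures, failures)

if __name__ == "__main__":
    baseline = {"auc": 0.91, "recall": 0.80}
    candidate = {"auc": 0.92, "recall": 0.78}
    ok, failures = validate_candidate(candidate, baseline)
    print(ok, failures)
```

In a GitHub Actions or Azure DevOps job, a non-zero exit from a check like this is what turns "model validation" into an enforced, auditable gate rather than a manual review step.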

Strategy

  • Help evolve the cloud AI reference design (networking, security, data, serving, observability) for ML/GenAI workloads (batch, streaming, online) with HA/DR, multi-region patterns, and cost efficiency.
  • Work on standards and best practices for containerization, microservices, serverless, event-driven design, and API management for AI systems.

GenAI & LLMOps

  • Architect RAG systems (chunking, embeddings, vector stores, grounding, evaluation) and guardrail frameworks (prompt/content safety, PII redaction, jailbreak & injection defenses).
  • Lead model serving (LLMs and traditional ML) using performant runtimes (e.g., TensorRT-LLM, vLLM, Triton/KServe) and caching strategies; optimize token usage, throughput, and cost.
  • Guide fine-tuning/PEFT/LoRA strategies, evaluation frameworks (offline/online A/B), and safety/quality scorecards; standardize prompt libraries and prompt engineering patterns.
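
The core retrieval loop of the RAG systems described above (chunking, embeddings, similarity search) can be sketched in a few lines. A production system would use a learned embedding model and a vector store such as FAISS, Milvus, or pgvector; the hashed bag-of-words embedding here is a self-contained stand-in for illustration only.

```python
# Toy RAG retrieval: chunk documents, embed them, rank by cosine similarity.
# The embedding is a hashed token-count vector, NOT a real embedding model.
import hashlib
import math

DIM = 64  # fixed embedding dimension for the toy hashing trick

def embed(text: str) -> list[float]:
    """Hash each token into a bucket of a fixed-size count vector."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size word windows (real systems chunk smarter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Grounding, evaluation, and guardrails then wrap this loop: the retrieved chunks are injected into the prompt, and the answer is scored against them.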

Security, Risk & Governance

  • Implement defense-in-depth: IAM least privilege, private networking, KMS/Key Vault, secrets management, image signing/SBOM, policy-as-code (OPA/Azure Policy), and data sovereignty controls.
  • Embed Responsible AI: model documentation, lineage, explainability, fairness testing, and human-in-the-loop patterns; align to model risk management and audit needs.
  • Ensure regulatory and privacy compliance (e.g., PII handling, encryption in transit/at rest, approved data sources, retention & residency).
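
The PII-handling controls above often start with a simple redaction pass before text is logged or sent to a model. The sketch below is deliberately minimal: the regex patterns are illustrative and non-exhaustive, and real deployments layer dedicated PII-detection services on top of checks like these.

```python
# Illustrative PII-redaction guardrail. Patterns are a minimal sketch,
# not a complete or production-grade PII detector.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched PII span with a labeled placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running the same pass on prompts, completions, and trace logs keeps redaction consistent across the places sensitive data can leak.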

Delivery & Operations

  • Lead complex discovery and solution design with stakeholders; build strong business cases (value, feasibility, ROI).
  • Oversee production readiness and operate platforms with SRE principles (SLOs, error budgets, incident response, chaos testing, playbooks).
  • Mentor engineers; multiply team impact via reusable components, templates, and inner-source.
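
The SLO and error-budget language above follows standard SRE arithmetic: an availability target over a window leaves a fixed budget of allowed downtime, and burn rate measures how fast the current error rate consumes it. The figures below are the generic definitions, not BMO-specific targets.

```python
# Standard SRE error-budget arithmetic (illustrative targets).

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for the window under the SLO."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(error_rate: float, slo: float) -> float:
    """How many whole budgets a sustained error_rate consumes per window."""
    return error_rate / (1.0 - slo)

if __name__ == "__main__":
    print(error_budget_minutes(0.999))  # 99.9% SLO -> ~43.2 min per 30 days
    print(burn_rate(0.01, 0.999))       # 1% errors -> 10x burn rate
```

A burn rate of 10 means the whole 30-day budget is gone in about 3 days, which is why burn-rate alerts, not raw error counts, drive incident response.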

Qualifications

Must Have

  • Bachelor’s, Master’s, or PhD in Computer Science, Engineering, Mathematics, or related field (or equivalent experience).
  • 7+ years building large-scale distributed cloud systems; 5+ years hands-on with cloud (Azure preferred; AWS/GCP nice to have).
  • Proven experience designing and operating production ML/GenAI systems (training, serving, monitoring) and shipping AI features at scale on cloud.
  • Strong software engineering in Python (and one of Go/Java/TypeScript); deep expertise with APIs, async patterns, and performance optimization.
  • Hands-on with MLOps/LLMOps: MLflow, KServe/Triton, Feast/feature stores, vector DBs (e.g., FAISS, Milvus, Pinecone, pgvector, Cosmos DB with vectors), orchestration (Airflow/Prefect), and CI/CD for ML (GitHub Actions/Azure DevOps).
  • Cloud-native stack: Kubernetes (AKS/EKS), containers, service mesh/ingress, serverless (Azure Functions/Lambda), IaC (Terraform/Bicep), secrets & key management, VNet/Private Link/peering.
  • GenAI production experience: RAG, evaluation, prompt engineering, fine-tuning/PEFT/LoRA, and integration with providers (e.g., Azure OpenAI/OpenAI, Anthropic, Google, open-source models via Hugging Face).
  • Excellent communication; ability to influence across engineering, product, security, and risk.

Nice to Have

  • GPU systems & inference optimization (CUDA/NCCL, TensorRT-LLM, vLLM, TGI); Ray/Spark/Databricks for distributed training/inference.
  • Observability: Prometheus/Grafana, OpenTelemetry, ML observability (e.g., WhyLabs, Arize), data quality (Great Expectations).
  • Event streaming and real-time systems (Kafka/Event Hubs), micro-batching, CQRS.
  • Search & knowledge systems (Elastic, OpenSearch, Knowledge Graphs).

Tech You’ll Use (Illustrative)

  • Cloud & Infra: Azure (AKS, Functions, App Service, Event Hubs, API Management, Key Vault, Private Link, Monitor), Terraform/Bicep, GitHub Actions/Azure DevOps.
  • AI/ML: Python, PyTorch, ONNX, MLflow, Hugging Face, LangChain/LangGraph, OpenAI/Azure OpenAI, Anthropic, vector DBs (FAISS/Milvus/Pinecone/pgvector/Cosmos DB vectors).
  • Serving & Ops: KServe/Triton, vLLM/TensorRT-LLM, Prometheus/Grafana, OpenTelemetry, Great Expectations, ArgoCD/GitOps, OPA/Azure Policy.
  • Data & Orchestration: Spark/Databricks, Ray, Airflow/Prefect, Kafka/Event Hubs, Feast/feature store patterns.

How You’ll Measure Success

  • Reliability & Performance: SLOs met for AI services (latency, availability, quality); scalable throughput and GPU/infra efficiency.
  • Security & Compliance: Zero critical findings; auditable lineage and model documentation; RAI controls consistently applied.
  • Developer Velocity: Time-to-first model and time-to-production reduced via reusable components and golden paths.
  • Business Impact: Clear ROI, adoption across lines of business, measurable customer/employee experience improvements.
  • Technical Leadership: Mentorship, architectural influence, and uplift across teams; strong cross-functional partnerships.

Notes

  • Additional responsibilities may be assigned based on your career growth ambitions and evolving enterprise needs.
  • This is an individual-contributor senior technical leadership role (Principal), driving impact through architecture, code, and influence rather than direct line management.

Salary: $103,200.00 - $192,000.00

Pay Type: Salaried

BMO Financial Group’s total compensation package will vary based on the pay type of the position and may include performance-based incentives, discretionary bonuses, as well as other perks and rewards.

About Us

At BMO we are driven by a shared Purpose: Boldly Grow the Good in business and life. It calls on us to create lasting, positive change for our customers, our communities and our people.

BMO is committed to an inclusive, equitable and accessible workplace. By learning from each other’s differences, we gain strength through our people and our perspectives.
