AI ML Lead Site Reliability Engineer

JPMorgan Chase & Co.

Glasgow

On-site

GBP 70,000 - 100,000

Full time

27 days ago

Job summary

Join a globally recognized firm as an AI ML Lead Site Reliability Engineer, where you will lead initiatives to enhance application reliability and mentor team members. This role offers the opportunity to influence technical practices and collaborate with product engineering teams to ensure high-performing AI/ML systems.

Qualifications

Practical experience applying site reliability engineering concepts.
Deep proficiency in reliability, scalability, performance, security, and enterprise system architecture.
Experience with observability tools and CI/CD tools.

Responsibilities

Champion site reliability culture and practices.
Lead initiatives to improve reliability and stability of applications.
Serve as the main contact during major incidents.

Skills

Reliability

Scalability

Performance

Security

Enterprise System Architecture

Python

Java Spring Boot

.NET

Observability Tools

CI/CD Tools

Containerization

Kubernetes

AWS

Education

Formal training or certification in site reliability engineering

Tools

Grafana

Dynatrace

Prometheus

Datadog

Splunk

Jenkins

GitLab

Terraform

Docker

ECS

Job Description

Take on a critical role in shaping the future of a globally recognized firm, with direct and significant impact in an environment built for top achievers in site reliability.

As an AI ML Lead Site Reliability Engineer at JPMorgan Chase within the AIML Data Platform Team, you will hold a leadership role, demonstrate strong knowledge across multiple technical domains, and advise others on technical and business issues. You will lead resiliency design reviews, break down complex problems for other engineers, act as a technical lead for medium to large-sized products, and mentor team members.

Job responsibilities

Demonstrate and champion site reliability culture and practices, exerting technical influence across your team.
Lead initiatives to improve the reliability and stability of applications and platforms using data-driven analytics to enhance service levels.
Collaborate with team members to identify service level indicators, define service level objectives, and establish error budgets with stakeholders.
Maintain high technical expertise in one or more domains, proactively resolving technology bottlenecks.
Serve as the main contact during major incidents, quickly identifying and resolving issues to prevent financial losses.
Partner with product engineering teams to ensure AI/ML systems are reliable and high-performing.
Develop observability, security, automation, and FinOps tools and orchestration solutions.
Provide strategic technology leadership by defining standards and architectures for reliability and automation frameworks.
Build strong cross-functional relationships to deliver effective solutions.
Debug and resolve issues in production, identify root causes, and implement remediation.
Participate in on-call rotations, incident management, and escalation workflows.

Required qualifications, capabilities, and skills

Formal training or certification in site reliability engineering concepts with practical experience.
Deep proficiency in reliability, scalability, performance, security, and enterprise system architecture, with the ability to implement best practices.
Proficiency in at least one programming language or framework, such as Python, Java Spring Boot, or .NET.
Deep knowledge of software applications and technical processes, with emerging expertise in specific technical disciplines.
Experience with observability tools like Grafana, Dynatrace, Prometheus, Datadog, and Splunk, including monitoring, SLO alerting, and telemetry collection.
Proficiency with CI/CD tools such as Jenkins, GitLab, and Terraform.
Experience with containerization and orchestration tools like Docker, Kubernetes, and ECS.
Expertise in SRE principles, application and infrastructure reliability, scalability, and performance.
Skill in programming with Python and Infrastructure as Code tools like Terraform.
Experience designing distributed systems and cloud-native architectures in AWS.
Self-motivated with a strong sense of ownership, urgency, and drive.

Preferred qualifications, capabilities, and skills

Experience in AI, ML, or Data engineering.
Expertise in Kubernetes and container orchestration.
Experience developing automation frameworks or AIOps solutions.
Experience building observability and telemetry tools.

About Us

J.P. Morgan is a global leader in financial services, providing strategic advice and products to prominent clients worldwide. We value diversity and inclusion, and are committed to equal opportunity employment.

About the Team

Our corporate functions support areas from finance and risk to human resources and marketing, ensuring our company's success and long-term partnerships with clients.
