Site Reliability Engineer III, AI/ML Platform

JPMorgan Chase & Co.

Glasgow

On-site

GBP 60,000 - 100,000

Full time

15 days ago

Job summary

An established industry player is seeking a Site Reliability Engineer III to join their AIML Data Platform Team. In this pivotal role, you will tackle complex challenges and enhance the reliability and scalability of mission-critical systems. Your expertise in Python and Infrastructure as Code will be crucial in optimizing applications and cloud infrastructure. This forward-thinking company values innovation and teamwork, offering a collaborative environment where you can mentor junior engineers and make a significant impact. If you're driven, self-managed, and passionate about technology, this opportunity is perfect for you.

Qualifications

  • Practical experience in Site Reliability Engineering.
  • Strong expertise in reliability, scalability, and performance.

Responsibilities

  • Collaborate to enhance application availability and reliability.
  • Implement infrastructure as code for applications and platforms.

Skills

Site Reliability Engineering
Python Programming
Infrastructure as Code (Terraform)
Distributed Systems
Cloud-native Architecture (AWS)
Problem-solving
Communication Skills

Education

Formal Training in SRE
Certification in SRE Concepts

Tools

Terraform
Kubernetes
AI Ops Tools

Job description

There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.


As a Site Reliability Engineer III at JPMorgan Chase within the AIML Data Platform Team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure, independently decomposing and iteratively improving on existing solutions. You will be a significant contributor to your team, sharing your knowledge of the end-to-end operations, availability, reliability, and scalability of your application or platform.

Job responsibilities
  • Collaborate with other software engineers and teams to design, develop, test, and implement solutions that enhance availability, reliability, and scalability of applications.
  • Implement infrastructure, configuration, and network as code for applications and platforms within your scope.
  • Understand service level indicators and utilize service level objectives to proactively resolve issues before they impact customers.
  • Design and implement solutions to improve the reliability and scalability of AI/ML platforms and applications to meet growing demands.
  • Partner with product engineering teams to ensure AI/ML systems are reliable and high-performing.
  • Develop observability, security, automation, and FinOps tools and orchestration.
  • Build strong cross-functional relationships to foster engagement across the organization and deliver solutions to user problems.
  • Debug and resolve issues in a production environment, identify root causes, and remediate.
  • Participate in on-call rotations, incident management, and escalation workflows.
  • Take full ownership of problems, develop solutions, and acquire new knowledge to complete tasks.
  • Mentor and guide junior engineers.

Required qualifications, capabilities, and skills
  • Formal training or certification in Site Reliability Engineering concepts and practical experience.
  • Expertise in SRE principles, and the reliability, scalability, and performance of applications and infrastructure.
  • Proficiency in Python programming and Infrastructure as Code tools such as Terraform.
  • Experience with distributed systems and cloud-native architecture in AWS.
  • Strong problem-solving and troubleshooting skills in complex systems.
  • Excellent communication skills, capable of presenting technical and business concepts to stakeholders.
  • Self-managed, motivated, with a strong sense of ownership, urgency, and drive.

Preferred qualifications, capabilities, and skills
  • Experience working in AI, ML, or Data engineering.
  • Expertise in container orchestration/Kubernetes.
  • Experience developing automation frameworks/AI Ops.
  • Experience building observability and telemetry tools.

About Us

J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world's most prominent corporations, governments, wealthy individuals, and institutional investors. Our first-class business approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.

We recognize that our people are our strength, and the diverse talents they bring to our global workforce are directly linked to our success. We are an equal opportunity employer and value diversity and inclusion. We do not discriminate based on protected attributes, and we accommodate religious practices as well as mental health and physical disability needs. Visit our FAQs for more information about requesting accommodations.

About the Team

Our professionals in corporate functions cover areas from finance and risk to human resources and marketing. Our teams are essential in setting our businesses, clients, customers, and employees up for success.
