CoreAI Principal Customer Reliability Engineer

Keystone AI

Seattle (WA)

On-site

USD 235,000 - 280,000

Full time

4 days ago

Job summary

Keystone AI is seeking a Principal Customer Reliability Engineer to enhance CoreAI software performance across various environments. This pivotal role combines engineering leadership with operational excellence, requiring extensive experience in software engineering and observability tools. You'll ensure system resilience and customer trust while managing critical production environments.

Qualifications

  • 15+ years of experience in software engineering, SRE, or related fields.
  • Strong proficiency in Python, Go, or equivalent languages.
  • Experience with observability tools like Datadog and Grafana.

Responsibilities

  • Lead reliability and observability systems for CoreAI software.
  • Design telemetry and monitoring pipelines for real-time system health.
  • Establish incident management processes for continuous improvement.

Skills

Python
Go
Observability tools
Operational excellence

Job description

Keystone is a premier strategy, technology, and economics firm that delivers science-led AI solutions for Fortune 500 companies. We design, deploy, and operate machine learning software that automates and optimizes complex operational and commercial decisions.

Our CoreAI Solutions Group includes world-class AI and ML practitioners with unmatched experience implementing large-scale, high-impact models that help enterprises make smarter decisions across manufacturing, supply chain, sales, and marketing. We bring transformative ideas to life—and ensure they scale, endure, and deliver measurable value.

We have offices in Bellevue, WA; New York; Boston; San Francisco; and London.

Position Overview

Keystone is seeking a Principal Customer Reliability Engineer to take end-to-end ownership of how CoreAI software performs in the real world across SaaS, managed services, and customer-deployed environments. This role is central to our mission of using science-led AI to drive measurable impact for the world’s most complex organizations.

Reporting directly to the Vice President of Engineering, this is a hands-on, high-leverage role where you’ll design, build, and own the systems that ensure our products are resilient, observable, and delivering value at scale. As our first hire in this function, you won’t just shape the strategy—you’ll execute it. You’ll move with urgency, build with intent, and bring engineering clarity to how our systems behave in production.

You’ll be responsible for developing robust telemetry, monitoring, and alerting pipelines that provide real-time visibility into system health and long-term insights into model performance, stability, and drift. These systems will be critical to maintaining trust, reliability, and fairness in high-impact production environments.

Our ideal candidate thrives at the intersection of software engineering and customer empathy, brings a scientific mindset to operational excellence, and leverages every available tool—from AI-native development environments to open source observability stacks—to deliver durable, high-performance solutions.

You’ll work closely with a cross-functional team of scientists, engineers, economists, and strategists to establish the foundation of a scalable, ethical, and production-grade AI platform built to adapt to real-world complexity.

What You Will Do

  • Lead reliability, observability, and operational feedback systems for CoreAI software across SaaS, managed services, and customer-hosted environments.
  • Build from scratch the infrastructure that tracks uptime, latency, usage, errors, and model behavior, ensuring full visibility into live deployments.
  • Design telemetry, monitoring, and alerting pipelines to surface real-time system health and long-term trends in model stability, fairness, and drift.
  • Establish and operate incident management and correction-of-error (CoE) processes that promote transparency, learning, and continuous improvement.
  • Create dashboards that turn raw telemetry into actionable insights for engineering, science, and executive stakeholders.
  • Collaborate with platform, science, and deployment teams to embed reliability and observability standards into the development lifecycle.
  • Leverage LLM-based development tools and automation to reduce friction, accelerate delivery, and extend your own impact.
  • Codify operational playbooks and deployment patterns into repeatable practices that scale across customers.
  • Act as a cross-functional leader, bringing engineering truth to customer engagements, product planning, and executive decision-making.

The Ideal Candidate

  • Combines strategic leadership with tactical execution and is equally comfortable setting direction and writing code.
  • Brings a builder’s mindset and thrives in a zero-to-one environment with urgency, precision, and ownership.
  • Prioritizes customer outcomes and understands how system reliability translates to business impact and trust.
  • Leverages AI-native tools and automation to build smarter and faster.
  • Upholds a high bar for engineering rigor, ethical responsibility, and long-term maintainability in production systems.
  • Collaborates across disciplines and adapts quickly to change in a fast-moving, high-accountability environment.

Minimum Qualifications

  • 15+ years of experience in software engineering, SRE, DevOps, or platform operations
  • Proven ownership of production systems in mission-critical, customer-facing environments
  • Strong proficiency in Python, Go, or equivalent languages
  • Deep experience with observability tools (Datadog, Prometheus, Grafana, OpenTelemetry)
  • Familiarity with AWS and cloud-native services; experience with hybrid or customer-hosted environments preferred
  • Demonstrated success in leading incident response, root cause analysis, and CoE workflows
  • Experience in early-stage or high-velocity environments

Preferred Qualifications

  • Experience with model operations (MLOps) or ML observability systems
  • Exposure to fairness monitoring or ethical considerations in live AI systems
  • Familiarity with reinforcement learning, probabilistic models, or stochastic performance monitoring
  • Hands-on use of AI-native development tools (e.g., GitHub Copilot, Codeium)

US Salary Range: $235,000 - $280,000, plus an annual discretionary bonus, 401k contribution, and competitive benefits package. Actual compensation within the range will depend on the level the individual is hired into, based on their skills, experience, and qualifications.

At Keystone we believe diversity matters. At every level of our firm, we seek to advance and promote diversity, foster an inclusive culture, and ensure our colleagues have a deep sense of respect and belonging. If you are interested in growing your career with colleagues from varied backgrounds and cultures, consider Keystone Strategy.

Seniority level
  • Mid-Senior level
Employment type
  • Full-time
Job function
  • Engineering and Information Technology
Industries
  • Business Consulting and Services
