AI Agent Reliability Engineer - Chaps

Craft Docs Limited, Inc.

London

On-site

GBP 60,000 - 95,000

Full time

24 days ago

Job summary

Craft Docs Limited is seeking an engineer to enhance its AI product by focusing on multi-agent systems. The role involves designing evaluations, implementing observability tools, and ensuring AI assistants operate seamlessly. Ideal candidates will have strong experience with LLM frameworks and production coding skills in Python and TypeScript. Join us in creating AI that consistently delivers trust and reliability.

Qualifications

  • Hands-on experience with LLM evaluation frameworks required.
  • Strong skills in observability and distributed systems.
  • Fluent in prompt engineering and comfortable with CI/CD pipelines.

Responsibilities

  • Design automated evaluations for multi-agent workflows.
  • Implement telemetry for tracing and monitoring.
  • Create feedback loops using user input and logs.

Skills

LLM evaluation frameworks
Observability
Prompt engineering
Production-grade Python
Production-grade TypeScript
Experimentation

Tools

OpenTelemetry
Prometheus
Grafana
GitHub Actions
Terraform
Docker
Kubernetes

Job description

About Craft & Chaps

At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life's work: fast, joyfully, and without friction.

Chaps is our new AI-first product, focused on turning a constellation of large language model (LLM) agents into a seamless personal productivity assistant.

About the role

Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You'll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.

In practice, that means:
  • Designing automated evals that exercise complete agent workflows, catching regressions before they reach users.
  • Instrumenting every prompt, tool-call, and model hop with rich telemetry so we can trace root causes in minutes, not days.
  • Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.
  • Future-proofing agentic systems so that quality improves as the underlying LLMs become more capable.

You will partner with product, research, and infra to ship an AI assistant users can trust: no surprises, no downtime.

What we're looking for

You must have:
  • Hands-on experience with LLM evaluation frameworks (e.g., OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into product-ready gating.
  • Observability chops: you've wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter.
  • Prompt-engineering fluency (few-shot, function-calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors.
  • Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).
  • A bias for experimentation: you automate A/B tests, cost-latency trade-off studies, and rollback safeguards as part of the dev cycle.

It would be great if you have:
  • Experience scaling multi-agent planners or tool-using agents in real products.
  • Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.
  • A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests.

Our Culture

  • Think differently. We value novel ideas over legacy playbooks, and we give you room to explore.
  • People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.
  • Pragmatic craftsmanship. We ship fast, but we measure twice: data accuracy, latency budgets, and reliability all matter.
  • Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.

Join us if you want to make AI that works: every request, every time.