About Craft & Chaps
At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life's work: fast, joyfully, and without friction.
Chaps is our new AI-first product, focused on turning a constellation of large-language-model agents into a seamless personal productivity assistant.
About the role
Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You'll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.
In practice, that means:
- Designing automated evals that exercise complete agent workflows, catching regressions before they reach users.
- Instrumenting every prompt, tool-call, and model hop with rich telemetry so we can trace root causes in minutes, not days.
- Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.
- Future-proofing agentic systems so that quality keeps pace as the underlying models improve.
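To give a concrete flavor of the eval work above, here is a minimal sketch of a regression-gating harness. The agent interface, the toy cases, and the pass threshold are purely illustrative, not Chaps internals:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One end-to-end agent workflow with a check on the final answer."""
    name: str
    task: str
    check: Callable[[str], bool]

def run_suite(agent: Callable[[str], str], cases: list[EvalCase],
              pass_threshold: float = 0.9) -> bool:
    """Run every case and gate the release on the overall pass rate."""
    passed = sum(1 for c in cases if c.check(agent(c.task)))
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} passed ({rate:.0%})")
    return rate >= pass_threshold

# Toy stand-in agent and cases, for illustration only.
echo_agent = lambda task: task.upper()
cases = [
    EvalCase("shouts", "hello", lambda out: out == "HELLO"),
    EvalCase("keeps length", "abc", lambda out: len(out) == 3),
]
release_ok = run_suite(echo_agent, cases)
```

In practice each `check` would be a graded judgment (exact match, rubric, or model-as-judge), and a failing suite would block the deploy in CI.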
You will partner with product, research, and infra to ship an AI assistant users can trust: no surprises, no downtime.
What we're looking for
You must have:
- Hands-on experience with LLM evaluation frameworks (e.g., OpenAI Evals, LangSmith, lm-evaluation-harness) and a track record of turning eval results into product-ready gating.
- Observability chops: you've wired up tracing and metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter.
- Prompt-engineering fluency (few-shot prompting, function calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors.
- Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).
- A bias for experimentation: you automate A/B tests, cost-latency trade-off studies, and rollback safeguards as part of the dev cycle.
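The tracing work mentioned above can be sketched with a stdlib-only span recorder; the span names and attributes below are hypothetical, and a real system would export them through OpenTelemetry rather than an in-memory list:

```python
import time
import uuid
from contextlib import contextmanager

# Collected spans; in production these would be exported to a tracing backend.
SPANS: list[dict] = []

@contextmanager
def span(name: str, **attrs):
    """Record one traced operation, e.g. a prompt, tool call, or model hop."""
    record = {"id": uuid.uuid4().hex, "name": name, "attrs": attrs}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# Every hop is wrapped so a slow or failing tool call can be traced
# back to the prompt that triggered it.
with span("agent.plan", model="hypothetical-model"):
    with span("tool.search", query="release notes"):
        pass  # the actual tool call would run here
```

Because the inner span closes first, spans land in the list in completion order, which is how most trace exporters receive them.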
It would be great if you have:
- Experience scaling multi-agent planners or tool-using agents in real products.
- Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.
- A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests.
Our Culture
- Think differently. We value novel ideas over legacy playbooks, and we give you room to explore.
- People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.
- Pragmatic craftsmanship. We ship fast, but we measure twice-data accuracy, latency budgets, and reliability all matter.
- Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.
Join us if you want to make AI that works: every request, every time.