
Use Promptfoo for eval-driven MCP tool design

ADR-0064 ACCEPTED · 2026-04-11

Context

The MCP server exposes 14 tools for trip planning. An LLM client reads tool names, descriptions, and parameter schemas, then decides which tool to call. Tool descriptions are UX for language models — if the description is ambiguous or the parameter burden is too high, the model picks the wrong tool. We had no way to measure this systematically or know which models worked well with our API surface.

The problem is analogous to browser compatibility testing: the server is fixed, the clients vary wildly, and you need a matrix showing what works where.

Decision

We use Promptfoo (CLI-based, YAML-configured LLM eval framework) to test tool selection across models. The eval suite lives in evals/ and is version-controlled alongside the server code.

The suite contains 100 scenarios across 9 files, organized by the six JTBD lifecycle moments (#141--#146): vague idea, narrowing down, booking, something changed, sharing, on-trip. It also covers multi-turn complex workflows and version comprehension. Each scenario provides a natural-language user prompt and asserts that the model calls the correct tool(s). Multi-turn scenarios chain tool calls to test sequencing (e.g., "save before changing" requires create_version then add_destination).
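A single scenario in this style might look like the following sketch. The file name, provider id, and assertion shape are illustrative, not copied from the actual suite; in particular, the assumption that tool calls surface in the provider output (so a javascript assertion can check for the tool name) should be verified against Promptfoo's docs for the provider in use.

```yaml
# evals/save-flow.yaml -- hypothetical file name; contents are a sketch
prompts:
  - '{{query}}'
providers:
  - anthropic:messages:claude-haiku-4-5   # provider id is an assumption
tests:
  - description: '"save this" must route to create_version'
    vars:
      query: Save this itinerary before I touch anything
    assert:
      # assumption: the harness exposes tool calls in the output text
      - type: javascript
        value: output.includes('create_version')
```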

The model compatibility matrix, tested against the real server instructions from server.rs:

Model                Accuracy
Claude Haiku 4.5     100% (91/91)
Claude Sonnet 4.6     99% (90/91)
GPT-5.4-mini          99% (90/91)
GPT-5.4 (full)        98% (89/91)
GPT-5.4-nano          95% (86/91)

Why Promptfoo over Langfuse: The project already has OTEL + Tempo for production observability. Langfuse would duplicate the tracing infrastructure. Promptfoo is YAML + CLI, runs locally, diffs between experiments, and lives in git. It was acquired by OpenAI in March 2026 but remains MIT licensed.

Consequences

The evals directly shaped the API surface. Small models pick tools by description text and parameter count, not system prompt rules. Specific changes driven by eval failures:

  • Made create_version's name parameter optional -- nano models failed when forced to invent a name for "save this"
  • Switched version tool descriptions from "save slot" language to "git branch" language -- models understood branching semantics better than slot semantics
  • Added explicit "Use this when the user says save" to create_version's description -- without it, nano models called list_trips for save requests
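Taken together, the three bullets amount to small edits to one MCP tool definition. A sketch of what the adjusted create_version schema might look like (the exact wording and field layout are illustrative; only the optional name, the git-branch framing, and the explicit "save" trigger reflect the changes described above):

```json
{
  "name": "create_version",
  "description": "Create a new version of the trip, like a git branch. Use this when the user says save.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string",
        "description": "Optional version name; the server generates one if omitted"
      }
    },
    "required": []
  }
}
```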

The iteration exposed a failure mode. When eval pass rates were low, I asked Claude to fix the system prompt. Claude rewrote it as a hardcoded lookup table mapping trigger phrases to tool names -- essentially overfitting to the eval scenarios. Pass rates jumped to 99%, but the prompt was useless for any input not in the test set. I caught this and forced a reset: the eval system prompt was synced back to the actual server.rs instructions, and the lookup table was removed.

The commit history tells the story: "evals: fix save->create_version failures by rewriting system prompt as lookup table" followed by "Sync eval system prompt to server.rs instructions, remove lookup table". Pass rates held at 95-100% on the real prompt after tool description improvements -- the lookup table was never needed.

The eval suite is a regression gate: change a tool description or the server instructions field, run the matrix, and see what breaks before deploy.
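A minimal sketch of how such a gate could work in CI, assuming the eval run is exported to a JSON file of per-test outcomes. The `[{"success": true}, ...]` shape used here is an assumption for illustration -- the real shape of Promptfoo's output file should be checked against its documentation.

```python
import json

# Regression-gate sketch. Assumes a results file shaped as a JSON list of
# per-test outcomes, e.g. [{"success": true}, ...] -- an assumption, not
# promptfoo's documented schema.
THRESHOLD = 0.95  # matches the observed 95-100% band on the real prompt


def pass_rate(results_path: str) -> float:
    """Fraction of eval scenarios that passed."""
    with open(results_path) as f:
        outcomes = json.load(f)
    passed = sum(1 for t in outcomes if t.get("success"))
    return passed / len(outcomes)


def gate(results_path: str) -> bool:
    """Return True if the deploy may proceed."""
    return pass_rate(results_path) >= THRESHOLD
```

In CI this would run after the eval step and fail the pipeline when `gate()` returns False, so a tool-description change that silently breaks small models never reaches deploy.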