Ground-truth leakage in agentic evals

  • LLM evals
  • RAG
  • agents
  • open source

Why agentic evals leak by default

A classic ML benchmark hands the model a fixed input. You control exactly what it sees, so leakage is a dataset-curation problem you can eyeball.

Agentic evals break that assumption. The model fetches its own context: it calls a tool, the tool returns a record, and the answer you're testing for often rides along inside that record. You're no longer grading “can the model derive X from content.” You're grading “can the model copy X out of a field it was handed for free.”

Here's the trap with a synthetic example (the same one I use in the open-source skill below). Say the eval claims to measure: can the model derive a ticket's priority from its content? The agent calls get_ticket, which returns:

{
  "summary": "Login broken after deploy",
  "description": "Users can't sign in since 14:00 ...",
  "priority": "High",            // <- the answer, handed over directly
  "sla": "P1",                   // <- set together with priority; implies it
  "aiCategory": "Billing/High",  // <- an earlier classifier already guessed it
  "comments": [{ "body": "we agreed this is high priority" }] // <- in prose
}

Four different ways the answer reaches the model, and only the first is obvious.

The four leak channels

The mental test for every field the model sees is one sentence: could the model get the target from this field without reasoning over the content?

If yes, it's a leak. There are four channels it travels through:

Channel               What it is
Direct (obvious)      The target value itself.
Correlated / derived  A different field set alongside the target, or computed from it.
Upstream output       An earlier pipeline stage's prediction, still attached to the record.
Free-text mention     The answer stated in prose, in every format the content exists in.

Everyone catches Direct. The other three are where evals quietly inflate:

  • Correlated is domain knowledge, not pattern matching. You have to know that your CS team always sets sla: P1 next to priority: High. No generic scanner finds this for you. You enumerate the co-varying values by hand.
  • Upstream output is the nastiest in a mature system: you've already got a classifier writing predictions onto your records, and now your “fresh” eval context carries last month's model output. The answer string might be absent and it still leaks, because a paraphrase implies it.
  • Free-text mention has a structural trap. You redact the markdown body, feel safe, and miss that the same text still sits in the parsed rich-text tree (ADF/JSON) the model also receives. Clean one representation, forget the other.
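The free-text trap is easy to reproduce. Here is a minimal sketch, assuming a simplified ADF-like node shape (`type`, optional `text`, optional `content`); `RichTextNode`, `collectText`, and `representationsMentioning` are hypothetical names for illustration, not part of any library. It checks both representations of the same content and reports which ones still carry the phrase.

```typescript
// Simplified rich-text node: text nodes carry `text`, containers carry `content`.
// This shape is an assumption modeled loosely on ADF, not a full schema.
interface RichTextNode {
  type: string;
  text?: string;
  content?: RichTextNode[];
}

// Collect every text run in the tree, depth-first.
function collectText(node: RichTextNode): string[] {
  const runs: string[] = node.text !== undefined ? [node.text] : [];
  for (const child of node.content ?? []) {
    runs.push(...collectText(child));
  }
  return runs;
}

// Hypothetical helper: which representations still mention the phrase?
function representationsMentioning(
  phrase: string,
  markdownBody: string,
  richBody: RichTextNode,
): string[] {
  const needle = phrase.toLowerCase();
  const normalize = (s: string) => s.toLowerCase().replace(/\s+/g, " ");
  const hits: string[] = [];
  if (normalize(markdownBody).includes(needle)) hits.push("markdown");
  if (normalize(collectText(richBody).join(" ")).includes(needle)) hits.push("richText");
  return hits;
}

// The markdown body was redacted, but the parsed tree was forgotten.
const redactedMarkdown = "Users can't sign in since 14:00. [redacted]";
const untouchedTree: RichTextNode = {
  type: "doc",
  content: [
    { type: "paragraph", content: [{ type: "text", text: "Users can't sign in since 14:00." }] },
    { type: "paragraph", content: [{ type: "text", text: "we agreed this is high priority" }] },
  ],
};

const stillLeaking = representationsMentioning("high priority", redactedMarkdown, untouchedTree);
// stillLeaking is ["richText"]: one representation clean, the other not.
```

Joining text runs with a space is a simplification; a phrase split across formatting marks would need more careful handling.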

Two checks, because one isn't enough

Most people, once they suspect a leak, sanitize and watch the eval score drop. That's necessary but not proof. A score can drop from noise, prompt changes, or you closing only half the channels. I use two complementary checks instead.

1. A deterministic static scan (no model calls). For every dataset item, scan the exact projection the model reads: the flattened context and the structured object, walked recursively so you get a JSON path for each hit. It's cheap, repeatable, and tells you which channel in which field leaks. The reusable core is ~30 lines:

import { auditItem } from "leak-audit/leak_scan";

const result = auditItem({
  expected: "High",
  fields: ticketReturnedByTool,             // walked recursively, reports paths
  correlatedValues: ["P1", "Billing/High"], // domain knowledge, you supply these
});
// → { leaked: true, findings: [{ channel: "direct", location: "priority" }, ...] }
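To make “walked recursively, reports paths” concrete, here is a rough sketch of such a walk. It is not the skill's implementation: `Json`, `Hit`, and `findStringHits` are hypothetical names, and the matcher passed in below is a deliberately naive substring check.

```typescript
// Any JSON-like value a tool might return.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface Hit {
  path: string;  // e.g. "comments[0].body"
  value: string; // the string leaf that matched
}

// Walk the value depth-first and report every string leaf the predicate flags.
function findStringHits(value: Json, matches: (text: string) => boolean, path = ""): Hit[] {
  if (typeof value === "string") {
    return matches(value) ? [{ path, value }] : [];
  }
  const hits: Hit[] = [];
  if (Array.isArray(value)) {
    value.forEach((child, index) => {
      hits.push(...findStringHits(child, matches, `${path}[${index}]`));
    });
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      hits.push(...findStringHits(child, matches, path ? `${path}.${key}` : key));
    }
  }
  return hits; // numbers, booleans, and null never match
}

const ticket: Json = {
  summary: "Login broken after deploy",
  description: "Users can't sign in since 14:00 ...",
  priority: "High",
  sla: "P1",
  aiCategory: "Billing/High",
  comments: [{ body: "we agreed this is high priority" }],
};

const hitPaths = findStringHits(ticket, (text) => text.toLowerCase().includes("high"))
  .map((hit) => hit.path);
// hitPaths is ["priority", "aiCategory", "comments[0].body"]
```

Note that `sla` is not in the result. A scan for the expected string alone cannot see a correlated value like "P1", which is why those have to be supplied by hand.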

One detail that matters more than it looks: whole-token matching for short labels. If your answer is a size M, a naive case-insensitive substring search flags “Medium,” “management,” and “format”: false positives everywhere. Single-token answers must match on token boundaries; only multi-word answers match as phrases. Getting this wrong makes the scanner cry wolf, and you'll stop trusting it.
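A minimal sketch of that rule, assuming ASCII letters and digits are what counts as “inside a token”. `mentionsLabel` and `escapeRegExp` are hypothetical helpers written for this post, not the skill's API.

```typescript
// Escape regex metacharacters so a label like "P1+" is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Single-token labels match only on token boundaries;
// multi-word labels match as a case-insensitive phrase.
function mentionsLabel(text: string, label: string): boolean {
  const words = label.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return false;

  if (words.length > 1) {
    // Phrase match: collapse whitespace so line breaks don't hide the phrase.
    const haystack = text.toLowerCase().replace(/\s+/g, " ");
    return haystack.includes(words.join(" ").toLowerCase());
  }

  // Token match: the label may not touch a letter or digit on either side.
  const token = escapeRegExp(label.trim());
  const pattern = new RegExp(`(?<![A-Za-z0-9])${token}(?![A-Za-z0-9])`, "i");
  return pattern.test(text);
}

const naiveHit = "Size: Medium, standard format".toLowerCase().includes("m"); // true (false positive)
const tokenMiss = mentionsLabel("Size: Medium, standard format", "M");        // false
const tokenHit = mentionsLabel("Customer ordered size M.", "M");              // true
const phraseHit = mentionsLabel("we agreed this is high\npriority", "High Priority"); // true
```

The lookbehind and lookahead are used instead of `\b` so the boundary definition stays explicit and easy to widen, for example to non-ASCII letters.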

The static scan gives you a clean assertion to run in CI: raw projection leaks on every item; sanitized projection leaks on zero. If the sanitized side ever lights up, you missed a channel.
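That assertion might look like the following sketch. The item shape and both function names are hypothetical, and `leaks` is a stand-in scanner (whole-token, single-word labels only) so the example runs on its own; in practice you would call your real scanner there.

```typescript
interface EvalItem {
  id: string;
  expected: string;
  rawProjection: string;       // exactly what the model read before sanitization
  sanitizedProjection: string; // what it reads after
}

// Stand-in scanner for this sketch: case-insensitive whole-token match.
function leaks(projection: string, expected: string): boolean {
  const tokens = projection.toLowerCase().split(/[^a-z0-9]+/);
  return tokens.includes(expected.toLowerCase());
}

// Returns one message per violated expectation; an empty list means CI passes.
function checkLeakInvariant(items: EvalItem[]): string[] {
  const failures: string[] = [];
  for (const item of items) {
    if (!leaks(item.rawProjection, item.expected)) {
      failures.push(`${item.id}: raw projection does not leak (scanner or fixture is off)`);
    }
    if (leaks(item.sanitizedProjection, item.expected)) {
      failures.push(`${item.id}: sanitized projection still leaks (missed a channel)`);
    }
  }
  return failures;
}

const items: EvalItem[] = [
  {
    id: "t1",
    expected: "High",
    rawProjection: "priority: High | summary: Login broken after deploy",
    sanitizedProjection: "priority:  | summary: Login broken after deploy",
  },
  {
    id: "t2",
    expected: "Low",
    rawProjection: "priority: Low | comment: we agreed this is low priority",
    // The field was blanked, but the comment still states the answer.
    sanitizedProjection: "priority:  | comment: we agreed this is low priority",
  },
];

const failures = checkLeakInvariant(items);
// failures has one entry, for t2: the free-text channel was left open.
```

Checking the raw side as well guards against a scanner that silently stopped matching anything.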

2. A blind-baseline smoke test (the cheap proof). Run a baseline that genuinely cannot do the task (no reasoning, or no real input) against the eval. It should score near chance: for ~4 labels, ~0.25. If it scores 0.8, the answer is reachable without reasoning. This is the check that caught me, and the one I'd keep if I could only keep one, because it tests the thing you actually care about: can the answer be shortcut? The static scan tells you where. The blind baseline tells you whether it still matters after you sanitize.

A correctly fixed eval shows a meaningful score drop on the real model and a blind baseline back at chance. If the baseline still passes, you have not found all the channels, so go back to the scan.
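The arithmetic behind “near chance” is small enough to sketch. This assumes roughly balanced labels, so chance is 1 / labelCount; with skewed labels the majority-class rate is the fairer reference. `blindBaselineVerdict` and the 0.1 margin are illustrative choices of mine, not the skill's `summarizeBlindBaseline`.

```typescript
interface BaselineVerdict {
  passRate: number;    // fraction of items the blind baseline got right
  chance: number;      // expected pass rate from guessing among balanced labels
  suspicious: boolean; // true when the baseline beats chance by more than the margin
}

// results[i] is true when the blind baseline answered item i correctly.
function blindBaselineVerdict(
  results: boolean[],
  labelCount: number,
  margin = 0.1, // assumed tolerance; tune it to your dataset size
): BaselineVerdict {
  if (results.length === 0 || labelCount < 1) {
    throw new Error("need at least one result and one label");
  }
  const passed = results.filter((ok) => ok).length;
  const passRate = passed / results.length;
  const chance = 1 / labelCount;
  return { passRate, chance, suspicious: passRate > chance + margin };
}

// 16 of 20 correct with 4 labels: far above 0.25, so the answer is reachable.
const leaky = blindBaselineVerdict(Array.from({ length: 20 }, (_, i) => i < 16), 4);
// 5 of 20 correct with 4 labels: right at chance.
const clean = blindBaselineVerdict(Array.from({ length: 20 }, (_, i) => i < 5), 4);
```

On small datasets a fixed margin is crude; a pass rate a few points above chance can still be noise, so treat the flag as a prompt to go look, not as a verdict.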

Sanitizing without breaking the test

Closing leaks sounds like “delete the field.” It's more delicate, because you're operating on the thing you're measuring:

  • Preserve the observation signature. Keep the tool's name, description, and output schema byte-for-byte identical. Blank the values, don't change the shape. If get_ticket suddenly returns a different structure, the model behaves differently and your eval no longer reflects production.
  • Redact the span, not the document. Drop the leaking sentence or field; keep the rest. You still want the model reasoning over real content, since a blanked ticket tests nothing.
  • Walk every format. Redact both the plain text and the structured tree of the same content.
  • Emit redaction stats per item. Record what you stripped, so a reviewer can audit the sanitization itself.
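Put together, those rules can be sketched roughly as follows. Everything here is illustrative: `sanitize`, `Redactor`, and `redactTicket` are hypothetical names, the list of label fields is hard-coded for the example ticket, and the sentence splitter is deliberately simple.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface Redaction {
  path: string;         // where the change happened
  removedChars: number; // how much text was stripped
}

// A redactor returns the cleaned text for one string leaf.
type Redactor = (path: string, text: string) => string;

// Rebuild the value with identical keys and nesting; only string contents change.
function sanitize(value: Json, redact: Redactor, stats: Redaction[], path = ""): Json {
  if (typeof value === "string") {
    const cleaned = redact(path, value);
    if (cleaned !== value) {
      stats.push({ path, removedChars: value.length - cleaned.length });
    }
    return cleaned;
  }
  if (Array.isArray(value)) {
    return value.map((child, index) => sanitize(child, redact, stats, `${path}[${index}]`));
  }
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, child] of Object.entries(value)) {
      out[key] = sanitize(child, redact, stats, path ? `${path}.${key}` : key);
    }
    return out;
  }
  return value; // numbers, booleans, null pass through untouched
}

// Example policy: blank label fields, drop only the leaking sentence elsewhere.
const labelFields = new Set(["priority", "sla", "aiCategory"]);
const leakingSentence = /\bhigh priority\b/i;

const redactTicket: Redactor = (path, text) => {
  if (labelFields.has(path)) return ""; // blank the value, keep the key
  return text
    .split(/(?<=[.!?])\s+/) // naive sentence split
    .filter((sentence) => !leakingSentence.test(sentence))
    .join(" ");
};

const stats: Redaction[] = [];
const sanitized = sanitize(
  {
    summary: "Login broken after deploy",
    description: "Users can't sign in since 14:00.",
    priority: "High",
    sla: "P1",
    aiCategory: "Billing/High",
    comments: [{ body: "Rolled back at 14:30. We agreed this is high priority." }],
  },
  redactTicket,
  stats,
);
// sanitized keeps every key; comments[0].body is now "Rolled back at 14:30."
// stats lists priority, sla, aiCategory, and comments[0].body.
```

Blanking to an empty string keeps the key and the value type in place; whether an empty string or a neutral placeholder is closer to production is a judgment call per field.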

I extracted the reusable part and open-sourced it

The leak taxonomy, the sanitization rules, and the zero-dependency scanner aren't specific to my system, so I pulled them out of the eval and published them as an agent skill: a Markdown playbook (the four channels, the workflow, the anti-patterns) plus leak_scan.ts with auditItem, scanText, scanFields, and summarizeBlindBaseline. No dependencies, drops into any eval harness.

It's structured as a skill on purpose: it's a checklist a coding agent (or a teammate) can follow step by step the next time someone says “this eval scores suspiciously high.” Run the self-check to see all four channels detected and the blind-baseline aggregator flag a too-high pass rate:

npx tsx scripts/leak_scan.ts   # self-check: direct, correlated, nested-path, token-safety

Repo and skill: github.com/nbialk/eval-skills.

The checklist

Next time you build or inherit a RAG/agent eval:

  1. Name the target in one sentence: “derive X from the content.”
  2. List every surface the model sees, not just the prompt: input fields, retrieved context, tool outputs, upstream-stage outputs.
  3. Audit each surface against all four channels (direct, correlated, upstream, free-text). Scan structured and text formats.
  4. Run a blind baseline. Near chance = clean. High = you're leaking.
  5. Sanitize the span, keep the schema, emit stats.
  6. Re-run the baseline. Real model drops, blind model back at chance, or you missed a channel.