Work

Agentic Customer Support

An internal support assistant that runs a ticket through a multi-stage pipeline and then resolves it with a streaming agent. The agent calls tools against a mirrored knowledge base, retrieves context, and proposes actions that a human approves before they run against the team's systems.

Next.js
Vertex AI (Gemini)
pgvector
GCP

Markus Lehmann

m.lehmann@example.com

Open

Can't access my contract documents after login

Since this morning I get an error opening my contracts. I need them for a signing today. This is urgent.

Support assistant

Message the assistant…

A scripted walkthrough of one ticket, from inbox to resolved. Schematic; hover the chips for detail.

A ticket lands from the support inbox. The assistant enriches it first (it scores sentiment, matches the customer, and assigns a category), then a support agent asks it to resolve the case. It pulls the ticket and the customer, reasons over the context it retrieves, and drafts a reply. Nothing leaves the system until a human approves it, and then the ticket moves to done.

Retrieval works two ways. The ticket context is always injected, and on top of that the agent searches the mirrored knowledge base when it needs to. It only reads from a synced database with vector search, so it never queries the external systems live, and every write action is held back as a preview for a human to approve. Each turn is traced end to end, down to the tools it called, the sources it used, and the tokens it spent.

Evals & experiments

Shipping an agent is straightforward. The harder problem is telling whether a change actually improved it. So every prompt change runs against a curated goldset and gets a score before it ships, instead of relying on a gut read. The walkthrough below follows the email-draft prompt from a new version to a go/no-go decision.

Eval experimentemail-draft · run #17

No judge model — deterministic scoring only.

Scoring run…

A scripted eval run for the email-draft prompt: prompt, goldset, deterministic checks, and a per-check version comparison. Pass rates are illustrative. Hover the labels for detail.

Prompts are versioned in Langfuse with caching and a local fallback, so they change without a deploy and every version stays recoverable. Each one runs against a goldset of real-shaped cases, and the scorer fits the task. An LLM-as-judge grades open answers for correctness and faithfulness. Structured drafts get checked by code instead: the right greeting, the phrases that have to be there, the ones that must not, and no leaked template examples.

A new version only ever gets compared against the one in production, on the same goldset, and every run is traced. That way the decision to ship rests on the numbers rather than a hunch.