Agentic Customer Support
An internal support assistant that runs a ticket through a multi-stage pipeline and then resolves it with a streaming agent. The agent calls tools against a mirrored knowledge base, retrieves context, and proposes actions that a human approves before they run against the team's systems.
- Next.js
- Vertex AI (Gemini)
- pgvector
- GCP
From: Markus Lehmann <m.lehmann@example.com>
Subject: Can't access my contract documents after login
Since this morning I get an error opening my contracts. I need them for a signing today. This is urgent.
A ticket lands from the support inbox. The assistant enriches it first by scoring sentiment, matching the customer, and assigning a category. A human support agent then asks the assistant to resolve the case: it pulls the ticket and the customer, reasons over the context it retrieves, and drafts a reply. Nothing leaves the system until a human approves it, and once approved the ticket moves to done.
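The lifecycle described above can be read as a small state machine. The following is only a sketch under assumptions: the status names, the `Ticket` shape, and the helpers `enrich`, `attachDraft`, and `approve` are hypothetical and not taken from the project. The enrichment values are passed in rather than computed, since the model calls that would produce them are out of scope here.

```typescript
// Hypothetical ticket lifecycle. Status names and field names are
// assumptions for illustration, not the project's actual schema.
type TicketStatus = "new" | "enriched" | "draft_pending_approval" | "done";

interface Enrichment {
  sentiment: number;          // assumed scale; produced by a model upstream
  customerId: string | null;  // null when no customer could be matched
  category: string;
}

interface Ticket {
  id: string;
  from: string;
  subject: string;
  body: string;
  status: TicketStatus;
  enrichment?: Enrichment;
  draftReply?: string;
}

// Each status lists the only statuses it may move to next.
const transitions: Record<TicketStatus, TicketStatus[]> = {
  new: ["enriched"],
  enriched: ["draft_pending_approval"],
  draft_pending_approval: ["done"],
  done: [],
};

// Returns a copy of the ticket in the next status, or throws if the
// step would skip a stage (for example, approving before a draft exists).
function advance(ticket: Ticket, next: TicketStatus): Ticket {
  if (transitions[ticket.status].indexOf(next) === -1) {
    throw new Error(`Illegal transition ${ticket.status} -> ${next}`);
  }
  return { ...ticket, status: next };
}

function enrich(ticket: Ticket, enrichment: Enrichment): Ticket {
  return { ...advance(ticket, "enriched"), enrichment };
}

function attachDraft(ticket: Ticket, draftReply: string): Ticket {
  return { ...advance(ticket, "draft_pending_approval"), draftReply };
}

// Human approval is the only way to reach "done".
function approve(ticket: Ticket): Ticket {
  return advance(ticket, "done");
}
```

Keeping the allowed transitions in one table means that a ticket cannot reach `done` without passing through the approval step, which is the property the paragraph above relies on.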
Retrieval works two ways. The ticket context is always injected, and on top of that the agent searches the mirrored knowledge base when it needs to. The agent reads only from a synced database with vector search, so it never queries the external systems live, and every write action is held back as a preview for a human to approve. Each turn is traced end to end, down to the tools it called, the sources it used, and the tokens it spent.
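One way to hold write actions back as previews is an approval gate that stores the proposed call and runs it only when a human approves. This is a minimal sketch, not the project's implementation: `ApprovalGate`, `ProposedAction`, and the executor callback are hypothetical names introduced for illustration.

```typescript
// Hypothetical approval gate: the agent can only propose a write,
// and the side effect runs when a human calls approve().
type ActionStatus = "pending" | "approved" | "rejected";

interface ProposedAction {
  id: number;
  tool: string;                   // name of the write tool the agent wants to call
  args: Record<string, unknown>;  // arguments the agent proposed
  preview: string;                // human-readable summary shown to the reviewer
  status: ActionStatus;
}

class ApprovalGate {
  private actions: Record<number, ProposedAction> = {};
  private nextId = 1;

  // Called on behalf of the agent. Nothing is executed here.
  propose(tool: string, args: Record<string, unknown>, preview: string): ProposedAction {
    const action: ProposedAction = { id: this.nextId++, tool, args, preview, status: "pending" };
    this.actions[action.id] = action;
    return action;
  }

  // Called by a human reviewer. The status flips before execution so the
  // same action cannot be run twice, even if the executor throws.
  approve<T>(id: number, execute: (action: ProposedAction) => T): T {
    const action = this.requirePending(id);
    action.status = "approved";
    return execute(action);
  }

  reject(id: number): void {
    this.requirePending(id).status = "rejected";
  }

  private requirePending(id: number): ProposedAction {
    const action = this.actions[id];
    if (!action || action.status !== "pending") {
      throw new Error(`No pending action with id ${id}`);
    }
    return action;
  }
}
```

In this shape the agent's tool handler would call `propose` and return the preview text to the conversation, while the code that actually touches an external system lives only in the executor passed to `approve`.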
Evals & experiments
Shipping an agent is straightforward. The harder problem is telling whether a change actually improved it. So every prompt change runs against a curated goldset and gets a score before it ships, instead of relying on a gut read. The walkthrough below follows the email-draft prompt from a new version to a go/no-go decision.
Prompts are versioned in Langfuse with caching and a local fallback, so they change without a deploy and every version stays recoverable. Each version runs against a goldset of realistic cases, with a scorer that fits the task. An LLM-as-judge grades open answers for correctness and faithfulness. Structured drafts get checked by code instead: the right greeting, the phrases that have to be there, the ones that must not, and no leaked template examples.
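The code-based checks for structured drafts could look roughly like the following. This is a sketch under assumptions: the `DraftCase` fields, the check names, and the equal weighting in `score` are hypothetical, and matching is done with simple case-insensitive substring tests rather than whatever the project actually uses.

```typescript
// Hypothetical goldset case for an email draft. Field names are assumptions.
interface DraftCase {
  expectedGreeting: string;    // the draft's first line must start with this
  requiredPhrases: string[];   // every one must appear
  forbiddenPhrases: string[];  // none may appear
  templateExamples: string[];  // example text from the prompt that must not leak
}

interface CheckResult {
  check: string;
  passed: boolean;
}

// Case-insensitive substring test.
function contains(haystack: string, needle: string): boolean {
  return haystack.toLowerCase().indexOf(needle.toLowerCase()) !== -1;
}

function checkDraft(draft: string, c: DraftCase): CheckResult[] {
  const firstLine = draft.trim().split("\n")[0];
  return [
    { check: "greeting", passed: firstLine.indexOf(c.expectedGreeting) === 0 },
    { check: "required_phrases", passed: c.requiredPhrases.every((p) => contains(draft, p)) },
    { check: "forbidden_phrases", passed: !c.forbiddenPhrases.some((p) => contains(draft, p)) },
    { check: "no_template_leak", passed: !c.templateExamples.some((p) => contains(draft, p)) },
  ];
}

// Fraction of checks that passed, from 0 to 1. Equal weighting is an assumption.
function score(results: CheckResult[]): number {
  return results.filter((r) => r.passed).length / results.length;
}
```

Because these checks are deterministic, the same draft always gets the same score, which makes them a useful complement to the LLM-as-judge used for open answers.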
A new version only ever gets compared against the one in production, on the same goldset, and every run is traced. That way the decision to ship rests on the numbers rather than a hunch.
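The comparison step can be sketched as a function that takes two runs over the same goldset and returns a decision. Everything named here is an assumption for illustration: `RunResult`, `ShipDecision`, `compareRuns`, and in particular the ship rule (candidate mean must beat production mean by at least `minDelta`), which stands in for whatever threshold the team actually applies.

```typescript
// One scored goldset case from a run. Scores are assumed to lie in [0, 1].
interface RunResult {
  caseId: string;
  score: number;
}

interface ShipDecision {
  productionMean: number;
  candidateMean: number;
  regressions: string[];  // case IDs where the candidate scored lower
  ship: boolean;
}

function meanScore(run: RunResult[]): number {
  return run.reduce((sum, r) => sum + r.score, 0) / run.length;
}

// Compares a candidate prompt version with production on the same goldset.
// Hypothetical rule: ship when the mean improves by at least minDelta.
function compareRuns(
  production: RunResult[],
  candidate: RunResult[],
  minDelta: number = 0
): ShipDecision {
  if (production.length === 0 || candidate.length !== production.length) {
    throw new Error("Runs must cover the same non-empty goldset");
  }
  const prodById: Record<string, number> = {};
  production.forEach((r) => { prodById[r.caseId] = r.score; });

  const regressions: string[] = [];
  candidate.forEach((r) => {
    if (!(r.caseId in prodById)) {
      throw new Error(`Case ${r.caseId} is missing from the production run`);
    }
    if (r.score < prodById[r.caseId]) regressions.push(r.caseId);
  });

  const productionMean = meanScore(production);
  const candidateMean = meanScore(candidate);
  return {
    productionMean,
    candidateMean,
    regressions,
    ship: candidateMean - productionMean >= minDelta,
  };
}
```

Returning the list of regressed case IDs alongside the means lets a reviewer see whether an overall gain hides a specific case that got worse.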