How I Built Tooling to Debug AI Agents in Production
- Harshal

The observability hub I used to debug agent behavior from the user’s view
In a recent role, I built tooling to debug a complex AI agent's behavior in production. This post describes the problem, the implementation approach, and what worked and what failed.

Problem Context
The challenge: while my team and I were building an AI agent that builds workflow automations, it was difficult to identify or isolate which user inputs the agent could not handle. What patterns appeared across user inputs? What were the agent's most common errors? How could we study user failure modes so that we could create evals for the agent?
We had access to LLM trace collection tools like LangSmith, but that was not sufficient. We needed to see what the user saw: their prompts, how the builder responded, and how the workflow changed over time. The agent's output was both a response to the user and a structured workflow JSON, and we needed that JSON in the user's context instead of reading raw JSON manually or asking another model to interpret it in a vacuum. A quick glance at the product might show a stray node or a piece of user feedback, but debugging required tying that to the full context. Read more at What AI Observability Needs to Capture
Solution Overview
The observability tool was internal. It combined prompts, workflow state, telemetry, and evaluation data so we could move from “something went wrong” to a concrete repro and fix.
It gathered the main signals in one place. We debugged from the user's actual experience instead of isolated traces.
It also sped up evaluations and iteration on the agent's context:
Review evaluation results in the tool instead of opening LangSmith.
Run evaluations locally from the web app by bridging the browser to a machine-local runner (see "Local eval runs from the hosted UI").
Test LLM-as-a-judge evaluation specifications locally before pushing them to LangSmith.
App Optimizations
Over time, I made the tool more efficient by optimizing the data fetching and rendering:
Progressive loading and lighter default views
Lazy-loaded heavy fields (such as large workflow payloads) so first content load was faster and less data loaded overall.
Loaded panels progressively from left (high level) to right (detail).
Shipped a browse mode for fast scanning (lighter rows and key metadata). The default inspect view loaded full workflow detail when we needed it.
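The browse/inspect split above can be sketched roughly as follows. This is an illustrative sketch, not the production code; the field names, `toBrowseRow`, and `makeInspectLoader` are hypothetical. The idea is that browse mode keeps only light metadata, while inspect mode lazy-loads and caches the heavy workflow payload on first access.

```javascript
// Browse mode: keep only the light fields needed for fast scanning.
function toBrowseRow(run) {
  const { id, user, status, createdAt } = run;
  return { id, user, status, createdAt };
}

// Inspect mode: lazy-load the heavy workflow JSON once per run id,
// caching it so repeat inspections do not refetch.
function makeInspectLoader(fetchWorkflowById) {
  const cache = new Map();
  return async function inspect(run) {
    if (!cache.has(run.id)) {
      cache.set(run.id, await fetchWorkflowById(run.id));
    }
    return { ...toBrowseRow(run), workflow: cache.get(run.id) };
  };
}
```

The payoff is that the default list view never touches the heavy payload at all; it only loads when someone opens a specific run.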
Queries, backends, and long-running fetches
Split SQL so most queries stayed fast, and kept a slow heavy path for slicing and filtering every dimension we needed.
The LangSmith read API was severely rate-limited. I moved to async ETL from LangSmith to BigQuery, then read eval results from BigQuery.
Set longer timeouts for heavy queries and showed elapsed time during long backend fetches.
Filters, cache, and remembered state
Grew from 1 filter to 20+ filters for user, workflow, and failure mode.
Cached data in the browser to cut repeat fetches.
Stored user preferences in browser local storage for the last-used filters and views.
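A minimal sketch of the remembered-state piece, assuming a single storage key for the last-used filters (the key name and function names are illustrative). The storage object is injectable, so the same logic works against `window.localStorage` in the browser and a fake in tests.

```javascript
// Illustrative key; the real app would namespace its own preferences.
const PREFS_KEY = 'debug-hub:last-filters';

// Persist the last-used filters so the next visit restores them.
function saveFilters(storage, filters) {
  storage.setItem(PREFS_KEY, JSON.stringify(filters));
}

// Load saved filters, falling back to defaults on first visit.
function loadFilters(storage, fallback = {}) {
  const raw = storage.getItem(PREFS_KEY);
  return raw ? JSON.parse(raw) : fallback;
}
```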
Observability signals that actually helped
These signals shortened the loop between report and fix:
User and session context
User prompt and surrounding session context.
Feedback the user left in the product, linked to the same session.
Notes we attached (for example accidental stop versus needs help) so product context did not get lost.
Agent behavior
Agent steps, tool usage, and structured output relevant to the workflow.
Templates involved in generation when we wanted to improve template-driven behavior.
Errors, retries, and final status, plus taxonomy-style tags so we could prioritize edge cases.
Deep links
Before-and-after views of workflow state where it mattered for debugging.
Shortcuts into LLM trace tools (for example LangSmith) and into session recording tools (for example PostHog).
Made the app's deep links shareable.
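A shareable deep link can be as simple as encoding the current session and filters into the URL, so a teammate who opens the link lands on the same view. A sketch, with hypothetical parameter names (`session`, `f.*`) rather than the app's actual scheme:

```javascript
// Build a URL that captures the current session and active filters.
function buildDeepLink(baseUrl, { sessionId, filters }) {
  const url = new URL(baseUrl);
  url.searchParams.set('session', sessionId);
  for (const [key, value] of Object.entries(filters)) {
    url.searchParams.set(`f.${key}`, String(value));
  }
  return url.toString();
}
```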
Rich filtering and query performance
I added filters for cohort-style questions (who activated, who churned), error types, workflow shape, and more. The goal was to replace ad-hoc SQL in a generic BI tool with a purpose-built surface for this agentic product. Problem discovery became a few clicks instead of a new query each time.
The data started as a single SQL path. As the filters multiplied, that query slowed to roughly 70 seconds. Guided by query performance and usage patterns, I split the logic into multiple paths: a fast path for common views (under one second) and a heavier path for the full picture. Day-to-day browsing stayed responsive while deep dives remained possible when needed.
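The routing decision between the two paths can be sketched as a simple predicate: if every active filter is one the indexed fast query can answer, use it; otherwise fall back to the heavy query that joins all dimensions. The filter names below are illustrative, not the production schema.

```javascript
// Filters the indexed fast query can answer directly (illustrative).
const FAST_PATH_FILTERS = new Set(['user', 'status', 'dateRange']);

// Pick the fast path only when every requested filter is covered by it;
// any rarer dimension forces the heavy query.
function pickQueryPath(activeFilters) {
  const fastOnly = activeFilters.every((f) => FAST_PATH_FILTERS.has(f));
  return fastOnly ? 'fast' : 'heavy';
}
```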
Closing the loop: evaluations and scoring
The UI sped up running and reviewing evals, but eval definitions still had to stay trustworthy. We kept one maintained record per eval (user prompt plus related metadata), and n8n workflows handled create, edit, and delete so our store and LangSmith did not drift.
Scores next to the workflow mattered more than scores alone inside the trace vendor. After eval results flowed into BigQuery, we filtered and compared runs in bulk, then opened the prompt viewer when we needed the visual workflow.
We validated changes two ways: diff two selected runs for tight before-and-after checks, and session-level views when many runs belonged to one larger change. Warehouse-backed slices complemented the in-tool screens when we needed aggregate views with the same context.
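The run-diff check can be sketched as a set comparison over node ids: report which nodes were added and which were removed between a before run and an after run. This is an illustrative sketch (`diffWorkflows` and the `{ nodes: [{ id }] }` shape are assumptions, not the real schema).

```javascript
// Diff two workflow snapshots by node id, reporting additions and removals.
function diffWorkflows(before, after) {
  const beforeIds = new Set(before.nodes.map((n) => n.id));
  const afterIds = new Set(after.nodes.map((n) => n.id));
  return {
    added: after.nodes.filter((n) => !beforeIds.has(n.id)).map((n) => n.id),
    removed: before.nodes.filter((n) => !afterIds.has(n.id)).map((n) => n.id),
  };
}
```

A real diff would also compare node parameters and connections, but even this id-level view answers the first question in a before-and-after check: did the change add or drop a node it should not have?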
Local eval runs from the hosted UI
The hosted app could not start shell commands on my machine. I still wanted one-click local runs from the same UI we used for everything else, with the same pre-filtered eval commands we already surfaced.
I shipped a small loop: a localhost listener that runs the eval CLI, plus a Chrome extension scoped to our app origin. A button in the app sent a narrowly shaped request. The extension forwarded it only for that site and that command format. The listener ran the eval and streamed progress back.
That pattern had more moving parts than pasting into a terminal. It kept the mental model in one place: reproduce from the viewer, iterate locally, and see status without leaving the tab. For similar bridges, spell out trust boundaries (origin, message shape) so convenience does not become a generic open door.
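The trust boundary on the listener side can be sketched as a strict request check: one allowed origin, one allowed command, and no free-form shell input. The origin, command name, and payload shape below are illustrative, not the actual bridge protocol.

```javascript
// Only this origin and this command shape pass; everything else is refused.
const ALLOWED_ORIGIN = 'https://hub.example.com';
const ALLOWED_COMMANDS = new Set(['run-eval']);

function isAllowedRequest(origin, payload) {
  if (origin !== ALLOWED_ORIGIN) return false;
  if (!payload || !ALLOWED_COMMANDS.has(payload.command)) return false;
  // Only an identifier-shaped eval id is passed to the CLI; no shell strings.
  return typeof payload.evalId === 'string' && /^[\w-]+$/.test(payload.evalId);
}
```

The listener would then map the validated id to a fixed CLI invocation rather than interpolating anything the browser sent.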
Small tools that paid off
Two examples where small tools improved the UX instead of sticking to a spreadsheet:
Node type lookup: evaluations referred to node types rather than display names, which made the checks more explicit. I built a quick search helper sorted by how often each type appeared, so heavily used types were easier to spot. That beat Ctrl+F across a spreadsheet or database for repeat work, and a user-level preference remembered the most-used items.
Architecture map: the stack spanned a visual frontend, serverless backends, workflow automation, and warehouse SQL. I had been relying on browser history to find the right component to edit, and my team wanted to isolate failures to a specific endpoint whenever the app broke. So I made a status page plus a simple map of pages, edge functions, and which workflows touched which parts, which made onboarding and edits faster for anyone touching the system.
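The frequency-sorted node type lookup can be sketched in a few lines: count occurrences across workflows, filter by the search query, and rank matches by usage. Function and field names here are illustrative.

```javascript
// Rank node types matching `query` by how often they appear overall.
function rankNodeTypes(occurrences, query) {
  const counts = new Map();
  for (const type of occurrences) {
    counts.set(type, (counts.get(type) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([type]) => type.toLowerCase().includes(query.toLowerCase()))
    .sort((a, b) => b[1] - a[1])
    .map(([type]) => type);
}
```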
Tech Stack
At a high level, the stack looked like this:
Frontend: React/Vite for the web app. Node/JavaScript for the web app backend.
Backend and data: Supabase as backend-for-frontend using edge functions and Supabase Storage as a cache. n8n for CRUD APIs on data.
Warehouse: BigQuery as the data source for customer usage.
Automation: n8n for pipelines that reacted to events (evaluations created, sync jobs, maintenance).
AI traces: LangSmith for traces and evaluation runs.
Analytics: Notion Database to track evaluation results and the evaluation dataset.
Other platforms: PostHog for session replay, linked from the same hub when it helped. We linked out rather than building integrations.
Local eval bridge: a Manifest V3 Chrome extension (content script and service worker) plus a small Node.js HTTP server on localhost that executed allowlisted actions. No separate framework in that layer: plain extension APIs and a tiny Node process, loaded unpacked for development.
Next Steps
Hard problems still need human judgment. The hub cut search time; it did not replace diagnosis.
This observability stack helped my team and me improve our AI agent, but it did not magically solve every problem.
I did not include other screenshots from the project, but I am happy to chat about it if that would help you.