What AI Observability Needs to Capture
- Harshal

Funnels, logs, and LLM traces combined cannot explain AI agent failures
You tell an AI coding agent to update a script. The agent reports success. You check your files and find that the agent changed the wrong file. What happened? If you only observe the prompt and output text, the run still looks successful. You can only evaluate the agent's performance by seeing the before and after file state and the UI context that guided the agent.
This article explains what to observe and why it matters. AI observability must connect user context, agent actions, and product state changes.
This article is for product managers shipping AI features. It takes about four minutes to read.

Related:
What Observability Means in SaaS
SaaS product managers already know the goal of observability, telemetry, and user research: understanding user journeys and pain points at scale.
Traditional SaaS observability combines interviews, session recordings, support escalations, product analytics, and backend telemetry. This stack answers funnel and usage questions well, including where users click, where users drop, and which features users activate.

What AI Observability Needs to Capture
Traditional observability answers event questions, for example: "Did the user click this button?" AI observability must answer state questions: what context shaped the agent decision, what the agent changed, and what the user saw after each change.
Standard tools are necessary but incomplete for this job. Product analytics tools like Amplitude, Mixpanel, and Datadog are built for events and funnels. LLM trace tools like LangSmith and Braintrust are built for prompts, model calls, and tool calls. AI product debugging needs both layers connected to product state.
For AI features, you need to capture three aspects that standard tools do not fully capture:
User context (input, visible UI, prior actions, and relevant product state)
Agent actions (reads, writes, tool calls, and user-visible outputs)
State transitions (before and after content, config, permissions, and execution status)
Without this loop, teams tune prompts but still miss product failures. Teams often need an internal tool or a dedicated pipeline to stitch these records together. Core takeaway: if you cannot reconstruct user context, agent actions, and state changes, you cannot reliably improve AI behavior.
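One way to make "stitched together" concrete is a single record per agent action that holds all three aspects. This is a minimal sketch, not a production schema; the `AgentActionRecord` fields and `can_reconstruct` check are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical record: one entry per agent action, holding user context,
# the action itself, and the state transition in a single row.
@dataclass
class AgentActionRecord:
    run_id: str
    user_prompt: str                  # user context: what the user asked
    visible_context: dict[str, Any]   # user context: UI/state the user could see
    action: str                       # agent action: read, write, or tool call
    action_args: dict[str, Any]
    state_before: dict[str, Any]      # state transition: snapshot before
    state_after: dict[str, Any]       # state transition: snapshot after

def can_reconstruct(record: AgentActionRecord) -> bool:
    """A run is debuggable only if context, action, and state change are all present."""
    return bool(record.user_prompt and record.action
                and record.state_before and record.state_after)

# The wrong-file failure from the intro: the prompt and output look fine,
# but the state fields show the agent wrote build.sh, not deploy.sh.
record = AgentActionRecord(
    run_id="run-42",
    user_prompt="update the deploy script",
    visible_context={"open_file": "deploy.sh"},
    action="write_file",
    action_args={"path": "build.sh"},
    state_before={"build.sh": "old contents"},
    state_after={"build.sh": "new contents"},
)
```

The point of the single-record shape is that a reviewer (or an evaluator) never has to join three systems to answer "what did the agent see, do, and change?"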
Examples: What to Observe in Different AI Products
Use the same observability pattern across products: capture context, actions, and state changes.
For a coding agent, capture:
workspace context (open files and active surface, IDE or CLI)
agent context retrieval actions (search and command trail)
code before/after at each significant agent action
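For the before/after capture, a unified diff per significant edit is often enough to spot "changed the wrong file" at review time. A minimal sketch using Python's standard `difflib`; `snapshot_edit` is a hypothetical helper name.

```python
import difflib

# Hypothetical capture of code before/after at one agent edit.
# Storing both raw snapshots and a rendered diff keeps the record
# reviewable by humans and parseable by evaluators.
def snapshot_edit(path: str, before: str, after: str) -> dict:
    diff = "\n".join(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile=f"{path} (before)", tofile=f"{path} (after)", lineterm="",
    ))
    return {"path": path, "before": before, "after": after, "diff": diff}

edit = snapshot_edit("deploy.sh", "echo deploy v1\n", "echo deploy v2\n")
```

In practice you would attach this record to the agent action that produced it, alongside the search-and-command trail that led the agent to that file.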

For a workflow automation agent, capture:
workflow graph and current step context
user and agent turns, including edits to steps
workflow before/after and sample execution data
For a Notion-like agent, capture:
user prompt, agent response, and selected section context
document or database before/after
recent user edits and sample data used in formula changes

End-to-End Example: Debugging a Failed AI Workflow Fix
A user asks a workflow-building agent to "fix the error." The agent reports success and says the workflow is ready. The user runs it, and the run still fails.
The product manager reviews the incident step by step. They compare the workflow state before and after the edit. They check the exact error message and the failing node. They verify the sample input that triggered the failure. They then check the integration context for that user, including which credentials passed and which failed.
The investigation exposes the real issue. The agent could see only the workflow graph. It had no view of integration health or the failing data path, so it guessed a fix and edited the wrong part.
The team then added a product-level observability flow with five controls:
Capture workflow state before and after every agent edit.
Capture integration status before any agent fix attempt.
Capture node-level errors with sample payloads.
Require the agent to cite the exact node it changed and explain why.
Block success messages until a validation run passes.
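The last control, gating the success message on a validation run, can be sketched in a few lines. This is an illustrative stub, not a real workflow engine: `report_fix`, `run_workflow`, and the result dict shape are all hypothetical.

```python
# Hypothetical validation gate: the agent's "fixed it" message is held back
# until a validation run of the workflow succeeds on the failing sample input.
def report_fix(agent_message: str, run_workflow, sample_input) -> str:
    result = run_workflow(sample_input)  # run_workflow stands in for your executor
    if result.get("status") != "success":
        return (f"Fix not confirmed: validation run failed at node "
                f"{result.get('failing_node')} with error {result.get('error')!r}.")
    return agent_message

# Executor stub reproducing the incident: the original payload still fails.
def failing_run(_payload):
    return {"status": "error", "failing_node": "http_request",
            "error": "401 Unauthorized"}

msg = report_fix("Workflow is ready.", failing_run, {"order_id": 7})
```

With this gate in place, the incident above would have surfaced as "validation run failed at node http_request" instead of a false success report.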
How to Start AI Observability
Start with one common job, such as updating a workflow step or rewriting a document section. Capture the user prompt, visible context, and before/after state for that first action. Do not instrument the full agent at the start.
Own this layer as a product responsibility. Build an observability pipeline that integrates with your existing stack.
Making this actionable for product managers:
Start with failures that current logs cannot explain.
For agents that change state, start with the first agent action that changes state.
For copilots that advise or answer, start with the first agent response.
Expand in stages:
Stage 1: capture the prompt, visible context, and final output.
Stage 2: capture tool calls or agent actions.
Stage 3: capture before/after resource state.
Stage 4: add replay views or evaluator layers to review failures at scale.
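Stage 1 can be as simple as one JSONL record per run; later stages extend the record with actions and state without changing the storage shape. A minimal sketch, assuming a local append-only log file; `log_stage1` and the field names are illustrative.

```python
import json
import os
import tempfile
import time

# Hypothetical Stage 1 logger: append prompt, visible context, and final
# output as one JSON line per run. Stages 2-3 would add "actions" and
# "state_before"/"state_after" fields to the same record.
def log_stage1(log_path: str, prompt: str, visible_context: dict, output: str) -> dict:
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "visible_context": visible_context,
        "output": output,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

path = os.path.join(tempfile.gettempdir(), "ai_obs_stage1.jsonl")
rec = log_stage1(path, "rewrite the intro section", {"doc_id": "doc-1"}, "Done.")
```

JSONL is a deliberate choice here: it loads cleanly into BigQuery or any data lake later, so Stage 1 records remain usable when Stage 4 tooling arrives.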
My approach was to connect LangSmith, BigQuery (data lake), PostHog session recordings, Anthropic LLM-as-judge evaluations, and a product-specific viewer that recreated what users saw. The trade-off is added maintenance work. The payoff is faster debugging and better decisions on what to improve next.
I will share my implementation example in an upcoming article.