I Tracked What AI Agents Actually Do When Nobody Is Watching
Observability for AI agents in production -- tracking tool calls, data quality, latency, and silent failures.
You deploy an AI agent that calls external tools -- search APIs, databases, third-party services. It works in development. Then in production, it starts returning bad answers, burning through API credits, or silently failing on edge cases. You have no idea why because you never instrumented the tool calls.
Observability for AI agents is not optional. This post covers what to track, how to structure your logs, and what patterns to watch for when your agent makes real-time search calls in production.
What Makes Agent Observability Different
Traditional application monitoring tracks request/response cycles. Agent observability tracks reasoning chains. A single user request might trigger 5-10 tool calls, each with its own latency, cost, and failure mode. You need to see the full chain, not just the final output.
The key metrics for agent tool calls are:
- Tool call frequency: which tools get called and how often
- Latency per call: how long each tool takes to respond
- Data quality: whether the returned data was useful to the agent
- Error rate: timeouts, rate limits, malformed responses
- Cost per chain: total API credits consumed per user request
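Most of these metrics fall out of structured logs once every call carries a trace ID. A minimal sketch of grouping per-call records into per-request chains (the record shape and the costUsd field are illustrative assumptions; cost would come from your provider's pricing for each tool):

```typescript
// Illustrative log record -- not a real library type
interface CallRecord {
  traceId: string;
  tool: string;
  latencyMs: number;
  costUsd: number; // assumed: derived from your provider's pricing
}

function summarizeChains(logs: CallRecord[]) {
  // Group per-call records by trace ID to reconstruct each chain
  const chains = new Map<string, CallRecord[]>();
  for (const log of logs) {
    chains.set(log.traceId, [...(chains.get(log.traceId) ?? []), log]);
  }
  // One summary row per user request: call count, total latency, total cost
  return Array.from(chains.entries()).map(([traceId, calls]) => ({
    traceId,
    callCount: calls.length,
    totalLatencyMs: calls.reduce((sum, c) => sum + c.latencyMs, 0),
    totalCostUsd: calls.reduce((sum, c) => sum + c.costUsd, 0)
  }));
}
```

This is also where cost spikes become visible: sort the summary rows by totalCostUsd and the expensive outlier chains surface immediately.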
Wrapping Tool Calls With Telemetry
The simplest approach is a wrapper function that logs every tool call before and after execution:
interface ToolCallLog {
  traceId: string;
  tool: string;
  input: Record<string, unknown>;
  output: unknown;
  latencyMs: number;
  error?: string;
  timestamp: string;
}
async function trackedToolCall(
  traceId: string,
  tool: string,
  input: Record<string, unknown>,
  execute: () => Promise<unknown>
): Promise<unknown> {
  const start = Date.now();
  try {
    const result = await execute();
    const log: ToolCallLog = {
      traceId,
      tool,
      input,
      output: result,
      latencyMs: Date.now() - start,
      timestamp: new Date().toISOString()
    };
    await persistLog(log);
    return result;
  } catch (err) {
    const log: ToolCallLog = {
      traceId,
      tool,
      input,
      output: null,
      latencyMs: Date.now() - start,
      error: String(err),
      timestamp: new Date().toISOString()
    };
    await persistLog(log);
    throw err;
  }
}

Tracking Search Quality
When your agent calls a search API like Scavio, the raw response is structured JSON with organic results, knowledge graph data, and related questions. But not every search returns useful data. Track these quality signals:
- Result count: did the search return enough results?
- Relevance: did the agent use the results or ignore them?
- Retries: did the agent rephrase the query and search again?
- Fallbacks: did the agent switch to a different tool?
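The retry and fallback signals have to be derived from the call sequence itself rather than from any single response. A sketch, assuming each call in a chain was logged with its tool name and query (the record shape is illustrative):

```typescript
// Illustrative shape for one logged call within a single chain
interface SearchCallRecord {
  tool: string;  // e.g. "web_search", or another tool used as a fallback
  query: string;
}

function countRetrySignals(chain: SearchCallRecord[], searchTool = "web_search") {
  const searches = chain.filter(c => c.tool === searchTool);
  const distinctQueries = new Set(searches.map(c => c.query.trim().toLowerCase()));
  return {
    searchCount: searches.length,
    // More searches than distinct queries means at least one exact repeat
    exactRepeats: searches.length - distinctQueries.size,
    // More than one distinct query means the agent rephrased and retried
    rephrases: Math.max(0, distinctQueries.size - 1),
    // Calls to any other tool in the same chain count as fallbacks here
    fallbackCalls: chain.length - searches.length
  };
}
```

The static per-response signals can then be assessed separately: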
function assessSearchQuality(query: string, results: any) {
  return {
    query,
    resultCount: results.organic?.length ?? 0,
    hasKnowledgeGraph: !!results.knowledgeGraph,
    hasPeopleAlsoAsk: (results.peopleAlsoAsk?.length ?? 0) > 0,
    // Guard against a missing snippet before comparing its length
    topResultRelevance: (results.organic?.[0]?.snippet?.length ?? 0) > 50
  };
}

Detecting Failure Patterns
After collecting logs for a few days, look for these patterns:
- Repeated searches: the agent searches for the same thing multiple times in one chain, indicating the results were not useful
- Tool call spirals: the agent calls 10+ tools for a simple question, suggesting a prompt or parsing issue
- Silent failures: the tool returns an error but the agent generates a confident-sounding answer anyway
- Cost spikes: certain user queries trigger disproportionately expensive tool call chains
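Two of these patterns -- spirals and error-containing chains -- can be flagged mechanically from the logs. A sketch, with an illustrative threshold (chains containing errors are candidates for silent failures and worth checking against the final answer the agent produced):

```typescript
// Illustrative log shape; mirrors the fields needed from ToolCallLog
interface LoggedCall {
  traceId: string;
  tool: string;
  error?: string;
}

function flagSuspiciousChains(logs: LoggedCall[], spiralThreshold = 10) {
  // Group calls by trace ID to examine each chain as a whole
  const byTrace = new Map<string, LoggedCall[]>();
  for (const log of logs) {
    byTrace.set(log.traceId, [...(byTrace.get(log.traceId) ?? []), log]);
  }
  const flags: { traceId: string; reasons: string[] }[] = [];
  byTrace.forEach((calls, traceId) => {
    const reasons: string[] = [];
    if (calls.length >= spiralThreshold) reasons.push("tool call spiral");
    if (calls.some(c => c.error !== undefined)) reasons.push("tool errors in chain");
    if (reasons.length > 0) flags.push({ traceId, reasons });
  });
  return flags;
}
```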
Building a Dashboard
At minimum, your agent dashboard should show:
- Total tool calls per hour/day with breakdown by tool
- P50 and P99 latency for each tool
- Error rate trends
- Cost per user request (average and outliers)
- Most common queries and their result quality scores
You do not need a dedicated observability platform to start. A structured JSON log piped to any time-series database gives you everything above. Start collecting data before you need it. Always include a trace ID that links every tool call back to the original user request -- without it, you cannot reconstruct what happened.
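As a sketch of how the latency percentiles and error rates above can be computed from an array of log records (nearest-rank percentiles; the record shape mirrors the ToolCallLog fields needed here):

```typescript
// Only the fields needed for dashboard stats
interface CallStatsInput {
  tool: string;
  latencyMs: number;
  error?: string;
}

// Nearest-rank percentile on a pre-sorted ascending array
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function toolStats(logs: CallStatsInput[]) {
  // Group by tool, then compute P50/P99 latency and error rate per tool
  const byTool = new Map<string, CallStatsInput[]>();
  for (const log of logs) {
    byTool.set(log.tool, [...(byTool.get(log.tool) ?? []), log]);
  }
  return Array.from(byTool.entries()).map(([tool, calls]) => {
    const latencies = calls.map(c => c.latencyMs).sort((a, b) => a - b);
    return {
      tool,
      calls: calls.length,
      p50: percentile(latencies, 50),
      p99: percentile(latencies, 99),
      errorRate: calls.filter(c => c.error !== undefined).length / calls.length
    };
  });
}
```

Run this over a rolling window and you have the per-tool latency and error-rate panels; the chain summaries keyed by trace ID cover the cost panel.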