I Tracked What AI Agents Actually Do When Nobody Is Watching
Observability for AI agents in production -- tracking tool calls, data quality, latency, and silent failures.
You deploy an AI agent that calls external tools -- search APIs, databases, third-party services. It works in development. Then in production, it starts returning bad answers, burning through API credits, or silently failing on edge cases. You have no idea why because you never instrumented the tool calls.
Observability for AI agents is not optional. This post covers what to track, how to structure your logs, and what patterns to watch for when your agent makes real-time search calls in production.
What Makes Agent Observability Different
Traditional application monitoring tracks request/response cycles. Agent observability tracks reasoning chains. A single user request might trigger 5-10 tool calls, each with its own latency, cost, and failure mode. You need to see the full chain, not just the final output.
The key metrics for agent tool calls are:
- Tool call frequency: which tools get called and how often
- Latency per call: how long each tool takes to respond
- Data quality: whether the returned data was useful to the agent
- Error rate: timeouts, rate limits, malformed responses
- Cost per chain: total API credits consumed per user request
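Most of these metrics fall out of structured logs once every call carries a trace ID. A minimal sketch of grouping per-call records into per-request chains (the record shape and the costUsd field are illustrative assumptions; cost would come from your provider's pricing for each tool):

```typescript
// Illustrative log record -- not a real library type
interface CallRecord {
  traceId: string;
  tool: string;
  latencyMs: number;
  costUsd: number; // assumed: derived from your provider's pricing
}

function summarizeChains(logs: CallRecord[]) {
  // Group per-call records by trace ID to reconstruct each chain
  const chains = new Map<string, CallRecord[]>();
  for (const log of logs) {
    chains.set(log.traceId, [...(chains.get(log.traceId) ?? []), log]);
  }
  // One summary row per user request: call count, total latency, total cost
  return Array.from(chains.entries()).map(([traceId, calls]) => ({
    traceId,
    callCount: calls.length,
    totalLatencyMs: calls.reduce((sum, c) => sum + c.latencyMs, 0),
    totalCostUsd: calls.reduce((sum, c) => sum + c.costUsd, 0)
  }));
}
```

This is also where cost spikes become visible: sort the summary rows by totalCostUsd and the expensive outlier chains surface immediately.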
Wrapping Tool Calls With Telemetry
The simplest approach is a wrapper function that logs every tool call before and after execution:
interface ToolCallLog {
  traceId: string;
  tool: string;
  input: Record<string, unknown>;
  output: unknown;
  latencyMs: number;
  error?: string;
  timestamp: string;
}
async function trackedToolCall(
  traceId: string,
  tool: string,
  input: Record<string, unknown>,
  execute: () => Promise<unknown>
): Promise<unknown> {
  const start = Date.now();
  try {
    const result = await execute();
    const log: ToolCallLog = {
      traceId,
      tool,
      input,
      output: result,
      latencyMs: Date.now() - start,
      timestamp: new Date().toISOString()
    };
    await persistLog(log);
    return result;
  } catch (err) {
    const log: ToolCallLog = {
      traceId,
      tool,
      input,
      output: null,
      latencyMs: Date.now() - start,
      error: String(err),
      timestamp: new Date().toISOString()
    };
    await persistLog(log);
    throw err;
  }
}

Tracking Search Quality
When your agent calls a search API like Scavio, the raw response is structured JSON with organic results, knowledge graph data, and related questions. But not every search returns useful data. Track these quality signals:
- Result count: did the search return enough results?
- Relevance: did the agent use the results or ignore them?
- Retries: did the agent rephrase the query and search again?
- Fallbacks: did the agent switch to a different tool?
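The retry and fallback signals have to be derived from the call sequence itself rather than from any single response. A sketch, assuming each call in a chain was logged with its tool name and query (the record shape is illustrative):

```typescript
// Illustrative shape for one logged call within a single chain
interface SearchCallRecord {
  tool: string;  // e.g. "web_search", or another tool used as a fallback
  query: string;
}

function countRetrySignals(chain: SearchCallRecord[], searchTool = "web_search") {
  const searches = chain.filter(c => c.tool === searchTool);
  const distinctQueries = new Set(searches.map(c => c.query.trim().toLowerCase()));
  return {
    searchCount: searches.length,
    // More searches than distinct queries means at least one exact repeat
    exactRepeats: searches.length - distinctQueries.size,
    // More than one distinct query means the agent rephrased and retried
    rephrases: Math.max(0, distinctQueries.size - 1),
    // Calls to any other tool in the same chain count as fallbacks here
    fallbackCalls: chain.length - searches.length
  };
}
```

The static per-response signals can then be assessed separately: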
function assessSearchQuality(query: string, results: any) {
  return {
    query,
    resultCount: results.organic?.length ?? 0,
    hasKnowledgeGraph: !!results.knowledgeGraph,
    hasPeopleAlsoAsk: (results.peopleAlsoAsk?.length ?? 0) > 0,
    // Guard against a missing snippet before comparing its length
    topResultRelevance: (results.organic?.[0]?.snippet?.length ?? 0) > 50
  };
}

Detecting Failure Patterns
After collecting logs for a few days, look for these patterns:
- Repeated searches: the agent searches for the same thing multiple times in one chain, indicating the results were not useful
- Tool call spirals: the agent calls 10+ tools for a simple question, suggesting a prompt or parsing issue
- Silent failures: the tool returns an error but the agent generates a confident-sounding answer anyway
- Cost spikes: certain user queries trigger disproportionately expensive tool call chains
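Two of these patterns -- spirals and error-containing chains -- can be flagged mechanically from the logs. A sketch, with an illustrative threshold (chains containing errors are candidates for silent failures and worth checking against the final answer the agent produced):

```typescript
// Illustrative log shape; mirrors the fields needed from ToolCallLog
interface LoggedCall {
  traceId: string;
  tool: string;
  error?: string;
}

function flagSuspiciousChains(logs: LoggedCall[], spiralThreshold = 10) {
  // Group calls by trace ID to examine each chain as a whole
  const byTrace = new Map<string, LoggedCall[]>();
  for (const log of logs) {
    byTrace.set(log.traceId, [...(byTrace.get(log.traceId) ?? []), log]);
  }
  const flags: { traceId: string; reasons: string[] }[] = [];
  byTrace.forEach((calls, traceId) => {
    const reasons: string[] = [];
    if (calls.length >= spiralThreshold) reasons.push("tool call spiral");
    if (calls.some(c => c.error !== undefined)) reasons.push("tool errors in chain");
    if (reasons.length > 0) flags.push({ traceId, reasons });
  });
  return flags;
}
```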
Building a Dashboard
At minimum, your agent dashboard should show:
- Total tool calls per hour/day with breakdown by tool
- P50 and P99 latency for each tool
- Error rate trends
- Cost per user request (average and outliers)
- Most common queries and their result quality scores
You do not need a dedicated observability platform to start. A structured JSON log piped to any time-series database gives you everything above. Start collecting data before you need it. Always include a trace ID that links every tool call back to the original user request -- without it, you cannot reconstruct what happened.
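As a sketch of how the latency percentiles and error rates above can be computed from an array of log records (nearest-rank percentiles; the record shape mirrors the ToolCallLog fields needed here):

```typescript
// Only the fields needed for dashboard stats
interface CallStatsInput {
  tool: string;
  latencyMs: number;
  error?: string;
}

// Nearest-rank percentile on a pre-sorted ascending array
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

function toolStats(logs: CallStatsInput[]) {
  // Group by tool, then compute P50/P99 latency and error rate per tool
  const byTool = new Map<string, CallStatsInput[]>();
  for (const log of logs) {
    byTool.set(log.tool, [...(byTool.get(log.tool) ?? []), log]);
  }
  return Array.from(byTool.entries()).map(([tool, calls]) => {
    const latencies = calls.map(c => c.latencyMs).sort((a, b) => a - b);
    return {
      tool,
      calls: calls.length,
      p50: percentile(latencies, 50),
      p99: percentile(latencies, 99),
      errorRate: calls.filter(c => c.error !== undefined).length / calls.length
    };
  });
}
```

Run this over a rolling window and you have the per-tool latency and error-rate panels; the chain summaries keyed by trace ID cover the cost panel.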