Your agent is skipping its tools, and your latency dashboard loves it

An agent skips its tools when it answers from parametric memory instead of calling the search or knowledge-base tool you gave it. The skipped call is faster, returns no error, and shows up green on your latency dashboard. It's also wrong, because the model guessed instead of grounding. The fix is deterministic: log every tool-call trace, assert the tool was actually called for queries that need fresh data, and fail the eval when no call happened.

Why latency dashboards hide skipped tool calls

A tool call adds round-trip time. Hitting POST /api/v1/google, waiting for results, then feeding them back to the model costs you a few hundred milliseconds and a credit. When the agent skips that and answers straight from its weights, the request is faster and cheaper. Your p50 drops. Error rate stays at zero, because guessing isn't an error the runtime can see.

So the dashboard rewards the exact failure you care about. A fact query that should have triggered a search but didn't looks identical to a chit-chat turn that legitimately needed no tool. Latency, error rate, and token count can't tell them apart. The signal you need is whether the tool fired, and that lives in the trace, not the metrics.

Validate tool use deterministically, not with an LLM judge

For the question "did the agent call the tool," a deterministic check beats an LLM judge. The judge is another model that can hallucinate, costs tokens, and is slow. But you already have ground truth: the tool-call trace. Either a call to your search tool is in the trace or it isn't. Assert on that.

The stronger check has two parts. First, assert the tool was called. Second, assert the final answer actually references something the tool returned, so the agent can't call the tool and then ignore it. With Scavio's Google endpoint you get back organic results with real URLs, so you can check the answer cites one of them.

Python

import requests

API_KEY = "sk-your-key"

def search(query):
    r = requests.post(
        "https://api.scavio.dev/api/v1/google",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "light_request": False},
    )
    r.raise_for_status()
    return r.json()

def assert_grounded(query, agent_answer, tool_calls):
    # 1) deterministic: the search tool was actually called
    assert any(c["name"] == "search" for c in tool_calls), \
        f"agent skipped search for fact query: {query!r}"

    # 2) deterministic: the answer cites a URL the tool returned
    results = search(query)
    returned_urls = [item["link"] for item in results.get("organic", [])]
    assert any(url in agent_answer for url in returned_urls), \
        "answer does not reference any retrieved source"

# example eval case
assert_grounded(
    query="latest stable python release",
    agent_answer="Python 3.13 is current. Source: https://www.python.org/downloads/",
    tool_calls=[{"name": "search", "args": {"query": "latest stable python release"}}],
)

No model in the loop. The check passes or fails on facts you can see.

Force grounding for fact queries

Detecting a skipped call after the fact is good for evals. Preventing it in production is better. For queries you know require fresh data, force the search call instead of leaving it to the model's judgment.

Most agent frameworks expose tool_choice. Set it to required (or name your search tool explicitly) for the fact-query path, so the model must hit POST /api/v1/google and answer from real organic results before it can respond. You're not asking the agent to decide whether grounding matters. You're deciding for it on the cases where you already know the answer can't come from training data.

This trades a little latency and a credit per call for an answer that's actually current. On a fact path, that's the trade you want every time.

The honest limitation

Deterministic tool-call validation beats an LLM judge for "did it call the tool," but it doesn't tell you which queries needed a tool in the first place. Someone has to label that. Your eval set needs cases marked as requiring a tool, so the assertion knows when a missing call is a bug versus when no tool was correctly needed.

That labeling is the real work. Build a set of fact queries where the right answer depends on current data, mark each as tool-required, and run the two-part assertion against every one. The deterministic check is cheap and exact once you have the labels. Getting the labels is the part no automation does for you.

Why latency dashboards hide skipped tool calls

Validate tool use deterministically, not with an LLM judge

Python

import requests

API_KEY = "sk-your-key"

def search(query):
    r = requests.post(
        "https://api.scavio.dev/api/v1/google",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "light_request": False},
    )
    r.raise_for_status()
    return r.json()

def assert_grounded(query, agent_answer, tool_calls):
    # 1) deterministic: the search tool was actually called
    assert any(c["name"] == "search" for c in tool_calls), \
        f"agent skipped search for fact query: {query!r}"

    # 2) deterministic: the answer cites a URL the tool returned
    results = search(query)
    returned_urls = [item["link"] for item in results.get("organic", [])]
    assert any(url in agent_answer for url in returned_urls), \
        "answer does not reference any retrieved source"

# example eval case
assert_grounded(
    query="latest stable python release",
    agent_answer="Python 3.13 is current. Source: https://www.python.org/downloads/",
    tool_calls=[{"name": "search", "args": {"query": "latest stable python release"}}],
)

No model in the loop. The check passes or fails on facts you can see.

Force grounding for fact queries

This trades a little latency and a credit per call for an answer that's actually current. On a fact path, that's the trade you want every time.

The honest limitation

Your agent is skipping its tools, and your latency dashboard loves it

Why latency dashboards hide skipped tool calls

Validate tool use deterministically, not with an LLM judge

Force grounding for fact queries

The honest limitation

Continue reading

Your LLM Visibility Tracker Only Watches the Prompts You Gave It

Native LLM web search vs a search API tool: when to use each (2026)

Your agent is skipping its tools, and your latency dashboard loves it

Why latency dashboards hide skipped tool calls

Validate tool use deterministically, not with an LLM judge

Force grounding for fact queries

The honest limitation

Continue reading

Your LLM Visibility Tracker Only Watches the Prompts You Gave It

Native LLM web search vs a search API tool: when to use each (2026)