
LLM Failure Detection with Search Verification

Verify LLM claims against live search results. Extract assertions, search for each claim, flag mismatches before hallucinations reach users.


LLM hallucinations (fabricated pricing, invented features, outdated version numbers) erode user trust and can cause real harm in production. Automated detection verifies LLM-generated claims against live search results before the output reaches users. The approach: extract factual assertions, search for each claim, flag mismatches.

Why Search-Based Verification Works

Traditional hallucination detection relies on the model's own confidence scores, which are unreliable. A model can be confidently wrong. Search-based verification is external: if the LLM says "Tavily costs $50/mo" and a Google search for "Tavily pricing 2026" returns results saying "$30/mo", that is a verified mismatch regardless of the model's confidence.

The Verification Pipeline

Python
import requests, os

# Shared auth header for every verification request
H = {"x-api-key": os.environ["SCAVIO_API_KEY"]}

def verify_claim(claim_text, search_query, expected_value):
    """Verify a factual claim against live search results."""
    # Run the search and parse the JSON response
    r = requests.post("https://api.scavio.dev/api/v1/search", headers=H,
        json={"platform": "google", "query": search_query},
        timeout=10).json()
    # Concatenate snippets from the top 3 organic results into one evidence string
    snippets = " ".join(
        s.get("snippet", "") for s in r.get("organic", [])[:3])
    # Case-insensitive substring check: does the expected value appear in the evidence?
    verified = expected_value.lower() in snippets.lower()
    return {
        "claim": claim_text,
        "verified": verified,
        "evidence": snippets[:300],  # short excerpt for auditing
    }

claims = [
    {"text": "Tavily costs $30/mo",
     "query": "Tavily pricing 2026",
     "expected": "$30"},
    {"text": "Firecrawl Hobby plan is $16/mo",
     "query": "Firecrawl pricing 2026",
     "expected": "$16"},
]

for c in claims:
    result = verify_claim(c["text"], c["query"], c["expected"])
    status = "PASS" if result["verified"] else "FAIL"
    print(f"[{status}] {result['claim']}")
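The exact-substring check in verify_claim is deliberately simple, and it will miss paraphrases: "$30 per month" in a snippet does not contain the literal string "$30/mo". For pricing claims, one slightly more forgiving approach is to compare extracted dollar amounts instead of raw strings. The sketch below is an illustration, not part of any API; normalize_prices and price_matches are hypothetical helpers.

Python
import re

def normalize_prices(text):
    """Extract dollar amounts from text as a set of floats."""
    # Matches "$30", "$30.00", "$1,299" and similar forms
    return {float(m.replace(",", ""))
            for m in re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)}

def price_matches(expected, snippets):
    """True if any dollar amount in the snippets equals the expected amount."""
    return bool(normalize_prices(expected) & normalize_prices(snippets))

print(price_matches("$30/mo", "Tavily now charges $30 per month"))  # True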

Cold Start for LLM Failure Data

Building a dataset of known LLM failures is a cold-start problem. Instead of waiting for community contributions, use search API results as automated ground truth against LLM outputs. If the LLM says "library X has function Y" and a search of the current docs says otherwise, that is a verified failure data point. Automated verification scales better than manual crowdsourcing for seeding the initial dataset.
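A minimal sketch of that loop, reusing verify_claim from the pipeline above: verify each claim and append the failures to a JSONL file that seeds the dataset. The file name and record shape here are illustrative assumptions, not a prescribed format.

Python
import json
from datetime import datetime, timezone

def record_failures(claims, path="llm_failures.jsonl"):
    """Append claims that fail verification to a JSONL seed dataset."""
    with open(path, "a") as f:
        for c in claims:
            result = verify_claim(c["text"], c["query"], c["expected"])
            if not result["verified"]:
                # Store the failed claim with its contradicting evidence
                f.write(json.dumps({
                    "claim": result["claim"],
                    "evidence": result["evidence"],
                    "checked_at": datetime.now(timezone.utc).isoformat(),
                }) + "\n")

record_failures(claims)  # claims list from the pipeline example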

What to Verify

  • Pricing claims: any dollar amount attributed to a specific product or service
  • Version numbers: any specific version (v3.2, 2.0.1) claimed for software
  • Feature claims: "X supports Y" or "X integrates with Y"
  • Date-sensitive facts: "as of 2026" or "released in Q1"
  • Comparative claims: "X is faster/cheaper/better than Y"
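Extracting these claims can start simple. The sketch below pulls pricing and version claims out of raw LLM output with regular expressions; the patterns and query templates are illustrative assumptions and will need tuning for real output.

Python
import re

# Illustrative patterns for two of the claim types above
PATTERNS = {
    "pricing": re.compile(r"([A-Z][\w.]*) (?:costs?|is) (\$[\d,]+(?:\.\d+)?)"),
    "version": re.compile(r"([A-Z][\w.]*) v?(\d+\.\d+(?:\.\d+)?)"),
}

def extract_claims(llm_output):
    """Turn raw LLM text into claim dicts ready for verify_claim."""
    claims = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(llm_output):
            subject, value = m.group(1), m.group(2)
            query = (f"{subject} pricing 2026" if kind == "pricing"
                     else f"{subject} latest version")
            claims.append({"text": m.group(0), "query": query, "expected": value})
    return claims

print(extract_claims("Tavily costs $30/mo and supports Python 3.12."))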

The cost of verification is low: 1-3 search queries per claim at $0.005/credit. For a typical LLM response with 5 factual claims, verification costs $0.025-$0.075. Compare that to the cost of a customer making a decision based on a hallucinated pricing comparison.