
Crowdsourced LLM Failure Data: Solving the Cold Start

Build LLM hallucination datasets automatically. Pipe search results as ground truth against LLM outputs instead of waiting for community contributions.


The cold start problem for LLM failure data can be solved with output that LLMs and autonomous coding agents already generate. Instead of waiting for community contributions to build a hallucination dataset, pipe search API results as ground truth against LLM outputs. Automated verification scales better than manual crowdsourcing for the initial dataset.

The Cold Start Problem

Crowdsourced LLM failure datasets need contributions to be useful, but they need to be useful to attract contributions. A 68-cycle dataset is a solid seed, but growing it to thousands of verified failures requires either a large community or automated data generation. Automated verification solves the chicken-and-egg problem by generating verified failure data points without community participation.

Search as Ground Truth

The approach: prompt the LLM with factual questions, capture its answers, then verify each answer against live search results. When the LLM says "library X has function Y" and a search for current docs says otherwise, that is a verified failure data point with both the hallucination and the correct answer.

Python
import json
import os
from datetime import date

import requests

# Scavio search API key, read from the environment
H = {"x-api-key": os.environ["SCAVIO_API_KEY"]}

def generate_failure_data(llm_claim, search_query):
    """Pair an LLM claim with live search evidence.
    The mismatch judgment happens downstream; this step only collects
    the claim, the top snippets, and the source links."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers=H,
        json={"platform": "google", "query": search_query},
        timeout=10)
    resp.raise_for_status()
    top = resp.json().get("organic", [])[:3]  # top three organic results
    return {
        "llm_claim": llm_claim,
        "search_query": search_query,
        "search_evidence": " ".join(s.get("snippet", "") for s in top)[:500],
        "sources": [s.get("link") for s in top],
        "timestamp": date.today().isoformat(),
    }

test_claims = [
    {"claim": "requests-html latest version is 0.10.0",
     "query": "requests-html latest version pypi 2026"},
    {"claim": "Tavily free tier is 500 searches/mo",
     "query": "Tavily free tier 2026"},
]

for tc in test_claims:
    record = generate_failure_data(tc["claim"], tc["query"])
    print(json.dumps(record, indent=2))
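The records above pair each claim with evidence but do not yet decide whether the two disagree. Below is a minimal sketch of that judging step using a keyword-overlap heuristic; the is_probable_mismatch helper and its 0.2 threshold are illustrative stand-ins for a stricter check such as an LLM judge or an exact version comparison.

Python
import re

def _terms(text):
    """Lowercased word/number tokens, keeping dots so version strings survive."""
    return set(re.findall(r"[a-z0-9.]+", text.lower()))

def is_probable_mismatch(record, overlap_threshold=0.2):
    """Flag a record when the claim's key terms barely appear in the
    search evidence. Crude on purpose: a production pipeline would use
    an LLM judge or compare exact version/price values."""
    claim_terms = _terms(record["llm_claim"])
    if not claim_terms:
        return False
    overlap = len(claim_terms & _terms(record["search_evidence"])) / len(claim_terms)
    return overlap < overlap_threshold

# Keep only the records where the evidence fails to support the claim
verified_failures = [r for r in (generate_failure_data(tc["claim"], tc["query"])
                                 for tc in test_claims) if is_probable_mismatch(r)]
print(f"{len(verified_failures)} probable failures captured")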

Scaling Beyond Manual Curation

The automated approach generates 100-500 verified data points per hour at $0.005/search. Run it against a list of commonly hallucinated topics (library versions, API pricing, feature availability) and you build a dataset that would take weeks of manual curation. The search results serve as both ground truth and evidence, making each data point self-documenting.
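A rough sketch of that batch run, assuming the helpers above; the claim/query pairs and the JSONL path are placeholders, not a published benchmark.

Python
import json
import time

# Placeholder claim/query pairs; a real run would cover hundreds of
# library versions, API prices, and feature-availability claims.
commonly_hallucinated = [
    {"claim": "httpx latest stable version is 0.24.0",
     "query": "httpx latest version pypi 2026"},
    {"claim": "Serper free tier is 2500 searches/mo",
     "query": "Serper free tier 2026"},
]

with open("failure_dataset.jsonl", "a") as out:
    for topic in commonly_hallucinated:
        record = generate_failure_data(topic["claim"], topic["query"])
        record["probable_mismatch"] = is_probable_mismatch(record)
        out.write(json.dumps(record) + "\n")
        time.sleep(1)  # polite throttling between search calls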

MCP Integration for Continuous Collection

For MCP-compatible agents (Claude Code, Cursor), the collection can happen passively. When the agent makes a factual claim during a coding session, a verification hook searches for the claim and logs mismatches. Over time, this builds an organic failure dataset from real coding agent sessions without manual effort.