
Scoped Data Beats More Data for AI Agents

Agents with access to everything produce unfocused results. Scoping to 3-5 high-quality sources improves accuracy and reduces token waste.

8 min read

Giving an AI agent access to every data source available produces worse results than scoping it to 3-5 relevant sources. More data means more noise, higher token costs, longer latencies, and increased hallucination risk when the model tries to reconcile contradictory information from too many sources.

Why more data hurts agent performance

LLMs have finite context windows and finite attention. When you dump 20 search results from 5 platforms into context, the model spends most of its reasoning capacity deciding what to ignore. Irrelevant results dilute the signal. Contradictory results confuse the synthesis. The model defaults to hedging or averaging instead of giving a clear answer.

Research on retrieval-augmented generation consistently shows that precision (percentage of retrieved results that are relevant) matters more than recall (percentage of relevant results that are retrieved). Five highly relevant results outperform twenty mixed-quality results.

The noise problem in practice

Consider an agent tasked with analyzing a competitor's pricing. With unscoped access, it might pull:

  • The competitor's current pricing page (relevant)
  • A 2024 blog post about their old pricing (misleading)
  • A Reddit thread complaining about a price increase (partially relevant)
  • An affiliate review with inaccurate pricing (harmful)
  • A job posting mentioning the competitor (irrelevant)
  • A news article about the competitor's funding (tangential)

Only 1-2 of these results directly answer the question. The rest are noise that the model must filter. With scoped retrieval, you direct the agent to search the competitor's domain and one authoritative review site, returning 3 highly relevant results.
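The gap is easy to quantify with the precision metric from the previous section. A toy calculation using the counts above (treating 2 of the 6 unscoped results as relevant):

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of retrieved results that are relevant to the question."""
    return relevant_retrieved / total_retrieved

# Unscoped pull: roughly 2 of the 6 results bear on current pricing
unscoped = precision(2, 6)  # ~0.33

# Scoped retrieval: 3 of 3 results are on-topic
scoped = precision(3, 3)  # 1.0
```

Every point of precision you give up is context the model must spend attention filtering rather than synthesizing.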

Scoped architecture pattern

Python
import requests, os

def scoped_search(query, scope):
    """Search within a defined scope of sources."""
    headers = {"x-api-key": os.environ["SCAVIO_API_KEY"]}
    base = "https://api.scavio.dev/api/v1/search"

    results = []
    for source in scope:
        scoped_query = f"{query} site:{source}" if source else query
        resp = requests.post(
            base, headers=headers,
            json={"query": scoped_query, "num_results": 2},
            timeout=10,  # avoid hanging the agent on a slow upstream
        )
        for r in resp.json().get("organic_results", []):
            r["source_scope"] = source
            results.append(r)

    return results

# Example: competitor pricing analysis
pricing_scope = [
    "competitor.com",       # Primary source
    "g2.com",               # Verified review site
    "reddit.com/r/SaaS",    # Community discussion
]
results = scoped_search("competitor pricing plans 2026", pricing_scope)

# Example: technical evaluation
tech_scope = [
    "docs.library.io",      # Official docs
    "github.com",           # Source code and issues
    "stackoverflow.com",    # Community answers
]
results = scoped_search("library X migration guide v3", tech_scope)

How to choose the right 3-5 sources

Source selection depends on the task type:

  • Product research: official site + 1 review aggregator + 1 community forum
  • Technical questions: official docs + GitHub + Stack Overflow
  • Market analysis: industry reports site + news source + competitor sites
  • Lead generation: LinkedIn (via search) + company site + industry directory
  • Trend tracking: Reddit + TikTok + Google Trends (via search)

Token cost impact

Scoped retrieval directly reduces token usage:

  • Unscoped: 20 results x 200 tokens = 4,000 tokens of context
  • Scoped: 6 results x 200 tokens = 1,200 tokens of context
  • Savings: 70% fewer tokens per search-grounded query

At Claude Sonnet 4 rates ($3/1M input tokens), that is $0.0084 saved per query. Over 10K queries/month, scoped retrieval saves $84/month in LLM costs alone, plus you get better answers.
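That arithmetic, as a small sketch you can re-run with your own result counts and per-token rates (the $3/1M figure is the input rate quoted above; 200 tokens per result is the working estimate from the bullets):

```python
def search_context_cost(num_results, tokens_per_result=200, usd_per_mtok=3.0):
    """Estimate the input-token footprint and cost of search results in context."""
    tokens = num_results * tokens_per_result
    return tokens, tokens / 1_000_000 * usd_per_mtok

unscoped_tokens, unscoped_cost = search_context_cost(20)  # 4,000 tokens
scoped_tokens, scoped_cost = search_context_cost(6)       # 1,200 tokens

saved_per_query = unscoped_cost - scoped_cost             # $0.0084
monthly_savings = saved_per_query * 10_000                # $84 at 10K queries/month
```

The absolute dollar amounts are small at this scale; the latency and accuracy wins from a 70% smaller context are usually the bigger payoff.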

Dynamic scoping with intent detection

Python
import requests, os

SCOPE_MAP = {
    "pricing": ["competitor.com", "g2.com", "capterra.com"],
    "technical": ["docs.python.org", "github.com", "stackoverflow.com"],
    "hiring": ["linkedin.com", "glassdoor.com", "levels.fyi"],
    "default": [],  # Unscoped web search as fallback
}

def detect_intent(query):
    """Simple keyword-based intent detection."""
    query_lower = query.lower()
    if any(w in query_lower for w in ["price", "cost", "plan", "tier"]):
        return "pricing"
    if any(w in query_lower for w in ["how to", "error", "install", "api"]):
        return "technical"
    if any(w in query_lower for w in ["hire", "salary", "jobs"]):
        return "hiring"
    return "default"

def smart_search(query, num_per_source=2):
    """Automatically scope search based on query intent."""
    intent = detect_intent(query)
    scope = SCOPE_MAP[intent]
    headers = {"x-api-key": os.environ["SCAVIO_API_KEY"]}
    base = "https://api.scavio.dev/api/v1/search"

    if not scope:
        resp = requests.post(
            base, headers=headers,
            json={"query": query, "num_results": 5},
            timeout=10,
        )
        return resp.json().get("organic_results", [])

    results = []
    for source in scope:
        resp = requests.post(
            base, headers=headers,
            json={"query": f"{query} site:{source}", "num_results": num_per_source},
            timeout=10,
        )
        results.extend(resp.json().get("organic_results", []))
    return results
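Because each source in the scope is queried separately, the merged list can contain the same URL more than once. A small dedup pass keeps the context tight; it assumes each result dict carries a `link` URL field (an assumption about this API's payload, in keeping with common SERP response shapes):

```python
def dedup_results(results):
    """Drop results whose URL was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in results:
        url = r.get("link")  # assumed URL field on each result dict
        if url in seen:
            continue
        if url is not None:
            seen.add(url)
        unique.append(r)
    return unique

merged = [
    {"link": "https://a.io/pricing"},
    {"link": "https://b.io/review"},
    {"link": "https://a.io/pricing"},  # duplicate across two scoped queries
]
unique = dedup_results(merged)  # keeps the first two entries only
```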

When unscoped search is appropriate

Scoped search is not always better. Use unscoped search when:

  • The query is exploratory and you do not know which sources are relevant
  • You are discovering new information rather than validating existing knowledge
  • The topic is niche enough that scoping would return zero results

For most production agent tasks, though, the query intent is known in advance. Scope the retrieval to match the intent, and your agent will be faster, cheaper, and more accurate.
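The two modes can also be combined: try the scoped search first, and widen to a single unscoped query only when the scope comes back empty. A sketch of that fallback pattern, written as pure orchestration logic so `search_fn` can stand in for whatever search call your stack uses (such as the wrapper shown earlier):

```python
def search_with_fallback(query, scope, search_fn, min_results=1):
    """Scoped search first; widen to one unscoped query if the scope is empty.

    search_fn(query, num_results) is any callable returning a list of result
    dicts -- e.g. a thin wrapper around the search API used earlier.
    """
    results = []
    for source in scope:
        results.extend(search_fn(f"{query} site:{source}", 2))
    if len(results) >= min_results:
        return results
    # Niche topic: the scope returned nothing, so fall back to the open web
    return search_fn(query, 5)

# Usage sketch with a stub standing in for a real API call
def stub_search(q, n):
    return [] if "site:" in q else [{"title": "hit"}]

hits = search_with_fallback("obscure niche topic", ["docs.example.io"], stub_search)
# the scope came back empty, so the unscoped fallback supplied the result
```

This keeps the scoped path as the default while covering the niche-topic case above without a separate code path in the agent.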