
Token Budget Search: Save 40% on LLM Costs

Raw search results dumped into LLM context waste 40-60% of tokens. Implement token budgets to cut costs without losing answer quality.


Dumping raw search results into an LLM context window wastes 40-60% of your token budget on boilerplate: navigation menus, footers, duplicate snippets, HTML artifacts, and irrelevant metadata. At $3-15 per million input tokens for frontier models in 2026, this waste adds up fast. Implementing a token budget for search results can cut your LLM inference costs by 30-50% with no quality loss.

The math on wasted tokens

A typical search API returns 10 results with titles, URLs, snippets, and metadata. Raw, that is roughly 2,000-3,000 tokens. If you also scrape the pages and include full content, you are looking at 15,000-50,000 tokens per search. Most of that content is irrelevant to the actual question. A well-implemented token budget extracts only the useful parts and truncates to a fixed ceiling.
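To see where those tokens go, here is a minimal sketch using the common four-characters-per-token heuristic. The sample payload below is illustrative, not a real API response:

```python
import json

# Illustrative mock of one search result as an API might return it
result = {
    "position": 1,
    "title": "Best time-series databases compared",
    "link": "https://example.com/ts-db-comparison",
    "displayed_link": "example.com > blog > ts-db-comparison",
    "snippet": "InfluxDB, TimescaleDB, and QuestDB lead for ingest-heavy workloads...",
    "cached_page_link": "https://example.com/cache/abc123",
    "source": "Example Blog",
}

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return len(text) // 4

raw_tokens = estimate_tokens(json.dumps(result, indent=2))
useful_tokens = estimate_tokens(f"- {result['title']}: {result['snippet']}")
print(f"raw: ~{raw_tokens} tokens, useful: ~{useful_tokens} tokens")
# The title and snippet carry the signal; the URLs, position,
# and cache links are structural overhead the LLM does not need.
```

Multiply that overhead by 10 results per search and the waste is obvious even before you add scraped page content.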

Before: raw results dumped into context

Python
import requests, os, json

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def naive_search_context(query: str) -> str:
    """The wasteful approach: dump everything into context."""
    resp = requests.post(URL, headers=H,
        json={'platform': 'google', 'query': query}, timeout=15)
    data = resp.json()

    # Dumping the entire response as context
    raw_context = json.dumps(data, indent=2)
    token_estimate = len(raw_context) // 4  # rough estimate
    print(f"Raw context: ~{token_estimate} tokens")
    return raw_context

# This typically produces 2,500-4,000 tokens of context
# ~60% is metadata, URLs, and structure the LLM does not need
context = naive_search_context('best database for time series data 2026')

After: token-budgeted extraction

Python
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def budgeted_search_context(query: str, token_budget: int = 800) -> str:
    """Extract only essential info, truncate to budget."""
    resp = requests.post(URL, headers=H,
        json={'platform': 'google', 'query': query}, timeout=15)
    results = resp.json().get('organic_results', [])

    context_parts = []
    tokens_used = 0

    for r in results:
        # Extract only title and snippet (skip URL, metadata, etc.)
        title = r.get('title', '')
        snippet = r.get('snippet', '')
        entry = f"- {title}: {snippet}"
        entry_tokens = len(entry) // 4

        if tokens_used + entry_tokens > token_budget:
            break

        context_parts.append(entry)
        tokens_used += entry_tokens

    context = '\n'.join(context_parts)
    print(f"Budgeted context: ~{tokens_used} tokens ({len(context_parts)} results)")
    return context

# This produces ~600-800 tokens of high-signal context
context = budgeted_search_context('best database for time series data 2026')

Real cost savings

Here is the math for an agent making 100 search-grounded LLM calls per day using Claude Sonnet at $3 per million input tokens:

  • Without budget: 3,500 tokens/search x 100 calls x 30 days = 10.5M tokens/mo = $31.50/mo in input costs
  • With 800-token budget: 800 tokens/search x 100 calls x 30 days = 2.4M tokens/mo = $7.20/mo in input costs
  • Monthly savings: $24.30 (77% reduction in search-context input costs)

For GPT-4o at $5 per million input tokens, the savings are even larger: $52.50/mo without budget vs $12/mo with budget, a saving of $40.50/mo.
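The arithmetic above generalizes to a small helper; the prices are the per-million-input-token figures used in this post, so plug in your own:

```python
def monthly_input_cost(tokens_per_call: int, calls_per_day: int,
                       price_per_million: float, days: int = 30) -> float:
    """Monthly input-token cost in dollars for search-grounded LLM calls."""
    monthly_tokens = tokens_per_call * calls_per_day * days
    return monthly_tokens / 1_000_000 * price_per_million

# Claude Sonnet at $3 per million input tokens, 100 calls/day
without_budget = monthly_input_cost(3_500, 100, 3.0)   # 31.50
with_budget = monthly_input_cost(800, 100, 3.0)        # 7.20
print(f"${without_budget:.2f}/mo vs ${with_budget:.2f}/mo, "
      f"saving ${without_budget - with_budget:.2f}/mo")
```

Rerun it with your own per-call token counts and model pricing to see whether the budget is worth tuning further.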

Advanced: relevance-weighted extraction

Instead of taking results in order, score each result by relevance to the query and include the highest-scoring results within your budget.

Python
import requests, os
from difflib import SequenceMatcher

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def relevance_budgeted_context(query: str, token_budget: int = 800) -> str:
    """Score results by relevance, include best within budget."""
    resp = requests.post(URL, headers=H,
        json={'platform': 'google', 'query': query}, timeout=15)
    results = resp.json().get('organic_results', [])

    # Score each result by query-snippet similarity
    scored = []
    for r in results:
        snippet = r.get('snippet', '')
        score = SequenceMatcher(None, query.lower(), snippet.lower()).ratio()
        scored.append((score, r))

    # Sort by relevance, highest first
    scored.sort(key=lambda x: x[0], reverse=True)

    context_parts = []
    tokens_used = 0
    for score, r in scored:
        entry = f"- {r.get('title', '')}: {r.get('snippet', '')}"
        entry_tokens = len(entry) // 4
        if tokens_used + entry_tokens > token_budget:
            break
        context_parts.append(entry)
        tokens_used += entry_tokens

    return '\n'.join(context_parts)

Including AI Overview as high-signal context

Google AI Overviews are already a summary of the top results. When available, they are the most token-efficient search context. Use the AI Overview as primary context and fall back to organic results only when it is absent.

Python
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def smart_context(query: str, token_budget: int = 800) -> str:
    resp = requests.post(URL, headers=H,
        json={'platform': 'google', 'query': query}, timeout=15)
    data = resp.json()

    # Prefer AI Overview: already summarized, high-signal
    ai_overview = data.get('ai_overview', {}).get('text', '')
    if ai_overview and len(ai_overview) // 4 <= token_budget:
        return f"AI Overview: {ai_overview}"

    # Fallback: organic results with budget
    results = data.get('organic_results', [])
    parts = []
    used = 0
    for r in results:
        entry = f"- {r.get('title', '')}: {r.get('snippet', '')}"
        t = len(entry) // 4
        if used + t > token_budget:
            break
        parts.append(entry)
        used += t
    return '\n'.join(parts)

The tradeoff

Token budgets can cut off relevant information. An aggressive 400-token budget might miss the sixth result that had the actual answer. Start with 800 tokens and monitor answer quality. If your LLM outputs degrade, increase the budget. If they stay the same, decrease it. The sweet spot is different for every use case. The point is to be intentional about how much context you inject, not to minimize at all costs.
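One way to stay intentional is to track how full the budget actually runs per call. A hypothetical sketch (the `BudgetMonitor` class is an assumption for illustration, not part of any API here):

```python
from statistics import mean

class BudgetMonitor:
    """Track how much of the token budget each call consumes."""
    def __init__(self, token_budget: int = 800):
        self.token_budget = token_budget
        self.usage: list[int] = []

    def record(self, tokens_used: int) -> None:
        self.usage.append(tokens_used)

    def fill_rate(self) -> float:
        """Average fraction of the budget consumed per call."""
        return mean(self.usage) / self.token_budget if self.usage else 0.0

monitor = BudgetMonitor(token_budget=800)
for tokens in (780, 800, 620, 795):  # sample per-call usage
    monitor.record(tokens)

# A fill rate near 1.0 means the budget is truncating most calls:
# consider raising it and re-checking answer quality.
print(f"fill rate: {monitor.fill_rate():.2f}")
```

Pair the fill rate with whatever answer-quality signal you already collect, and adjust the budget in the direction the data suggests.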