Token Budget Search: Save 40% on LLM Costs
Raw search results dumped into LLM context waste 40-60% of tokens. Implement token budgets to cut costs without losing answer quality.
Dumping raw search results into an LLM context window wastes 40-60% of your token budget on boilerplate: navigation menus, footers, duplicate snippets, HTML artifacts, and irrelevant metadata. At $3-15 per million input tokens for frontier models in 2026, this waste adds up fast. Implementing a token budget for search results can typically cut your LLM inference costs by 30-50% with little or no loss in answer quality.
The math on wasted tokens
A typical search API returns 10 results with titles, URLs, snippets, and metadata. Raw, that is roughly 2,000-3,000 tokens. If you also scrape the pages and include full content, you are looking at 15,000-50,000 tokens per search. Most of that content is irrelevant to the actual question. A well-implemented token budget extracts only the useful parts and truncates to a fixed ceiling.
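To see the waste concretely, count tokens on a raw response versus a trimmed one. A minimal sketch using the same chars-divided-by-4 heuristic as the examples below; the sample response structure is hypothetical, and for exact counts you would swap in a real tokenizer such as tiktoken:

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use a real tokenizer (e.g. tiktoken) when exact counts matter.
    return len(text) // 4

# Hypothetical raw search response, pared down for illustration
raw_response = {
    'organic_results': [
        {'title': 'TimescaleDB vs InfluxDB', 'url': 'https://example.com/a',
         'snippet': 'A comparison of time series databases...',
         'position': 1, 'displayed_url': 'example.com > blog'},
    ] * 10,
    'search_metadata': {'id': 'abc123', 'status': 'done', 'latency_ms': 412},
}

raw = json.dumps(raw_response, indent=2)
# Keep only the fields the LLM actually needs
trimmed = '\n'.join(
    f"- {r['title']}: {r['snippet']}"
    for r in raw_response['organic_results']
)
print(f"Raw: ~{estimate_tokens(raw)} tokens, trimmed: ~{estimate_tokens(trimmed)} tokens")
```

Even on this toy response, the JSON envelope (URLs, positions, metadata) dwarfs the title-plus-snippet payload the model actually uses.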
Before: raw results dumped into context
import requests, os, json

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def naive_search_context(query: str) -> str:
    """The wasteful approach: dump everything into context."""
    resp = requests.post(URL, headers=H,
                         json={'platform': 'google', 'query': query}, timeout=15)
    data = resp.json()
    # Dumping the entire response as context
    raw_context = json.dumps(data, indent=2)
    token_estimate = len(raw_context) // 4  # rough estimate
    print(f"Raw context: ~{token_estimate} tokens")
    return raw_context

# This typically produces 2,500-4,000 tokens of context
# ~60% is metadata, URLs, and structure the LLM does not need
context = naive_search_context('best database for time series data 2026')

After: token-budgeted extraction
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def budgeted_search_context(query: str, token_budget: int = 800) -> str:
    """Extract only essential info, truncate to budget."""
    resp = requests.post(URL, headers=H,
                         json={'platform': 'google', 'query': query}, timeout=15)
    results = resp.json().get('organic_results', [])
    context_parts = []
    tokens_used = 0
    for r in results:
        # Extract only title and snippet (skip URL, metadata, etc.)
        title = r.get('title', '')
        snippet = r.get('snippet', '')
        entry = f"- {title}: {snippet}"
        entry_tokens = len(entry) // 4
        if tokens_used + entry_tokens > token_budget:
            break
        context_parts.append(entry)
        tokens_used += entry_tokens
    context = '\n'.join(context_parts)
    print(f"Budgeted context: ~{tokens_used} tokens ({len(context_parts)} results)")
    return context

# This produces ~600-800 tokens of high-signal context
context = budgeted_search_context('best database for time series data 2026')

Real cost savings
Here is the math for an agent making 100 search-grounded LLM calls per day using Claude Sonnet at $3 per million input tokens:
- Without budget: 3,500 tokens/search x 100 calls x 30 days = 10.5M tokens/mo = $31.50/mo in input costs
- With 800-token budget: 800 tokens/search x 100 calls x 30 days = 2.4M tokens/mo = $7.20/mo in input costs
- Monthly savings: $24.30 (77% reduction in search-context input costs)
For GPT-4o at $5 per million input tokens, the savings are even larger: $52.50/mo without budget vs $12/mo with budget, a saving of $40.50/mo.
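The arithmetic above generalizes to any model price. A small helper makes it easy to rerun for your own call volumes; the per-million prices are the assumed rates from the examples above, not quoted from any price sheet:

```python
def monthly_input_cost(tokens_per_call: int, calls_per_day: int,
                       price_per_million: float, days: int = 30) -> float:
    """Monthly input-token cost in dollars."""
    total_tokens = tokens_per_call * calls_per_day * days
    return total_tokens / 1_000_000 * price_per_million

# Claude Sonnet at an assumed $3/M input tokens, 100 calls/day
without = monthly_input_cost(3500, 100, 3.0)      # 31.50
with_budget = monthly_input_cost(800, 100, 3.0)   # 7.20
print(f"Without budget: ${without:.2f}/mo, with budget: ${with_budget:.2f}/mo, "
      f"saving ${without - with_budget:.2f} ({1 - with_budget / without:.0%})")
```

Plug in your own tokens-per-call measurement rather than the 3,500-token estimate; actual search responses vary widely by query and platform.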
Advanced: relevance-weighted extraction
Instead of taking results in order, score each result by relevance to the query and include the highest-scoring results within your budget.
import requests, os
from difflib import SequenceMatcher

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def relevance_budgeted_context(query: str, token_budget: int = 800) -> str:
    """Score results by relevance, include best within budget."""
    resp = requests.post(URL, headers=H,
                         json={'platform': 'google', 'query': query}, timeout=15)
    results = resp.json().get('organic_results', [])
    # Score each result by query-snippet similarity
    scored = []
    for r in results:
        snippet = r.get('snippet', '')
        score = SequenceMatcher(None, query.lower(), snippet.lower()).ratio()
        scored.append((score, r))
    # Sort by relevance, highest first
    scored.sort(key=lambda x: x[0], reverse=True)
    context_parts = []
    tokens_used = 0
    for score, r in scored:
        entry = f"- {r.get('title', '')}: {r.get('snippet', '')}"
        entry_tokens = len(entry) // 4
        if tokens_used + entry_tokens > token_budget:
            break
        context_parts.append(entry)
        tokens_used += entry_tokens
    return '\n'.join(context_parts)

Including AI Overview as high-signal context
Google AI Overviews are already a summary of the top results. When available, they are the most token-efficient search context. Use the AI Overview as primary context and fall back to organic results only when it is absent.
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
URL = 'https://api.scavio.dev/api/v1/search'

def smart_context(query: str, token_budget: int = 800) -> str:
    resp = requests.post(URL, headers=H,
                         json={'platform': 'google', 'query': query}, timeout=15)
    data = resp.json()
    # Prefer AI Overview: already summarized, high-signal
    ai_overview = data.get('ai_overview', {}).get('text', '')
    if ai_overview and len(ai_overview) // 4 <= token_budget:
        return f"AI Overview: {ai_overview}"
    # Fallback: organic results with budget
    results = data.get('organic_results', [])
    parts = []
    used = 0
    for r in results:
        entry = f"- {r.get('title', '')}: {r.get('snippet', '')}"
        t = len(entry) // 4
        if used + t > token_budget:
            break
        parts.append(entry)
        used += t
    return '\n'.join(parts)

The tradeoff
Token budgets can cut off relevant information. An aggressive 400-token budget might miss the sixth result that had the actual answer. Start with 800 tokens and monitor answer quality. If your LLM outputs degrade, increase the budget. If they stay the same, decrease it. The sweet spot is different for every use case. The point is to be intentional about how much context you inject, not to minimize at all costs.
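One way to find that sweet spot empirically is to sweep several budgets and compare a quality proxy at each level. A hedged sketch: `build_context` and `score_answer` are placeholders for your own context builder and whatever evaluation you already run (LLM-as-judge, exact-match, human review), not functions from any library:

```python
def sweep_budgets(queries, build_context, score_answer,
                  budgets=(400, 800, 1600, 3200)):
    """Mean answer-quality score per token budget.

    build_context(query, budget) -> context string
    score_answer(query, context) -> float in [0, 1] (your own eval)
    """
    results = {}
    for budget in budgets:
        scores = [score_answer(q, build_context(q, budget)) for q in queries]
        results[budget] = sum(scores) / len(scores)
    return results

def pick_budget(results, tolerance=0.02):
    """Smallest budget whose quality is within `tolerance` of the best."""
    best = max(results.values())
    viable = [b for b, s in sorted(results.items()) if s >= best - tolerance]
    return viable[0]
```

Run the sweep on a representative sample of real queries; a budget tuned on toy questions will not transfer to production traffic.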