Search API Token Budgets: Practical Guide
When to use tight token budgets vs loose ones for search-augmented LLMs. Math on cost impact per model tier.
Every search result you feed into an LLM costs tokens. A single search returning 10 results with snippets adds roughly 1,500-3,000 tokens to your context window. Feed 5 searches into a research pipeline and you are burning 10,000-15,000 tokens before the model generates a single word. At GPT-4o input pricing ($2.50/M tokens), that is $0.025-$0.0375 in token costs on top of the search API cost. Token budgets control this.
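That arithmetic is worth keeping at hand. A minimal sketch of the input-token cost calculation, using the GPT-4o input price quoted above:

```python
GPT4O_INPUT_PER_M = 2.50  # USD per million input tokens (price quoted above)

def context_cost(tokens: int, price_per_m: float = GPT4O_INPUT_PER_M) -> float:
    """Cost in USD of feeding `tokens` of search context to the model."""
    return tokens / 1_000_000 * price_per_m

# 5 searches x ~2,000-3,000 tokens each => 10,000-15,000 tokens of context
print(round(context_cost(10_000), 4))  # 0.025
print(round(context_cost(15_000), 4))  # 0.0375
```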
Token budget tiers
- Tight (under 2,000 tokens): simple Q&A, single search, top 3 results only. Cost: minimal.
- Medium (2,000-8,000 tokens): multi-search with summarization. Good for most agent tasks.
- Loose (8,000-30,000 tokens): deep research, multiple sources, full snippets. Expensive but thorough.
Controlling result count to manage tokens
```python
import os

import requests

def search_with_budget(query: str, budget: str = "medium") -> list:
    """Search with token-aware result limits."""
    limits = {"tight": 3, "medium": 5, "loose": 10}
    num_results = limits.get(budget, 5)
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": num_results},
        timeout=10,
    )
    resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error body
    results = resp.json().get("results", [])
    if budget == "tight":
        # Only return titles and URLs, skip snippets
        return [{"title": r["title"], "url": r["url"]} for r in results]
    elif budget == "medium":
        # Include snippets but truncate to 150 chars
        return [
            {
                "title": r["title"],
                "url": r["url"],
                "snippet": r.get("snippet", "")[:150],
            }
            for r in results
        ]
    else:
        # Full results
        return results
```

Cost math by model tier
The total cost of a search-augmented query has two components: search API cost and LLM token cost. Here is the math for a medium-budget search (5 results, ~2,500 input tokens) plus a 500-token response.
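A sketch of that calculation, using the per-million-token prices and the flat $0.005 search fee shown in the table below:

```python
def query_cost(input_tokens: int, output_tokens: int,
               input_per_m: float, output_per_m: float,
               search_cost: float = 0.005) -> float:
    """Total cost of one search-augmented query: LLM tokens plus search API fee."""
    llm = input_tokens / 1e6 * input_per_m + output_tokens / 1e6 * output_per_m
    return llm + search_cost

# GPT-4o: $2.50/M input, $10/M output, 2,500 search tokens, 500-token response
print(query_cost(2_500, 500, 2.50, 10.00))  # ~0.016, matching the table
```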
Model | Input $/M | Output $/M | Search tokens | Response | Total LLM cost | Search cost | Total
-------------------|------------|------------|---------------|----------|----------------|-------------|------
GPT-4o | $2.50 | $10.00 | 2,500 | 500 | $0.0113 | $0.005 | $0.016
GPT-4o-mini | $0.15 | $0.60 | 2,500 | 500 | $0.0007 | $0.005 | $0.006
Claude 3.5 Sonnet | $3.00 | $15.00 | 2,500 | 500 | $0.0150 | $0.005 | $0.020
Claude 3.5 Haiku | $0.80 | $4.00 | 2,500 | 500 | $0.0040 | $0.005 | $0.009
Llama 3.3 (Groq)   | $0.05      | $0.10      | 2,500         | 500      | $0.0002        | $0.005      | $0.005

When to use each budget tier
- Tight budget: factual lookups. "What is the current price of X?" One search, 3 results, title-only context. The LLM extracts the answer from minimal data.
- Medium budget: standard agent tasks. "Compare tools A and B." Two searches (one per tool), 5 results each with snippets. Enough context for a balanced answer.
- Loose budget: deep research. "Write a market analysis of AI search APIs in 2026." Five-plus searches across different angles, full snippets, multiple sources for cross-referencing.
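These heuristics can be wired into an agent as a simple tier selector. A hypothetical sketch — the cue lists and prefix checks are placeholders, and real routing would more likely use task metadata or the model itself:

```python
RESEARCH_CUES = ("market analysis", "deep dive", "comprehensive", "report")

def pick_budget(task: str) -> str:
    """Map a task description to a token-budget tier."""
    t = task.lower()
    if any(cue in t for cue in RESEARCH_CUES):
        return "loose"   # explicit deep-research request
    if t.startswith(("what is", "who is", "when did", "how much")):
        return "tight"   # single factual lookup, titles are enough
    return "medium"      # sensible default for standard agent tasks

print(pick_budget("What is the current price of X?"))  # tight
print(pick_budget("Compare tools A and B"))            # medium
```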
Implementing a budget-aware pipeline
```python
import tiktoken

def estimate_tokens(results: list) -> int:
    """Estimate token count for search results."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    text = " ".join(
        f"{r['title']} {r.get('snippet', '')}" for r in results
    )
    return len(enc.encode(text))

def research_with_budget(query: str, max_tokens: int = 5000):
    """Search and trim results to fit within token budget."""
    results = search_with_budget(query, budget="loose")
    trimmed = []
    total_tokens = 0
    for r in results:
        result_tokens = estimate_tokens([r])
        if total_tokens + result_tokens > max_tokens:
            break
        trimmed.append(r)
        total_tokens += result_tokens
    return {
        "results": trimmed,
        "tokens_used": total_tokens,
        "results_included": len(trimmed),
        "results_dropped": len(results) - len(trimmed),
    }

report = research_with_budget("AI search API pricing 2026", max_tokens=3000)
print(f"Included {report['results_included']} results ({report['tokens_used']} tokens)")
print(f"Dropped {report['results_dropped']} results to stay within budget")
```

The hidden cost: multiple search rounds
Agent loops amplify costs. An agent that searches, reads results, decides it needs more info, and searches again can easily run 3-5 search rounds per task. Each round adds both search API credits and token costs. Cap the number of search rounds in your agent config to prevent runaway costs.
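The cap works best as a hard loop bound rather than a prompt instruction. A minimal sketch — the search, stopping, and refinement callables here are toy stand-ins, not part of any real agent framework:

```python
def agent_research(query, search_fn, needs_more, refine, max_rounds=3):
    """Run search rounds until the agent is satisfied or the round cap is hit."""
    context = []
    for _ in range(max_rounds):
        context.extend(search_fn(query))
        if not needs_more(context):
            break  # agent decided it has enough context
        query = refine(query, context)  # build the next round's query
    return context

# Toy stand-ins: the agent always "needs more", so the cap is what stops it.
rounds = agent_research(
    "AI search API pricing",
    search_fn=lambda q: [{"query": q}],
    needs_more=lambda ctx: True,
    refine=lambda q, ctx: q + " (refined)",
)
print(len(rounds))  # 3: one result per round, hard-capped at 3 rounds
```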
Practical recommendations
- Default to medium budget for most agent tasks
- Use tight budget for chatbot-style single-turn Q&A
- Reserve loose budget for explicit research commands
- Cap agent search rounds at 3 per task unless the user opts in to deep research
- Log token usage per search to identify expensive queries and optimize prompts
- With cheap models (GPT-4o-mini, Haiku), the search API cost dominates. With expensive models (GPT-4o, Sonnet), token cost dominates. Budget strategy should match your model choice.
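Per-search token logging can be a thin wrapper. A sketch — the ~4-characters-per-token estimate is a rough heuristic standing in for a real tokenizer:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("search.budget")

def log_search_tokens(query: str, results: list) -> int:
    """Log a rough token estimate (~4 chars/token) for one search's results."""
    text = " ".join(f"{r.get('title', '')} {r.get('snippet', '')}" for r in results)
    tokens = len(text) // 4
    log.info("query=%r results=%d est_tokens=%d", query, len(results), tokens)
    return tokens
```

Aggregating these log lines per query string is usually enough to spot the handful of queries that dominate token spend.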