
Search API Token Budgets: Practical Guide

When to use tight token budgets versus loose ones for search-augmented LLMs, with cost math per model tier.

5 min read

Every search result you feed into an LLM costs tokens. A single search returning 10 results with snippets adds roughly 1,500-3,000 tokens to your context window. Feed 5 searches into a research pipeline and you are burning 10,000-15,000 tokens before the model generates a single word. At GPT-4o input pricing ($2.50/M tokens), that is $0.025-$0.0375 in token costs on top of the search API cost. Token budgets control this.
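The arithmetic above is worth automating. A minimal helper (hypothetical, not part of any API) converts a token count and a per-million price into dollars:

```python
def input_token_cost(tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars, given a per-million-token price."""
    return tokens * price_per_million / 1_000_000

# Five searches at 10,000-15,000 tokens, GPT-4o input pricing ($2.50/M):
low = input_token_cost(10_000, 2.50)   # 0.025
high = input_token_cost(15_000, 2.50)  # 0.0375
```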

Token budget tiers

  • Tight (under 2,000 tokens): simple Q&A, single search, top 3 results only. Cost: minimal.
  • Medium (2,000-8,000 tokens): multi-search with summarization. Good for most agent tasks.
  • Loose (8,000-30,000 tokens): deep research, multiple sources, full snippets. Expensive but thorough.

Controlling result count to manage tokens

Python
import requests, os

def search_with_budget(query: str, budget: str = "medium") -> list:
    """Search with token-aware result limits."""
    limits = {"tight": 3, "medium": 5, "loose": 10}
    num_results = limits.get(budget, 5)

    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": num_results},
        timeout=10,
    )
    resp.raise_for_status()  # fail fast on auth, quota, or server errors
    results = resp.json().get("results", [])

    if budget == "tight":
        # Only return titles and URLs, skip snippets
        return [{"title": r["title"], "url": r["url"]} for r in results]
    elif budget == "medium":
        # Include snippets but truncate to 150 chars
        return [
            {
                "title": r["title"],
                "url": r["url"],
                "snippet": r.get("snippet", "")[:150],
            }
            for r in results
        ]
    else:
        # Full results
        return results

Cost math by model tier

The total cost of a search-augmented query has two components: search API cost and LLM token cost. Here is the math for a medium-budget search (5 results, ~2,500 input tokens) plus a 500-token response.

Text
Model              | Input $/M  | Output $/M | Search tokens | Response | Total LLM cost | Search cost | Total
-------------------|------------|------------|---------------|----------|----------------|-------------|------
GPT-4o             | $2.50      | $10.00     | 2,500         | 500      | $0.0113        | $0.005      | $0.016
GPT-4o-mini        | $0.15      | $0.60      | 2,500         | 500      | $0.0007        | $0.005      | $0.006
Claude 3.5 Sonnet  | $3.00      | $15.00     | 2,500         | 500      | $0.0150        | $0.005      | $0.020
Claude 3.5 Haiku   | $0.80      | $4.00      | 2,500         | 500      | $0.0040        | $0.005      | $0.009
Llama 3.3 (Groq)   | $0.05      | $0.10      | 2,500         | 500      | $0.0002        | $0.005      | $0.005
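Each row of the table follows the same formula. A quick sketch (prices are the per-million figures above; the $0.005 search API cost is the assumption used in the table):

```python
def query_cost(input_price: float, output_price: float,
               search_tokens: int = 2_500, response_tokens: int = 500,
               search_cost: float = 0.005) -> tuple:
    """Per-query cost in dollars: LLM tokens at per-million pricing, plus one search call."""
    llm = (search_tokens * input_price + response_tokens * output_price) / 1_000_000
    return llm, llm + search_cost

llm, total = query_cost(2.50, 10.00)  # GPT-4o row: ~$0.0113 LLM, ~$0.016 total
```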

When to use each budget tier

  1. Tight budget: factual lookups. "What is the current price of X?" One search, 3 results, title-only context. The LLM extracts the answer from minimal data.
  2. Medium budget: standard agent tasks. "Compare tools A and B." Two searches (one per tool), 5 results each with snippets. Enough context for a balanced answer.
  3. Loose budget: deep research. "Write a market analysis of AI search APIs in 2026." Five-plus searches across different angles, full snippets, multiple sources for cross-referencing.
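The tier selection itself can be automated with a rough heuristic. The keyword lists below are illustrative assumptions, not part of any API; a production agent might instead let the LLM classify the task:

```python
def pick_budget(task: str) -> str:
    """Map a task description to a budget tier via simple keyword matching."""
    t = task.lower()
    if any(k in t for k in ("market analysis", "deep research", "report")):
        return "loose"   # multi-angle research
    if any(k in t for k in ("compare", "versus", " vs ")):
        return "medium"  # needs snippets from multiple searches
    return "tight"       # default: treat as a factual lookup

pick_budget("Compare tools A and B")  # "medium"
```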

Implementing a budget-aware pipeline

Python
import tiktoken

def estimate_tokens(results: list) -> int:
    """Estimate token count for search results."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    text = " ".join(
        f"{r['title']} {r.get('snippet', '')}" for r in results
    )
    return len(enc.encode(text))

def research_with_budget(query: str, max_tokens: int = 5000):
    """Search and trim results to fit within token budget."""
    results = search_with_budget(query, budget="loose")
    trimmed = []
    total_tokens = 0

    for r in results:
        result_tokens = estimate_tokens([r])
        if total_tokens + result_tokens > max_tokens:
            break
        trimmed.append(r)
        total_tokens += result_tokens

    return {
        "results": trimmed,
        "tokens_used": total_tokens,
        "results_included": len(trimmed),
        "results_dropped": len(results) - len(trimmed),
    }

report = research_with_budget("AI search API pricing 2026", max_tokens=3000)
print(f"Included {report['results_included']} results ({report['tokens_used']} tokens)")
print(f"Dropped {report['results_dropped']} results to stay within budget")

The hidden cost: multiple search rounds

Agent loops amplify costs. An agent that searches, reads results, decides it needs more info, and searches again can easily run 3-5 search rounds per task. Each round adds both search API credits and token costs. Cap the number of search rounds in your agent config to prevent runaway costs.
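The cap can be a simple loop bound. In this sketch, `search` and `need_more_info` are placeholders for your own search call and stopping check:

```python
def capped_research(query: str, search, need_more_info, max_rounds: int = 3) -> list:
    """Run search rounds until the agent is satisfied or the round cap is hit."""
    gathered = []
    for _ in range(max_rounds):
        gathered.extend(search(query))
        if not need_more_info(gathered):
            break  # agent has enough context; stop spending
    return gathered

# A stub search that always wants more info stops at the cap, never beyond it.
calls = []
results = capped_research(
    "q",
    search=lambda q: (calls.append(q), ["r"])[1],
    need_more_info=lambda g: True,
)
len(calls)  # 3
```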

Practical recommendations

  • Default to medium budget for most agent tasks
  • Use tight budget for chatbot-style single-turn Q&A
  • Reserve loose budget for explicit research commands
  • Cap agent search rounds at 3 per task unless the user opts in to deep research
  • Log token usage per search to identify expensive queries and optimize prompts
  • With cheap models (GPT-4o-mini, Haiku), the search API cost dominates. With expensive models (GPT-4o, Sonnet), token cost dominates. Budget strategy should match your model choice.