
Search API Token Budgets: Practical Guide

When to use tight token budgets versus loose ones for search-augmented LLMs, with cost math per model tier.

5 min read

Every search result you feed into an LLM costs tokens. A single search returning 10 results with snippets adds roughly 1,500-3,000 tokens to your context window. Feed 5 searches into a research pipeline and you are burning 10,000-15,000 tokens before the model generates a single word. At GPT-4o input pricing ($2.50/M tokens), that is $0.025-$0.0375 in token costs on top of the search API cost. Token budgets control this.
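The arithmetic above is worth automating. A minimal helper (hypothetical, not part of any API) converts a token count and a per-million price into dollars:

```python
def input_token_cost(tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars, given a per-million-token price."""
    return tokens * price_per_million / 1_000_000

# Five searches at 10,000-15,000 tokens, GPT-4o input pricing ($2.50/M):
low = input_token_cost(10_000, 2.50)   # 0.025
high = input_token_cost(15_000, 2.50)  # 0.0375
```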

Token budget tiers

  • Tight (under 2,000 tokens): simple Q&A, single search, top 3 results only. Cost: minimal.
  • Medium (2,000-8,000 tokens): multi-search with summarization. Good for most agent tasks.
  • Loose (8,000-30,000 tokens): deep research, multiple sources, full snippets. Expensive but thorough.

Controlling result count to manage tokens

Python
import requests, os

def search_with_budget(query: str, budget: str = "medium") -> list:
    """Search with token-aware result limits."""
    limits = {"tight": 3, "medium": 5, "loose": 10}
    num_results = limits.get(budget, 5)

    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": num_results},
        timeout=10,
    )
    resp.raise_for_status()  # fail fast on auth, quota, or server errors
    results = resp.json().get("results", [])

    if budget == "tight":
        # Only return titles and URLs, skip snippets
        return [{"title": r["title"], "url": r["url"]} for r in results]
    elif budget == "medium":
        # Include snippets but truncate to 150 chars
        return [
            {
                "title": r["title"],
                "url": r["url"],
                "snippet": r.get("snippet", "")[:150],
            }
            for r in results
        ]
    else:
        # Full results
        return results

Cost math by model tier

The total cost of a search-augmented query has two components: search API cost and LLM token cost. Here is the math for a medium-budget search (5 results, ~2,500 input tokens) plus a 500-token response.

Text
Model              | Input $/M  | Output $/M | Search tokens | Response | Total LLM cost | Search cost | Total
-------------------|------------|------------|---------------|----------|----------------|-------------|------
GPT-4o             | $2.50      | $10.00     | 2,500         | 500      | $0.0113        | $0.005      | $0.016
GPT-4o-mini        | $0.15      | $0.60      | 2,500         | 500      | $0.0007        | $0.005      | $0.006
Claude 3.5 Sonnet  | $3.00      | $15.00     | 2,500         | 500      | $0.0150        | $0.005      | $0.020
Claude 3.5 Haiku   | $0.80      | $4.00      | 2,500         | 500      | $0.0040        | $0.005      | $0.009
Llama 3.3 (Groq)   | $0.05      | $0.10      | 2,500         | 500      | $0.0002        | $0.005      | $0.005
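Each row of the table follows the same formula. A quick sketch (prices are the per-million figures above; the $0.005 search API cost is the assumption used in the table):

```python
def query_cost(input_price: float, output_price: float,
               search_tokens: int = 2_500, response_tokens: int = 500,
               search_cost: float = 0.005) -> tuple:
    """Per-query cost in dollars: LLM tokens at per-million pricing, plus one search call."""
    llm = (search_tokens * input_price + response_tokens * output_price) / 1_000_000
    return llm, llm + search_cost

llm, total = query_cost(2.50, 10.00)  # GPT-4o row: ~$0.0113 LLM, ~$0.016 total
```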

When to use each budget tier

  1. Tight budget: factual lookups. "What is the current price of X?" One search, 3 results, title-only context. The LLM extracts the answer from minimal data.
  2. Medium budget: standard agent tasks. "Compare tools A and B." Two searches (one per tool), 5 results each with snippets. Enough context for a balanced answer.
  3. Loose budget: deep research. "Write a market analysis of AI search APIs in 2026." Five-plus searches across different angles, full snippets, multiple sources for cross-referencing.
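The tier selection itself can be automated with a rough heuristic. The keyword lists below are illustrative assumptions, not part of any API; a production agent might instead let the LLM classify the task:

```python
def pick_budget(task: str) -> str:
    """Map a task description to a budget tier via simple keyword matching."""
    t = task.lower()
    if any(k in t for k in ("market analysis", "deep research", "report")):
        return "loose"   # multi-angle research
    if any(k in t for k in ("compare", "versus", " vs ")):
        return "medium"  # needs snippets from multiple searches
    return "tight"       # default: treat as a factual lookup

pick_budget("Compare tools A and B")  # "medium"
```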

Implementing a budget-aware pipeline

Python
import tiktoken

def estimate_tokens(results: list) -> int:
    """Estimate token count for search results."""
    enc = tiktoken.encoding_for_model("gpt-4o")
    text = " ".join(
        f"{r['title']} {r.get('snippet', '')}" for r in results
    )
    return len(enc.encode(text))

def research_with_budget(query: str, max_tokens: int = 5000):
    """Search and trim results to fit within token budget."""
    results = search_with_budget(query, budget="loose")
    trimmed = []
    total_tokens = 0

    for r in results:
        result_tokens = estimate_tokens([r])
        if total_tokens + result_tokens > max_tokens:
            break
        trimmed.append(r)
        total_tokens += result_tokens

    return {
        "results": trimmed,
        "tokens_used": total_tokens,
        "results_included": len(trimmed),
        "results_dropped": len(results) - len(trimmed),
    }

report = research_with_budget("AI search API pricing 2026", max_tokens=3000)
print(f"Included {report['results_included']} results ({report['tokens_used']} tokens)")
print(f"Dropped {report['results_dropped']} results to stay within budget")

The hidden cost: multiple search rounds

Agent loops amplify costs. An agent that searches, reads results, decides it needs more info, and searches again can easily run 3-5 search rounds per task. Each round adds both search API credits and token costs. Cap the number of search rounds in your agent config to prevent runaway costs.
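The cap can be a simple loop bound. In this sketch, `search` and `need_more_info` are placeholders for your own search call and stopping check:

```python
def capped_research(query: str, search, need_more_info, max_rounds: int = 3) -> list:
    """Run search rounds until the agent is satisfied or the round cap is hit."""
    gathered = []
    for _ in range(max_rounds):
        gathered.extend(search(query))
        if not need_more_info(gathered):
            break  # agent has enough context; stop spending
    return gathered

# A stub search that always wants more info stops at the cap, never beyond it.
calls = []
results = capped_research(
    "q",
    search=lambda q: (calls.append(q), ["r"])[1],
    need_more_info=lambda g: True,
)
len(calls)  # 3
```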

Practical recommendations

  • Default to medium budget for most agent tasks
  • Use tight budget for chatbot-style single-turn Q&A
  • Reserve loose budget for explicit research commands
  • Cap agent search rounds at 3 per task unless the user opts in to deep research
  • Log token usage per search to identify expensive queries and optimize prompts
  • With cheap models (GPT-4o-mini, Haiku), the search API cost dominates. With expensive models (GPT-4o, Sonnet), token cost dominates. Budget strategy should match your model choice.