
RAG Search Grounding: Cost vs Quality Tradeoffs

Adding search grounding to RAG improves accuracy 30-50% for time-sensitive queries at $0.005-0.03 per query. When it's worth it and when it's not.


Adding search grounding to a RAG pipeline improves answer accuracy for time-sensitive queries by 30-50% but adds $0.005-0.03 per query in search API costs on top of your existing LLM and vector database costs. The tradeoff depends on whether your queries need current data or whether your existing knowledge base is sufficient.

RAG without search grounding

Standard RAG retrieves from your own vector database. This works well when the knowledge base is comprehensive and up to date. It fails when:

  • Users ask about events after your last ingestion run
  • The knowledge base has gaps in coverage
  • Answers require cross-referencing with external data

RAG with search grounding

Search-grounded RAG adds a web search step to supplement the vector database. The LLM receives both your internal documents and fresh web results, producing answers that combine your proprietary data with current external context.

Cost breakdown per query

Python
# Cost comparison: RAG vs Search-Grounded RAG
# Assumes Claude Sonnet 4 at $3/1M input tokens

# Standard RAG
vector_db_cost = 0.0001      # Pinecone/Weaviate per query
embedding_cost = 0.0001      # Embedding the query
rag_context_tokens = 2000    # ~5 chunks at 400 tokens
llm_input_cost = (rag_context_tokens / 1_000_000) * 3.0
standard_rag_total = vector_db_cost + embedding_cost + llm_input_cost
print(f"Standard RAG: ${standard_rag_total:.4f}/query")

# Search-Grounded RAG
search_api_cost = 0.005      # Scavio per query
search_context_tokens = 1000 # 5 results at 200 tokens
additional_llm_cost = (search_context_tokens / 1_000_000) * 3.0
search_grounded_total = standard_rag_total + search_api_cost + additional_llm_cost
print(f"Search-Grounded RAG: ${search_grounded_total:.4f}/query")

# Cost increase
increase = search_grounded_total - standard_rag_total
pct = (increase / standard_rag_total) * 100
print(f"Additional cost: ${increase:.4f}/query ({pct:.0f}% increase)")

# Monthly cost at 10K queries
print(f"\nMonthly at 10K queries:")
print(f"  Standard RAG: ${standard_rag_total * 10000:.2f}")
print(f"  Search-Grounded: ${search_grounded_total * 10000:.2f}")
print(f"  Difference: ${increase * 10000:.2f}")

Semantic search (Exa) vs keyword search (Scavio) for RAG

The choice of search provider affects RAG quality differently depending on query type:

  • Keyword search (Scavio, SerpAPI, Brave): returns results matching exact terms. Best for factual queries, current events, and specific entities, e.g. "What is Vercel's current pricing?"
  • Semantic search (Exa): returns results matching meaning. Best for conceptual queries, e.g. "approaches to serverless cost optimization"

For most production RAG systems, keyword search is the better default because users typically ask about specific things, not abstract concepts. Semantic search adds value as a supplement when keyword search returns low-relevance results.
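
One way to operationalize that supplement is a keyword-first, semantic-fallback pattern: query the keyword provider, and only call the semantic provider when results come back thin. Below is a minimal sketch; the Scavio request mirrors the snippet later in this post, while the Exa request shape (the x-api-key header and query/numResults/type body fields) follows Exa's public search endpoint but should be verified against current docs, and the EXA_API_KEY variable name is an assumption.

Python
import os
import requests

def keyword_then_semantic(query: str, min_results: int = 3) -> list[dict]:
    """Keyword search first; fall back to semantic search if results are thin."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": 5},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("organic_results", [])
    if len(results) >= min_results:
        return results

    # Fallback: Exa's semantic ("neural") search -- endpoint and field
    # names assumed from Exa's public API; verify before relying on them
    resp = requests.post(
        "https://api.exa.ai/search",
        headers={"x-api-key": os.environ["EXA_API_KEY"]},
        json={"query": query, "numResults": 5, "type": "neural"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])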

LangChain integration: search-grounded RAG

Python
import os
import requests
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# 1. Vector store retriever (your internal knowledge)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore(
    index_name="my-docs",
    embedding=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 2. Web search for fresh context
def web_search(query: str, num_results: int = 3) -> str:
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": num_results},
        timeout=10,
    )
    resp.raise_for_status()  # surface API errors instead of parsing bad JSON
    results = resp.json().get("organic_results", [])
    return "\n".join(
        f"- {r['title']}: {r['snippet']}" for r in results
    )

# 3. Combined RAG chain
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Answer using both the internal documents and web search results. "
     "Prefer internal documents for proprietary information. "
     "Prefer web results for current/time-sensitive information. "
     "Cite which source type you used."),
    ("user",
     "Internal docs:\n{internal_context}\n\n"
     "Web results:\n{web_context}\n\n"
     "Question: {question}"),
])

def search_grounded_rag(question: str) -> str:
    # Retrieve from both sources
    internal_docs = retriever.invoke(question)
    internal_context = "\n".join(doc.page_content for doc in internal_docs)
    web_context = web_search(question)

    # Generate answer
    chain = prompt | llm
    response = chain.invoke({
        "internal_context": internal_context,
        "web_context": web_context,
        "question": question,
    })
    return response.content
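
Usage is a single call; the prompt instructs the model to say which source type it relied on:

Python
answer = search_grounded_rag("What is Vercel's current pricing?")
print(answer)  # answer should cite web results for the pricing figures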

When to skip search grounding

Not every query needs web search. Save costs by adding search only when it adds value:

  • Skip: queries fully answerable from your knowledge base
  • Skip: queries about your own product or internal processes
  • Skip: historical or reference-type questions
  • Add: queries containing "current," "latest," "2026," or date references
  • Add: queries about competitors, market data, or external entities
  • Add: queries where the vector store returns low-confidence matches

Python
def conditional_search_rag(question: str) -> str:
    """Only invoke web search when vector results are insufficient."""
    # retriever.invoke() does not expose similarity scores, so query the
    # vector store directly for (Document, score) pairs
    docs_and_scores = vectorstore.similarity_search_with_score(question, k=3)
    internal_docs = [doc for doc, _ in docs_and_scores]
    scores = [score for _, score in docs_and_scores]
    avg_score = sum(scores) / len(scores) if scores else 0

    time_sensitive_keywords = ["current", "latest", "2026", "today", "now", "price"]
    needs_freshness = any(kw in question.lower() for kw in time_sensitive_keywords)

    internal_context = "\n".join(doc.page_content for doc in internal_docs)

    if avg_score < 0.7 or needs_freshness:
        web_context = web_search(question)
    else:
        web_context = "No web search performed -- internal docs sufficient."

    chain = prompt | llm
    response = chain.invoke({
        "internal_context": internal_context,
        "web_context": web_context,
        "question": question,
    })
    return response.content
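
One design note: the 0.7 threshold assumes a similarity metric where higher means more relevant (Pinecone's default cosine similarity works this way). If your index returns distances instead, invert the comparison.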

Bottom line

Search grounding adds roughly $50/month in search API fees per 10K queries, about $80/month all-in once the extra context tokens are counted, but it significantly improves accuracy for time-sensitive questions. The optimal approach is conditional: trigger web search only when vector store confidence is low or the query signals a freshness requirement. This captures the quality benefit while keeping costs close to standard RAG for the majority of queries.