Tutorial

How to Build a 10M-Token RAG Corpus With Scavio (2026)

Search-as-source: 200 seed queries → Scavio Google SERP → /extract on the top 2K URLs → about 8M tokens of clean Markdown, for roughly $50-90.

An r/Rag post asked which scraper to use for ~10M tokens. For indexed public content, the cheaper and more reliable approach is search-as-source: let a search API find the pages, then extract them. This tutorial walks through the recipe.

Prerequisites

  • Scavio API key
  • Python or Node
  • Topic with 200-500 seed query candidates
  • Embedding pipeline

Walkthrough

Step 1: Define 200-500 seed queries

Favor topical breadth over depth: many distinct subtopics rather than many phrasings of one.

Python
# First three seeds shown; extend the list to 200-500 queries.
seeds = ['ai agent infrastructure 2026', 'agent memory patterns', 'tool use mcp']

Step 2: Scavio Google SERP per seed

Collect organic_results URLs.

Python
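import os
import requests

# Auth header reused by every Scavio call in this tutorial.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
urls = set()
for q in seeds:
    r = requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': q}, timeout=30).json()
    # Keep at most the first 10 organic results per seed.
    for o in (r.get('organic_results') or [])[:10]:
        if o.get('link'):  # skip results without a URL
            urls.add(o['link'])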
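
The loop above records each URL once and drops how often it appeared. A minimal sketch of a variant that keeps a per-URL count, so that "top URLs" in Step 4 can mean "surfaced by the most seeds". The helper name `count_urls` is hypothetical, and the response dicts are assumed to have the same `organic_results` / `link` shape used above.

```python
from collections import Counter

def count_urls(serp_responses, per_query=10):
    """Count how many seed queries surface each URL.

    serp_responses: iterable of already-parsed response dicts, assumed to
    carry an 'organic_results' list whose items have a 'link' key.
    """
    counts = Counter()
    for resp in serp_responses:
        results = (resp.get("organic_results") or [])[:per_query]
        # Use a set so a URL listed twice in one SERP counts once.
        for link in {r.get("link") for r in results if r.get("link")}:
            counts[link] += 1
    return counts

# Illustrative stand-in responses, not real Scavio output.
fake = [
    {"organic_results": [{"link": "https://a.example/x"}, {"link": "https://b.example/y"}]},
    {"organic_results": [{"link": "https://a.example/x"}]},
    {"organic_results": None},
]
counts = count_urls(fake)
# most_common(n) returns (url, count) pairs, highest count first.
top = [url for url, _ in counts.most_common(1)]
```

With real responses, `[u for u, _ in counts.most_common(2000)]` would give a ranked list to feed into the extract step.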

Step 3: Deduplicate URL set

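Many seeds surface the same authoritative pages. Because Step 2 collects into a set, exact duplicates are already collapsed; check how many unique URLs remain.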

Python
print(f'Unique URLs: {len(urls)}')
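
A set only removes exact string matches, so the same page can still appear under several spellings (tracking parameters, fragments, trailing slashes). One possible normalization pass, using only the standard library. The function name `normalize_url` and the list of tracking parameters are assumptions for illustration; stripping trailing slashes is a judgment call that can, on rare sites, merge two distinct resources.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; extend for your own sources.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}

def normalize_url(url):
    """Lowercase scheme and host, drop fragment and tracking params, trim trailing slash."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith(TRACKING_PREFIXES) and k.lower() not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))

raw = [
    "https://Example.com/Post/?utm_source=x&id=7#top",
    "https://example.com/Post?id=7",
    "https://example.com",
]
unique = {normalize_url(u) for u in raw}
```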

Step 4: Scavio /extract on top URLs

Returns clean Markdown.

Python
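docs = []
# urls is a set, so this slice is an arbitrary 2,000 URLs, not a ranked top 2,000.
for u in list(urls)[:2000]:
    d = requests.post('https://api.scavio.dev/api/v1/extract', headers=H, json={'url': u}, timeout=60).json()
    # Skip pages that returned no extractable text.
    if d.get('text'):
        docs.append({'url': u, 'text': d['text']})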
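
Two thousand sequential HTTP calls will likely hit a few transient failures, and in the loop above a single exception stops the whole run. A small retry helper with exponential backoff is one way to soften that. This is a generic sketch, not part of the Scavio API; `with_retries` and its parameters are hypothetical names.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the wait after each failure.

    The last failure is re-raised so the caller can decide whether to skip the URL.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)
```

In the extract loop, the request could be wrapped as `with_retries(lambda: requests.post(...).json())`, with a `try`/`except` around it to skip URLs that still fail.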

Step 5: Token-budget trim

Stop at 10M tokens.

Python
# Walk top-N until cumulative tokens hit 10M.
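
The comment above can be sketched as a short loop. Token counts here use a rough heuristic of about 4 characters per token, which is an assumption; swap in your embedding model's tokenizer for exact numbers. The names `estimate_tokens` and `trim_to_budget` are illustrative.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): ~4 characters per token for English text.
    return max(1, len(text) // 4)

def trim_to_budget(docs, budget=10_000_000, count=estimate_tokens):
    """Keep docs in order until adding the next one would exceed the budget."""
    kept, total = [], 0
    for doc in docs:
        n = count(doc["text"])
        if total + n > budget:
            break
        kept.append(doc)
        total += n
    return kept, total

# Tiny illustration: three 40-character docs (~10 tokens each), budget of 25.
sample = [{"url": f"https://example.com/{i}", "text": "x" * 40} for i in range(3)]
kept, total = trim_to_budget(sample, budget=25)
```

Because the walk stops at the first document that would overflow, the order of `docs` determines what survives, so rank them before trimming.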

Step 6: Embed and ship to vector store

Use your existing embedding pipeline.

Python
# Voyage / OpenAI / Cohere → Pinecone / Qdrant / pgvector.
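
Whichever provider and store you use, the glue usually looks like batching. A provider-agnostic sketch follows: `embed` and `upsert` are hypothetical callables that you would wrap around your own embedding client and vector store, and the record layout is an assumption. Chunking long documents is left to your existing pipeline.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_and_upsert(docs, embed, upsert, batch_size=64):
    """Embed docs in batches and hand the records to the store.

    embed:  list[str] -> list of vectors (hypothetical wrapper around your provider)
    upsert: list[dict] -> None           (hypothetical wrapper around your vector store)
    Returns the number of documents written.
    """
    written = 0
    for batch in batched(docs, batch_size):
        vectors = embed([d["text"] for d in batch])
        records = [
            {"id": d["url"], "vector": v, "text": d["text"]}
            for d, v in zip(batch, vectors)
        ]
        upsert(records)
        written += len(batch)
    return written
```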

Step 7: Quarterly refresh

Re-run the pipeline each quarter and diff the URL set against the previous run.

Python
# Cron: quarterly. Embed only new/changed pages.
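
To embed only new or changed pages, the refresh needs a record of what the previous run saw. One simple option is a content hash per URL. A sketch, assuming you persist the returned `current` mapping between runs (for example as JSON); `fingerprint` and `diff_corpus` are illustrative names.

```python
import hashlib

def fingerprint(text):
    """Stable content hash for change detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(previous, docs):
    """Compare this run against the last one.

    previous: {url: sha256 hex digest} saved from the prior run
    docs:     current list of {'url': ..., 'text': ...}
    Returns (current, new, changed, removed).
    """
    current = {d["url"]: fingerprint(d["text"]) for d in docs}
    new = [u for u in current if u not in previous]
    changed = [u for u in current if u in previous and previous[u] != current[u]]
    removed = [u for u in previous if u not in current]
    return current, new, changed, removed
```

Only `new` and `changed` need re-embedding; `removed` URLs can be deleted from the vector store.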

Python Example

Python
# Total cost: ~11K credits ≈ $50-90 within Project tier.
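
The figures quoted in this tutorial imply some useful unit costs. The arithmetic below uses only the tutorial's own numbers (11K credits, $50-90, 2K extracted pages, 8M tokens); it does not assume any per-call credit price.

```python
credits = 11_000
cost_low, cost_high = 50, 90      # USD, from the estimate above
docs_extracted = 2_000
tokens = 8_000_000

# Implied price range per credit.
usd_per_credit = (cost_low / credits, cost_high / credits)
# Cost per million tokens of clean Markdown.
usd_per_million_tokens = (cost_low / (tokens / 1e6), cost_high / (tokens / 1e6))
# Average page size in tokens.
avg_tokens_per_doc = tokens / docs_extracted

print(usd_per_credit, usd_per_million_tokens, avg_tokens_per_doc)
```

That works out to an average of 4,000 tokens per extracted page and roughly $6 to $11 per million tokens.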

JavaScript Example

JavaScript
// Same shape in TS.

Expected Output

A RAG corpus built from indexed public content: ~5K unique URLs → ~2K extracted → about 8M tokens of clean Markdown, within the 10M-token budget.

Frequently Asked Questions

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

You need a Scavio API key, Python or Node, a topic with 200-500 seed query candidates, and an embedding pipeline. A Scavio API key gives you 500 free credits per month.

Yes, for a scaled-down run. The free tier includes 500 credits per month, which is enough to try the pipeline on a small set of seeds and prototype a working solution. The full corpus described here is estimated at about 11K credits.

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt to your framework of choice.
