An r/Rag post asked which scraper to use for a ~10M-token corpus. For indexed public content, the cheaper, more reliable shape is search-as-source. This post walks through the recipe.
Prerequisites
- Scavio API key
- Python or Node
- Topic with 200-500 seed query candidates
- Embedding pipeline
Walkthrough
Step 1: Define 200-500 seed queries
Topical breadth > depth.
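One way to reach 200-500 seeds without hand-writing each one is template expansion — cross a list of topics with a list of angles. The topic and angle strings below are illustrative, not from the original post:

```python
# Hypothetical helper: cross topics with angles to expand a handful of
# hand-picked terms into a few hundred seed queries.
topics = ['ai agent infrastructure', 'agent memory', 'tool use mcp', 'rag evaluation']
angles = ['2026', 'patterns', 'best practices', 'benchmarks', 'open source']
seeds = [f'{t} {a}' for t in topics for a in angles]
print(len(seeds))  # 20 here; grow either list until you land in the 200-500 range
```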
```python
seeds = ['ai agent infrastructure 2026', 'agent memory patterns', 'tool use mcp', ...]
```

Step 2: Scavio Google SERP per seed
Collect organic_results URLs.
```python
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
urls = set()
for q in seeds:
    r = requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': q}).json()
    for o in (r.get('organic_results') or [])[:10]:
        urls.add(o['link'])
```

Step 3: Deduplicate URL set
Many seeds surface the same authoritative pages.
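Exact-match set dedup misses trivial variants (trailing slash, tracking params, fragments). A minimal normalization sketch — note that dropping the query string entirely is lossy if a site keys real content on query params, so adjust per corpus:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Collapse trivial URL variants: lowercase host, drop fragment and
    query string, trim trailing slash."""
    s = urlsplit(url)
    path = s.path.rstrip('/') or '/'
    return urlunsplit((s.scheme, s.netloc.lower(), path, '', ''))

urls = {
    'https://Example.com/post/',
    'https://example.com/post?utm_source=reddit',
    'https://example.com/post#top',
}
urls = {normalize(u) for u in urls}  # collapses to a single URL
```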
```python
print(f'Unique URLs: {len(urls)}')
```

Step 4: Scavio /extract on top URLs
Returns clean Markdown.
docs = []
```python
docs = []
for u in list(urls)[:2000]:
    d = requests.post('https://api.scavio.dev/api/v1/extract', headers=H, json={'url': u}).json()
    if d.get('text'):
        docs.append({'url': u, 'text': d['text']})
```

Step 5: Token-budget trim
Stop at 10M tokens.
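A sketch of the cumulative walk, using a rough 4-characters-per-token heuristic (my assumption, not the post's; swap in a real tokenizer such as tiktoken when the budget matters precisely):

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per English token.
    return len(text) // 4

def trim_to_budget(docs, budget=10_000_000):
    """Keep docs in priority order until the cumulative token
    estimate would exceed the budget."""
    corpus, total = [], 0
    for doc in docs:
        t = rough_tokens(doc['text'])
        if total + t > budget:
            break
        corpus.append(doc)
        total += t
    return corpus, total

docs = [{'url': 'a', 'text': 'x' * 400}, {'url': 'b', 'text': 'x' * 400}]
kept, total = trim_to_budget(docs, budget=150)  # 100 tokens each -> keeps 1
```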
```python
# Walk top-N until cumulative tokens hit 10M.
```

Step 6: Embed and ship to vector store
Per existing pipeline.
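Whatever embedder you use, long extracted pages need splitting first. A minimal fixed-size chunker with overlap — character-based is a deliberate simplification; a token-aware splitter is strictly better:

```python
def chunk(text: str, size: int = 2000, overlap: int = 200):
    """Fixed-size character windows with `overlap` chars of context
    shared between neighbors."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk('x' * 5000)  # -> 3 overlapping windows
```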
```python
# Voyage / OpenAI / Cohere → Pinecone / Qdrant / pgvector.
```

Step 7: Quarterly refresh
Re-run + diff URL set.
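The diff itself is just set arithmetic: persist each run's URL set (JSON on disk, vector-store metadata, wherever), then compare next quarter and embed only the new pages. A sketch with made-up example URLs:

```python
def diff_url_sets(previous: set, current: set):
    """New URLs to extract and embed, and stale URLs to drop from the index."""
    return current - previous, previous - current

prev = {'https://a.example/x', 'https://a.example/y'}
curr = {'https://a.example/y', 'https://a.example/z'}
new, stale = diff_url_sets(prev, curr)
# new == {'https://a.example/z'}, stale == {'https://a.example/x'}
```

This catches added and removed pages; detecting changed content at a stable URL additionally needs a content hash per page.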
```python
# Cron: quarterly. Embed only new/changed pages.
```

Python Example
```python
# Total cost: ~11K credits ≈ $50-90 within Project tier.
```

JavaScript Example
```javascript
// Same shape in TS.
```

Expected Output
A RAG corpus from indexed public content: ~5K unique URLs → ~2K extracted pages → ~8M tokens of clean Markdown, inside the 10M budget.