Overview
Search-as-source workflow for building a 10M-token RAG corpus from indexed public content, avoiding most of the maintenance burden of running your own scrapers.
Trigger
Per topic build (one-shot or quarterly refresh)
Schedule
Per topic (one-shot or quarterly)
Workflow Steps
Define 200-500 seed queries covering the topic
Topical breadth > depth on individual queries.
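One way to reach 200-500 seeds without writing each by hand is to cross a list of topic facets with a list of query modifiers. This is a sketch only; the facet and modifier lists below are illustrative examples, not part of the workflow spec.

```python
from itertools import product

def expand_seeds(facets, modifiers):
    """Cross topic facets with query modifiers to generate seed queries.
    Both input lists are curated per topic; the product gives breadth cheaply."""
    return [f"{facet} {modifier}" for facet, modifier in product(facets, modifiers)]

# Illustrative inputs: 3 facets x 4 modifiers = 12 queries.
# Scale both lists (e.g. 25 facets x 12 modifiers) to land in the 200-500 range.
seeds = expand_seeds(
    ["vector databases", "RAG chunking", "embedding models"],
    ["tutorial", "benchmark", "best practices", "comparison"],
)
```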
Scavio Google SERP per seed
Collect organic_results URLs.
Deduplicate URL set
Many seeds surface the same authoritative pages.
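Raw string dedup misses near-duplicates that differ only in tracking parameters, fragments, or trailing slashes. A normalization pass like the sketch below (the tracking-parameter list is an assumption; extend it for your sources) catches those before the extract step:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Common tracking parameters; an illustrative list, not exhaustive.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "gclid", "fbclid"}

def canonicalize(url):
    """Normalize a URL so the same page dedupes across seed queries:
    lowercase the host, drop the fragment and tracking params, trim the
    trailing slash."""
    parts = urlsplit(url)
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, query, ""))
```

Dedupe on `canonicalize(url)` instead of the raw string when building the URL set.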
Scavio /extract on top-2K URLs
Returns clean Markdown text.
Token-budget trim
Stop at 10M tokens; prefer URLs with higher domain authority.
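The trim step can be sketched as a greedy pass over docs in priority order. Two assumptions here: docs arrive pre-sorted by domain authority (highest first), and tokens are approximated as `len(text) // 4` — swap in your embedding model's real tokenizer for an exact count.

```python
def trim_to_budget(docs, budget_tokens=10_000_000):
    """Greedily keep docs until the token budget is exhausted.
    Assumes docs are pre-sorted highest-priority first; approximates
    token count as len(text) // 4 (roughly 4 chars/token for English)."""
    kept, used = [], 0
    for text in docs:
        cost = len(text) // 4
        if used + cost > budget_tokens:
            break  # budget hit; everything after is lower priority anyway
        kept.append(text)
        used += cost
    return kept
```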
Embed and ship to vector store
Per your existing RAG embedding pipeline.
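Whatever the embedding pipeline looks like, the extracted Markdown usually needs chunking first. A minimal sketch, assuming fixed-size character windows with overlap (sizes are illustrative; tune to your embedding model's context limit):

```python
def chunk(text, size=2000, overlap=200):
    """Split one doc into overlapping character windows ahead of embedding.
    Overlap preserves context across chunk boundaries for retrieval."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```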
Python Implementation
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def build_corpus(seeds, per_query=10):
    # Search: collect the top organic-result URLs for each seed query
    urls = set()
    for q in seeds:
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers=H, json={'query': q}).json()
        for o in (r.get('organic_results') or [])[:per_query]:
            urls.add(o['link'])
    # Extract: pull clean Markdown text for up to 2,000 deduplicated URLs
    docs = []
    for u in list(urls)[:2000]:
        d = requests.post('https://api.scavio.dev/api/v1/extract',
                          headers=H, json={'url': u}).json()
        if d.get('text'):
            docs.append(d['text'])
    return docs
JavaScript Implementation
// Same shape in TypeScript: search per seed, dedupe, extract top-N.
Platforms Used
Web search with knowledge graph, PAA, and AI overviews