A post on r/LangChain described an autonomous DaaS architecture for LATAM government sites that combined Google dorks, Llama-3, and MCP, because Playwright kept breaking on Cloudflare. This walkthrough follows the same pattern using Scavio's structured Google SERP results.
Prerequisites
- Scavio API key
- An LLM (any)
- A target domain (the 'site:' anchor)
Walkthrough
Step 1: Define the dork template
The templates rely on standard Google search operators: site:, filetype:, intitle:, and inurl:.
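As a quick illustration of how these operators combine into one query string, here is a minimal sketch. The helper name `build_dork` and the domain `example.gov` are hypothetical and only serve the example; they are not part of Scavio or the original post.

```python
def build_dork(domain, topic, filetype=None, intitle=None, inurl=None):
    """Compose a Google query string from the standard search operators."""
    parts = [f"site:{domain}"]  # every query is anchored to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f"intitle:{intitle}")
    if inurl:
        parts.append(f"inurl:{inurl}")
    parts.append(topic)  # free-text topic goes last
    return " ".join(parts)


# Hypothetical domain and topic, for illustration only.
print(build_dork("example.gov", "budget", filetype="pdf"))
print(build_dork("example.gov", "budget", inurl="reports"))
```

The templates in this step hard-code four such combinations as format strings.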
TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]
Step 2: Run dorked queries via Scavio
Each dork is an ordinary search query; Scavio returns the SERP for it.
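import os
import requests
# Auth header; the API key is read from the environment.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
def dork_search(domain, topic):
    # Run every dork template for one domain/topic and pool the organic results.
    results = []
    for tpl in TEMPLATES:
        q = tpl.format(domain=domain, topic=topic)
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers=H, json={'query': q}, timeout=30)
        r.raise_for_status()  # fail loudly on HTTP errors
        results.extend(r.json().get('organic_results', []))
    return results
Step 3: Dedupe by URL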
The same URL returned by several templates counts as one source.
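Exact string comparison treats `https://example.gov/a.pdf` and `https://EXAMPLE.gov/a.pdf#page=3` as different sources. A minimal sketch of a normalized dedupe key follows, assuming that fragments, host casing, and trailing slashes never distinguish documents on the target site. That assumption may not hold everywhere, and `url_key` and `dedupe_normalized` are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def url_key(url):
    """Return a normalized form of `url` to use as a dedupe key."""
    parts = urlsplit(url.strip())
    # Assumption: a trailing slash does not change which document is served.
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    # The path is left case-sensitive on purpose.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_normalized(results):
    """Keep the first result for each normalized URL."""
    seen, out = set(), []
    for r in results:
        key = url_key(r["link"])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


# Hypothetical hits, for illustration only.
hits = [
    {"link": "https://example.gov/a.pdf"},
    {"link": "https://EXAMPLE.gov/a.pdf#page=3"},
    {"link": "https://example.gov/b"},
]
print(len(dedupe_normalized(hits)))
```

The exact-match `dedupe` that follows is sufficient when URLs come back in a consistent form.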
def dedupe(results):
    seen = set()
    out = []
    for r in results:
        if r['link'] not in seen:
            seen.add(r['link'])
            out.append(r)
    return out
Step 4: Extract clean markdown for top hits
Scavio's /extract endpoint converts each PDF or HTML page into markdown.
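The step title mentions "top hits", so some selection has to happen before extraction. A minimal sketch, assuming SERP order is a reasonable proxy for relevance so that "top" means the first `n` deduped results. `extract_top` is a hypothetical helper, and `extract_fn` stands in for the `extract` function defined in this step, passed in as an argument so the loop can be tried without network access.

```python
def extract_top(results, extract_fn, n=5):
    """Run `extract_fn` on the first `n` result URLs, skipping failures."""
    docs = []
    for hit in results[:n]:
        url = hit["link"]
        try:
            docs.append({"url": url, "content": extract_fn(url)})
        except Exception as exc:  # one bad document should not stop the batch
            print(f"skipping {url}: {exc}")
    return docs


# Stub extractor for a dry run; a real run would pass the Scavio-backed function.
def fake_extract(url):
    if "bad" in url:
        raise ValueError("unreachable")
    return "# markdown for " + url


sample = [
    {"link": "https://example.gov/a"},
    {"link": "https://example.gov/bad"},
    {"link": "https://example.gov/c"},
]
docs = extract_top(sample, fake_extract, n=2)
```

Capping `n` also bounds the number of extract calls made per topic.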
def extract(url):
    # Ask Scavio to fetch the page or PDF and return it as markdown.
    return requests.post('https://api.scavio.dev/api/v1/extract',
                         headers=H, json={'url': url, 'format': 'markdown'}).json()
Step 5: LLM extraction step
Pass the markdown to the LLM together with a structured-extraction prompt.
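Long PDFs can produce more markdown than a model's context window accepts. One possible guard, sketched here under assumptions, is to cap the document text before filling the prompt. `build_prompt` is a hypothetical helper, `template` stands in for a prompt such as the `PROMPT` shown in this step, and the 12,000-character default is an arbitrary placeholder rather than a recommendation for any particular model.

```python
def build_prompt(template, markdown, max_chars=12000):
    """Fill `template` with the first `max_chars` characters of `markdown`."""
    if len(markdown) > max_chars:
        # Mark the cut so the model knows the document is incomplete.
        markdown = markdown[:max_chars] + "\n[truncated]"
    return template.format(md=markdown)


# Tiny template and limit, for illustration only.
demo = build_prompt("Document:\n{md}\n---", "x" * 50, max_chars=10)
print(demo)
```

Truncating by characters is crude; a token-based limit would track the model's real budget more closely.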
PROMPT = '''Extract from this document: title, date, summary (3 sentences), key entities (list).
Document:
{md}
---
Return JSON: {{"title": ..., "date": ..., "summary": ..., "entities": [...]}}'''
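# `llm` is whichever client you use; `markdown` is the markdown text taken from the extract() response.
result = llm.complete(PROMPT.format(md=markdown))
Python Example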
# Per gov-doc pipeline: 4 dorked searches + 1 extract + 1 LLM call = ~$0.025-0.05
JavaScript Example
// Same pipeline in TS.
Expected Output
Structured records (title, date, summary, entities) for indexed government documents. The pipeline skips the Playwright/Cloudflare fight entirely. Limitation: it only works on pages that are publicly indexed.
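LLM replies often wrap the JSON in extra prose, so the records are worth validating before they are stored. A minimal sketch, assuming the reply arrives as a plain string and that a record is usable only if all four fields are present. `parse_record` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = ("title", "date", "summary", "entities")


def parse_record(reply):
    """Parse an LLM reply into a record dict, or return None if it is unusable."""
    # Take the span from the first "{" to the last "}" to skip surrounding prose.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        record = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if any(key not in record for key in REQUIRED_KEYS):
        return None
    if not isinstance(record["entities"], list):
        return None
    return record


# Hypothetical reply, for illustration only.
reply = 'Sure, here it is: {"title": "T", "date": "2026-01-01", "summary": "S", "entities": ["A"]} Done.'
record = parse_record(reply)
```

Returning `None` instead of raising lets a batch job count and skip bad replies without stopping.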