Overview
Daily run: for each government-document topic, run dorked searches through Scavio to find indexed pages, and route auth-gated targets to Playwright. Extract structured records from the fetched content.
Trigger
Daily cron 7am
Schedule
Daily 7am
Workflow Steps
Load target list (domain + topic)
From a YAML config or DB table.
Per target: classify indexed vs auth-gated
Use a per-target flag set during onboarding.
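The first two steps can be sketched as follows. This is a minimal illustration, not the workflow's actual loader: the `Target` class, the `auth_gated` key name, and the example domains are assumptions, and `rows` stands in for whatever a YAML loader or DB query returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    domain: str
    topic: str
    auth_gated: bool = False  # hypothetical flag name, set during onboarding


def load_targets(rows):
    """Build Target objects from already-parsed config rows (list of dicts)."""
    return [
        Target(r["domain"], r["topic"], bool(r.get("auth_gated", False)))
        for r in rows
    ]


def split_by_route(targets):
    """Return (indexed, auth_gated) lists, preserving input order."""
    indexed = [t for t in targets if not t.auth_gated]
    gated = [t for t in targets if t.auth_gated]
    return indexed, gated


# Example rows, shaped like the output of a YAML loader or a DB query.
rows = [
    {"domain": "example.gov", "topic": "budget"},
    {"domain": "portal.example.gov", "topic": "permits", "auth_gated": True},
]
indexed, gated = split_by_route(load_targets(rows))
```

Defaulting the flag to false means a target only goes down the browser path when onboarding explicitly marked it.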
Indexed: Scavio dorked search across the dork templates
Templates combine the site:, filetype:, intitle:, and inurl: operators.
Dedupe URLs across templates
Same URL across dorks = one source.
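A possible sketch of the dedupe step, using only the standard library. The normalization rules here (lowercased host, fragment dropped, trailing slash stripped) are assumptions about what should count as the same URL, and the function names are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Keeping first-seen order means that if templates are run in priority order, the top-N cut in the next step still favors the earlier templates.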
Scavio /extract for top-N URLs
Markdown ready for LLM extraction.
Auth-gated: Playwright/Stagehand fetch
Only the small subset that requires login.
LLM structured extraction
Per markdown blob, return JSON {title, date, summary, entities}.
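LLM output is not guaranteed to be well-formed, so it may help to validate each response before storing it. The sketch below assumes the model returns a JSON string with the four keys named above; the `Record` class and `parse_record` function are hypothetical names, and the LLM call itself is left out.

```python
import json
from dataclasses import dataclass, field

REQUIRED = ("title", "date", "summary", "entities")


@dataclass
class Record:
    title: str
    date: str
    summary: str
    entities: list = field(default_factory=list)


def parse_record(raw):
    """Parse one LLM response string into a Record, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [k for k in REQUIRED if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if not isinstance(data["entities"], list):
        raise ValueError("entities must be a list")
    return Record(
        str(data["title"]),
        str(data["date"]),
        str(data["summary"]),
        list(data["entities"]),
    )
```

Raising a single exception type lets the caller skip or retry a bad blob without aborting the whole daily run.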
Append to records DB
Postgres / Sheets / etc.
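As an illustration of the append step, here is a sketch that uses the standard-library `sqlite3` module as a stand-in for Postgres. The table name, the column layout, and the choice to key rows by source URL are all assumptions made for the example.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the records table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               url TEXT PRIMARY KEY,
               title TEXT, date TEXT, summary TEXT, entities TEXT
           )"""
    )
    return conn


def append_record(conn, url, rec):
    """Insert one record; return True if added, False if the URL already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (url, title, date, summary, entities) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, rec["title"], rec["date"], rec["summary"], json.dumps(rec["entities"])),
    )
    conn.commit()
    return cur.rowcount == 1
```

Keying on the URL makes the daily run idempotent: a page that surfaces again tomorrow is ignored instead of being stored twice.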
Python Implementation
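import os
import requests

# The API key is read from the environment and sent with every request.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
# Dork templates: {d} is the target domain, {t} is the topic.
DORKS = ['site:{d} filetype:pdf {t}', 'site:{d} intitle:{t}', 'site:{d} inurl:reports {t}']

def search_first(domain, topic):
    """Run every dork template and return the deduplicated top-5 links from each."""
    urls = []
    for tpl in DORKS:
        q = tpl.format(d=domain, t=topic)
        r = requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': q}, timeout=30)
        r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
        urls.extend(o['link'] for o in r.json().get('organic_results', [])[:5])
    # dict.fromkeys dedupes while keeping first-seen order; set() would not.
    return list(dict.fromkeys(urls))
JavaScript Implementation
// Same in TS.
Platforms Used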
Web search with knowledge graph, PAA, and AI overviews