An r/webscraping post asked how to build a national directory for a service that's often a sub-program inside larger orgs (so Google Maps misses most of it). This post walks through the dorked-search + LLM-extract approach.
Prerequisites
- A target vertical
- Scavio API key
- LLM API key
Walkthrough
Step 1: List sources that already aggregate the long tail
Associations, gov, niche aggregators.
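The source inventory can live as plain data before any scraping happens. A minimal sketch, where every site name is a placeholder for your vertical:

```python
# Placeholder source inventory; keys are source categories,
# values are the sites you will dork against. All names are illustrative.
SOURCES = {
    "national_association": ["national-assoc.org"],
    "state_registries": ["state.gov"],            # expanded per state in practice
    "niche_aggregators": ["niche-aggregator.com"],
    "communities": ["reddit.com/r/VERTICAL"],
    "gov_pages": [".gov"],                        # broad catch-all, filtered later
}

# Per the rule of thumb below: 3-7 sources usually cover ~80% of the long tail.
assert 3 <= len(SOURCES) <= 7
```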
// 3-7 sources cover 80% of long tail:
// - National association directory
// - State/regional reg databases
// - 1-2 niche aggregator sites
// - Reddit communities
// - .gov pages mentioning the program type
Step 2: Build a per-source dork set in Scavio
Each source = different query shape.
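The per-source shapes can be templated per geography. A sketch, assuming a `{state}` slot and placeholder site names:

```python
# One template per source category; {state} is filled per geography.
DORK_TEMPLATES = [
    "site:national-assoc.org programs",               # national, run once
    "site:state.gov VERTICAL {state}",                # per state
    "site:niche-aggregator.com {state}",              # per niche/state
    "site:reddit.com/r/VERTICAL recommendation OR best 2026",
]

def dorks_for_state(state: str) -> list[str]:
    """Expand every template for one state (a no-op for national queries)."""
    return [t.format(state=state) for t in DORK_TEMPLATES]
```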
// site:national-assoc.org programs (national)
// site:state.gov VERTICAL (per state)
// site:niche-aggregator.com (per niche)
// site:reddit.com/r/VERTICAL recommendation OR best 2026
Step 3: Run Scavio across the dork set per geography
City, state, or zip-level.
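A minimal sketch of the run loop. The Scavio client and its response shape (`organic_results` with `link`/`snippet` keys) are assumptions here; inject your real API wrapper and adjust the keys to whatever it actually returns:

```python
from typing import Callable

def collect_results(
    search: Callable[[str], dict],  # thin wrapper over the Scavio API (assumed)
    dorks: list[str],
) -> list[dict]:
    """Run each dork and keep source URL + snippet per organic result."""
    rows = []
    for dork in dorks:
        resp = search(dork)
        # Assumed response shape: {"organic_results": [{"link": ..., "snippet": ...}]}
        for r in resp.get("organic_results", []):
            rows.append({"source_url": r.get("link"), "snippet": r.get("snippet")})
    return rows
```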
// Per state per source-set:
// Run dork → collect organic_results URLs → store source URL + snippet
Step 4: LLM-extract: 'find every entity offering SERVICE in this snippet'
Wide-variation handling.
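A sketch of the extraction call, with the LLM injected as any text-in/text-out function. The prompt wording follows the comment below; the parsing guards are there because snippets vary wildly and the model will sometimes return junk:

```python
import json
from typing import Callable

PROMPT = (
    "Extract every entity (org or program) offering {service} in {geo}. "
    'Return a JSON array of objects with keys {{"name", "address?", "phone?", '
    '"parent_org?", "source_url"}}. Return [] if none.'
)

def extract_entities(
    llm: Callable[[str], str],  # any completion function returning text
    snippet: str, service: str, geo: str, source_url: str,
) -> list[dict]:
    """Prompt the LLM over one snippet and parse its JSON answer."""
    raw = llm(PROMPT.format(service=service, geo=geo) + "\n\nSnippet:\n" + snippet)
    try:
        entities = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable answer: skip rather than crash the run
    if not isinstance(entities, list):
        return []
    # Keep only rows with a name; stamp provenance from the SERP result.
    return [{**e, "source_url": source_url} for e in entities if e.get("name")]
```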
// LLM: 'Extract every entity (org or program) offering SERVICE in CITY/STATE. Return JSON {name, address?, phone?, parent_org?, source_url}.'
Step 5: Dedup across sources, fill gaps via Scavio Local Pack lookup
Best-effort enrichment.
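The hash(name + state) dedup can be sketched directly; later sources only fill fields the kept record is missing. The Local Pack enrichment call is left out, since Scavio's endpoint shape isn't specified here:

```python
import hashlib

def dedup_key(name: str, state: str) -> str:
    """Stable key: whitespace/case-normalized name + state, hashed."""
    norm = " ".join(name.lower().split())
    return hashlib.sha1(f"{norm}|{state.upper()}".encode()).hexdigest()

def dedup(entities: list[dict]) -> list[dict]:
    """Keep the first record per (name, state); merge missing fields from later ones."""
    seen: dict[str, dict] = {}
    for e in entities:
        k = dedup_key(e.get("name", ""), e.get("state", ""))
        if k not in seen:
            seen[k] = dict(e)
        else:
            for field, val in e.items():
                seen[k].setdefault(field, val)  # fill gaps, never overwrite
    return list(seen.values())
```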
// Dedup by hash(name + state). Optional Scavio Local Pack search to fetch address/phone if missing.
Step 6: Publish (or paywall) the directory
Moat = source curation.
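The Next.js site itself is out of scope; as a sketch, here is one shape for the data it could consume — one path per entity plus per-state facet pages for the internal links. Paths and fields are assumptions:

```python
import re
from collections import defaultdict

def slugify(text: str) -> str:
    """Lowercase, non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def build_pages(entities: list[dict]) -> dict:
    """One page per entity plus state facet indexes for internal linking."""
    pages = {f"/{slugify(e['state'])}/{slugify(e['name'])}": e for e in entities}
    facets = defaultdict(list)
    for path in pages:
        facets["/" + path.split("/")[1]].append(path)  # state index -> entity pages
    return {"entity_pages": pages, "facet_pages": dict(facets)}
```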
// Structured data on a Next.js site. Each entity = one page. Internal links across geography facets.
Python Example
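A back-of-envelope check on the cost figure quoted in this section. The per-call prices are assumptions, not Scavio's or any LLM provider's actual pricing:

```python
def cost_per_state(
    n_calls: int,
    serp_cost_per_call: float = 0.01,  # assumption: check Scavio pricing
    llm_cost_per_call: float = 0.01,   # assumption: depends on model and tokens
) -> float:
    """One SERP call plus one LLM extraction per dork/geography pair."""
    return n_calls * (serp_cost_per_call + llm_cost_per_call)

# At the quoted 50-200 calls per state, this lands in the low single dollars.
low, high = cost_per_state(50), cost_per_state(200)
```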
# Per-vertical-state: ~50-200 Scavio calls + LLM extraction ≈ $1-5 per state mapped.
JavaScript Example
// Same in TS.
Expected Output
A comprehensive national directory of fragmented-vertical programs and orgs. It catches the sub-program listings Google Maps misses, and becomes a moat for content sites or niche SaaS.