Building a Comprehensive Directory When Google Maps Misses Listings
An r/webscraping post described a blocker most directory builders eventually hit: the target service is often offered as a sub-program inside a larger org, so Google Maps plus keyword searches miss most of the listings. Single-source scraping fails on fragmented verticals; the fix is multi-source dorked search plus LLM-driven entity extraction.
Why fragmented verticals break single-source scraping
Google Maps catalogs "Acme Plumbing LLC" cleanly. It doesn't catalog "Acme Plumbing's drain-cleaning subprogram, available Tuesday and Thursday afternoons by appointment." Yelp doesn't either. Yellow Pages doesn't. For services that exist as sub-programs (community Wi-Fi initiatives inside libraries, pediatric speech therapy programs inside hospitals, free legal aid inside law clinics), single-source scrapers catalog 20-30% of reality.
The source list discipline
For most fragmented verticals, 3-7 sources cover 80%+ of the long tail:
- National association directory.
- State / regional regulator databases (where applicable).
- 1-2 niche aggregator sites (vertical-specific).
- Reddit communities for the vertical.
- .gov / .edu pages mentioning the program type.
- Foundation / nonprofit grant directories (for nonprofit-adjacent services).
- Industry conference attendee lists (often public).
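The research output is worth encoding as data so the rest of the pipeline stays generic. A minimal sketch; the vertical key and every domain below are placeholders, not real sources:

SOURCES = {
    'pediatric_speech': {  # hypothetical vertical key
        'association': 'national-assoc.org',        # placeholder domains throughout
        'regulators': ['health.state-example.gov'],
        'aggregators': ['niche-aggregator.example'],
        'subreddits': ['slp'],                      # illustrative subreddit
        'grant_dirs': ['foundation.org'],
    },
}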
Per-source dork generation
Each source needs a different dork shape. Build the set once per vertical:
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def discover_directory(vertical, geography, subreddit):
    # Domains below are placeholders -- substitute the real per-vertical sources.
    dorks = [
        f'site:national-assoc.org "{vertical}" programs',
        f'site:state.gov "{vertical}" {geography}',
        f'site:reddit.com/r/{subreddit} recommendation',
        f'site:facebook.com/groups "{vertical}" {geography}',
        f'site:foundation.org grant "{vertical}" {geography}',
    ]
    out = []
    for q in dorks:
        r = requests.post(
            'https://api.scavio.dev/api/v1/search',
            headers=H, json={'query': q},
        ).json()
        out.extend(r.get('organic_results', []))
    return out
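A hypothetical invocation for one vertical-state pair (all three arguments are illustrative):

results = discover_directory('pediatric speech therapy', 'Ohio', 'slp')
print(len(results), 'candidate snippets for the extraction step')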
LLM extraction handles the variation
Sub-programs are mentioned in wildly different formats across sources. A registry might say "XYZ Hospital — Department of Pediatric Speech". A Facebook group post might say "the speech program at XYZ". A Reddit thread might say "saw kids getting speech therapy at XYZ hospital, ask for the pediatric department". Pattern matching fails. LLM extraction works.
LLM prompt:
Extract every entity (org or program) offering SERVICE in
CITY/STATE from this snippet. An entity is anything that
provides the service, even if it's a sub-program inside a
larger org.
Return JSON list:
[{name, parent_org?, address?, phone?, source_url, evidence_quote}]
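A minimal sketch of the extraction call using the OpenAI Python SDK; the model name is an assumption, and production use needs JSON validation and retries:

from openai import OpenAI
import json

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_entities(snippet, service, geo):
    prompt = (
        f'Extract every entity (org or program) offering {service} in {geo} '
        'from this snippet. An entity is anything that provides the service, '
        'even if it is a sub-program inside a larger org. Return a JSON list: '
        '[{"name", "parent_org", "address", "phone", "source_url", "evidence_quote"}]\n\n'
        + snippet
    )
    resp = client.chat.completions.create(
        model='gpt-4o-mini',  # assumption: any JSON-capable model works
        messages=[{'role': 'user', 'content': prompt}],
    )
    return json.loads(resp.choices[0].message.content)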
Dedup across sources is hard but mandatory
The same hospital's pediatric speech program will appear in four sources under four different names. Dedup by hashing normalized parent_org + program_type, then manually review the top-tier matches. Skip dedup and the directory looks larger than it is; users notice, and trust collapses.
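A sketch of the dedup key; the normalization rules here are illustrative and need tuning per vertical:

import hashlib, re

def dedup_key(parent_org, program_type):
    def norm(s):
        s = re.sub(r'[^a-z0-9 ]', '', s.lower())                    # strip punctuation
        return re.sub(r'\b(llc|inc|corp|ltd|co)\b', '', s).strip()  # drop legal suffixes
    return hashlib.sha1(f'{norm(parent_org)}|{norm(program_type)}'.encode()).hexdigest()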
Local Pack cross-check fills missing fields
For each entity surfaced, run a Scavio Local Pack search to fetch missing address/phone/website where available. This is the "fill the gaps" layer: dorked discovery surfaces existence, Local Pack fills standard fields where the entity has a public profile.
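A sketch of the gap-fill step, assuming Local Pack hits come back under a local_results key on the same /search endpoint (check the Scavio docs for the actual response shape):

import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def fill_gaps(entity):
    q = f"{entity['name']} {entity.get('parent_org', '')}".strip()
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers=H, json={'query': q}).json()
    hits = r.get('local_results', [])  # assumed key, not confirmed
    for field in ('address', 'phone', 'website'):
        if hits and not entity.get(field):
            entity[field] = hits[0].get(field)  # only fill fields still missing
    return entity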
Per-vertical-state cost
~50-200 Scavio calls plus LLM extraction per snippet works out to roughly $1-5 per vertical per state mapped. A full US 50-state map for a fragmented vertical is 50 × ($1-5) ≈ $50-250, typically under $200. The deliverable (the directory) is a moat for niche SaaS, content sites, or services marketplaces; the unit economics are strongly favorable when the directory is the product.
Publishing as the moat
Once you have the directory, you have something single-source scrapers can't produce. Publish as a Next.js / Astro site. Each entity = one page. Internal links across geography facets. Schema markup. Sitemap submitted. Within 6-12 months, you rank for "[service] in [city]" queries against single-source-scraper competitors who missed the long tail.
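A minimal sketch of the per-page schema markup, rendered as JSON-LD from the entity record; the field mapping is an assumption, and the right @type depends on the vertical:

import json

def entity_jsonld(e):
    # Map one extracted entity onto schema.org LocalBusiness (sketch only).
    d = {
        '@context': 'https://schema.org',
        '@type': 'LocalBusiness',
        'name': e['name'],
        'address': e.get('address'),
        'telephone': e.get('phone'),
        'url': e.get('source_url'),
    }
    return json.dumps({k: v for k, v in d.items() if v})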
Why this isn't commoditizing fast
The work is per-vertical research. There's no general "build a comprehensive directory of any service" tool because the source list is always vertical-specific. That research overhead is the structural advantage; the team that does it for one vertical at a time builds a compounding portfolio.
What this stack ISN'T
It isn't a Maps replacement for verticals where Maps coverage is genuinely comprehensive (chain retail, restaurants, gas stations). For those, Outscraper or similar Maps-native APIs are cheaper and more complete. The dorked-search + LLM-extract approach pays off specifically on the long tail Maps misses.