Gov Portal Search Fallback Workflow

Daily extraction from government portals: Scavio dorked search runs first, and Playwright handles only auth-gated targets. This cuts captcha exposure by 80% or more.

Overview

Each day, for every government-document topic, the workflow runs dorked searches through Scavio to find indexed pages and routes auth-gated targets to Playwright. It then extracts structured records from the fetched content.

Trigger

Daily cron job at 7:00 am

Schedule

Daily at 7:00 am
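
For setups without a system cron, the 7:00 am schedule can also be computed in-process. The sketch below is one minimal way to do that. `RUN_AT` and `next_run` are hypothetical names, and it assumes naive local-time datetimes because the source does not state a timezone. The equivalent crontab expression would be `0 7 * * *`.

```python
from datetime import datetime, time, timedelta

# Assumption: 7:00 am in the machine's local time (no timezone is given in the workflow).
RUN_AT = time(hour=7)


def next_run(now: datetime) -> datetime:
    """Return the next daily run time strictly after `now` (naive datetimes only)."""
    candidate = datetime.combine(now.date(), RUN_AT)
    if candidate <= now:
        # Today's slot has already passed, so schedule tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

A long-running worker could sleep for `(next_run(datetime.now()) - datetime.now()).total_seconds()` between runs. A plain cron entry is usually simpler when one is available.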

Workflow Steps

1. Load target list (domain + topic)
   From a YAML config or DB table.

2. Per target: classify indexed vs auth-gated
   Use a per-target flag set during onboarding.

3. Indexed: Scavio dorked search across 4 templates
   site:, filetype:, intitle:, inurl: variations.

4. Dedupe URLs across templates
   Same URL across dorks = one source.

5. Scavio /extract for top-N URLs
   Markdown ready for LLM extraction.

6. Auth-gated: Playwright/Stagehand fetch
   Only the small subset that requires login.

7. LLM structured extraction
   Per markdown blob, return JSON {title, date, summary, entities}.

8. Append to records DB
   Postgres / Sheets / etc.
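
The routing in step 2 and the deduplication in step 4 can be sketched as follows. This is an illustrative sketch, not the workflow's actual code. The `Target` class, its `auth_gated` field, and the `route`, `normalize`, and `dedupe` helpers are hypothetical names. Normalizing the host and dropping the fragment goes slightly beyond the "same URL" rule stated above and is an assumption about what should count as a duplicate.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, urlunsplit


@dataclass(frozen=True)
class Target:
    domain: str
    topic: str
    auth_gated: bool = False  # hypothetical per-target flag set during onboarding


def route(targets):
    """Split targets into (indexed, auth_gated) lists, preserving input order."""
    indexed = [t for t in targets if not t.auth_gated]
    gated = [t for t in targets if t.auth_gated]
    return indexed, gated


def normalize(url):
    """Lowercase the scheme and host and drop the fragment so trivial variants match."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def dedupe(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Indexed targets would then go to the dorked search, and gated targets to the Playwright/Stagehand path. Keeping first-seen order means the top-N cut in step 5 still favors the highest-ranked results.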

Python Implementation

Python
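import os
import requests

# Scavio API key, sent in the x-api-key header.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
# Dork templates: {d} is the portal domain, {t} is the topic.
DORKS = ['site:{d} {t}', 'site:{d} filetype:pdf {t}', 'site:{d} intitle:{t}', 'site:{d} inurl:reports {t}']

def search_first(domain, topic):
    """Run each dork template and return the deduplicated top URLs."""
    urls = []
    for tpl in DORKS:
        q = tpl.format(d=domain, t=topic)
        resp = requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': q}, timeout=30)
        resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
        # Keep the top 5 organic results per template.
        urls.extend(o['link'] for o in resp.json().get('organic_results', [])[:5])
    # dict.fromkeys dedupes while keeping first-seen order (set() would not).
    return list(dict.fromkeys(urls))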
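
Steps 7 and 8 of the workflow (validating the LLM's JSON and appending it to the records DB) might look like the sketch below. It is an assumption-laden illustration: `parse_record`, `append_record`, and the `records` table layout are hypothetical, and SQLite stands in for Postgres so that the example runs without a server. The four required fields come from the step 7 description.

```python
import json
import sqlite3

# Fields the workflow expects from the LLM extraction step.
REQUIRED = ("title", "date", "summary", "entities")


def parse_record(raw):
    """Parse one LLM response and check that all expected fields are present."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


def append_record(conn, url, record):
    """Insert one record; the URL primary key makes daily re-runs idempotent."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "url TEXT PRIMARY KEY, title TEXT, date TEXT, summary TEXT, entities TEXT)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO records VALUES (?, ?, ?, ?, ?)",
        (url, record["title"], record["date"], record["summary"],
         json.dumps(record["entities"])),  # store the entity list as JSON text
    )
    conn.commit()
```

Keying on the source URL is one design choice among several. It prevents a page that is found again on the next daily run from producing a second row, at the cost of ignoring later revisions of the same page.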

JavaScript Implementation

JavaScript
// The TypeScript version follows the same flow as the Python implementation above.

Platforms Used

Google

Web search with knowledge graph, People Also Ask (PAA), and AI Overviews

Frequently Asked Questions

This workflow runs on a daily cron schedule at 7:00 am.

This workflow uses one Scavio platform: Google. Every Scavio platform is called through the same unified API endpoint.

This workflow can be tried for free. Scavio's free tier includes 500 credits per month with no credit card required, which is enough to test and validate the workflow before scaling it.
