scraping · google-dorks · crewai

Google Dorks Plus LLM Replaces Real-Time Scraping

Cron at dawn, Google Dorks discovery, LLM-typed JSON, SQLite cache. Pattern from r/LangChain replaces brittle Selenium.


An r/LangChain post documented a clean migration: from Selenium plus Playwright scraping of LATAM government portals to a Google Dorks plus Llama-3 plus MCP architecture. The old pipeline broke weekly. The new one runs at dawn, returns typed JSON in 50ms from cache, and does not fight captchas.

Why real-time scraping fails for government portals

Government sites change layouts unpredictably. Captchas appear when traffic patterns shift. PDFs are large enough to break LLM context windows. Maintaining a Selenium fleet for 30 portals burns a full-time engineer.

The new shape

Cron at dawn. Google Dorks via a search API to discover fresh PDFs. Extract endpoint to convert PDFs to markdown. LLM to convert markdown to typed JSON. SQLite cache for sub-50ms repeat lookups. CrewAI agent reads from the cache during business hours.

Python
import os, requests, sqlite3, json, time

API_KEY = os.environ['SCAVIO_API_KEY']
H = {'x-api-key': API_KEY}

# One dork per country portal: PDFs mentioning 2026 tenders/contracts
DORKS = [
    'site:gov.br filetype:pdf 2026 contratos',
    'site:gob.mx filetype:pdf 2026 licitaciones',
    'site:gob.cl filetype:pdf 2026 licitaciones',
]

def crawl():
    conn = sqlite3.connect('bids.db')
    conn.execute('CREATE TABLE IF NOT EXISTS bids(url TEXT PRIMARY KEY, payload TEXT, ts REAL)')
    for q in DORKS:
        # Discover fresh PDFs via the search endpoint
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers=H, json={'query': q}).json()
        for o in r.get('organic_results', []):
            if o.get('link', '').endswith('.pdf'):
                # Convert each PDF to markdown via the extract endpoint
                e = requests.post('https://api.scavio.dev/api/v1/extract',
                    headers=H, json={'url': o['link'], 'format': 'markdown'}).json()
                # Upsert keyed by URL; ts drives the cache TTL
                conn.execute('INSERT OR REPLACE INTO bids VALUES (?, ?, ?)',
                    (o['link'], json.dumps(e), time.time()))
    conn.commit()
    conn.close()

The LLM step

The original post used Llama-3 via Groq for the typed-JSON conversion. Any model with reliable JSON output works: Claude Haiku, GPT-4o-mini, DeepSeek. The prompt asks for the fields the downstream agent needs (title, deadline, amount, agency), and the pipeline rejects responses that don't parse.
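A minimal sketch of that step, assuming Groq's OpenAI-compatible chat endpoint; the helper name, prompt wording, and truncation limit are illustrative, and the required fields are the ones named above.

Python
import os, json, requests

GROQ_URL = 'https://api.groq.com/openai/v1/chat/completions'
GROQ_H = {'Authorization': f"Bearer {os.environ['GROQ_API_KEY']}"}
FIELDS = ('title', 'deadline', 'amount', 'agency')

def to_typed_json(markdown: str):
    # Ask for a bare JSON object; truncate to stay inside the context window
    prompt = ('Extract title, deadline, amount, agency from this tender '
              'notice. Reply with a single JSON object and nothing else.\n\n'
              + markdown[:8000])
    r = requests.post(GROQ_URL, headers=GROQ_H, json={
        'model': 'llama3-70b-8192',
        'messages': [{'role': 'user', 'content': prompt}],
        'temperature': 0,
    }).json()
    try:
        out = json.loads(r['choices'][0]['message']['content'])
    except (KeyError, IndexError, json.JSONDecodeError):
        return None  # reject responses that don't parse
    # Reject objects missing any field the downstream agent needs
    return out if all(k in out for k in FIELDS) else None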

Why dawn matters

Running the discovery and extraction step at 4 AM moves the latency-sensitive work outside business hours. When the CrewAI agent runs during the day, every query is a cache hit. The agent feels instant; the heavy lifting happened overnight.
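A minimal entry point for that schedule, assuming the crawl() from the code above; the crontab line in the comment uses hypothetical paths.

Python
# Schedule with a crontab entry such as (paths are hypothetical):
#   0 4 * * * /usr/bin/python3 /opt/bids/crawl.py >> /var/log/bids.log 2>&1
if __name__ == '__main__':
    crawl()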

What about cache misses

New queries during business hours fall through to a live Scavio extract call. That's slower (1-3 seconds end to end) but rare. Set the cache TTL to 24 hours for government portals; fresher sources can use 1-6 hour TTLs.
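The read path, as a sketch under the same assumptions as the crawl code (the bids.db schema, the Scavio extract endpoint); the lookup() name is illustrative.

Python
import os, json, time, sqlite3, requests

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
TTL = 24 * 3600  # 24h suits government portals; drop to 1-6h for fresher sources

def lookup(url: str) -> dict:
    conn = sqlite3.connect('bids.db')
    row = conn.execute('SELECT payload, ts FROM bids WHERE url = ?', (url,)).fetchone()
    if row and time.time() - row[1] < TTL:
        conn.close()
        return json.loads(row[0])  # cache hit: sub-50ms
    # Cache miss: fall through to a live extract call (1-3s end to end)
    e = requests.post('https://api.scavio.dev/api/v1/extract',
        headers=H, json={'url': url, 'format': 'markdown'}).json()
    conn.execute('INSERT OR REPLACE INTO bids VALUES (?, ?, ?)',
        (url, json.dumps(e), time.time()))
    conn.commit()
    conn.close()
    return e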

Cost compared to Selenium plus proxies

A Selenium fleet with rotating proxies for 30 portals runs $200+/month in proxy fees plus a couple of hours of engineer time per week on maintenance. The Scavio version: 5 dorks × ~20 PDFs each is ~100 extract calls, plus 5 search calls, for ~105 calls/day at the Project tier, roughly $0.45/day. The IP rotation problem disappears entirely.
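As a sanity check on the call math (numbers from the paragraph above; the pricing is the post's, not derived here):

Python
dorks, pdfs_per_dork = 5, 20
search_calls = dorks                   # one search per dork
extract_calls = dorks * pdfs_per_dork  # ~100 PDF extractions
print(search_calls + extract_calls)    # ~105 calls/day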

Honest constraints

The pattern only works for portals whose content Google has indexed. If a site uses robots.txt to block Google, dorks won't see it. For those targets, you either need a direct relationship with the publisher or a real cloud browser. The Selenium pipeline still wins for those edge cases.

The MCP layer for CrewAI

Wire mcp.scavio.dev/mcp into the CrewAI agent so it can call search and extract directly. The cache sits between the agent and the live API. The agent code stays small because the heavy lifting moved to the cron job.
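A minimal wiring sketch, assuming crewai-tools' MCPServerAdapter and its streamable-HTTP transport; the role, goal, and task text are illustrative, and adapter options vary by version.

Python
from crewai import Agent, Task, Crew
from crewai_tools import MCPServerAdapter

server_params = {'url': 'https://mcp.scavio.dev/mcp', 'transport': 'streamable-http'}

with MCPServerAdapter(server_params) as tools:
    agent = Agent(
        role='Bid researcher',
        goal='Answer questions about LATAM government tenders',
        backstory='Reads the overnight cache first; falls back to live search.',
        tools=tools,  # search + extract exposed by the MCP server
    )
    task = Task(
        description='List new 2026 licitaciones from gob.cl',
        expected_output='Typed JSON list of bids',
        agent=agent,
    )
    Crew(agents=[agent], tasks=[task]).kickoff()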