
Playwright Fallback: The Search-First Pattern in 2026

An r/LangChain DaaS post showed the pattern. Route by target type: search API for indexed targets (~85%), browser for auth-gated ones (~15%). The result: 80%+ less captcha exposure.


An r/LangChain post (also cross-posted to r/crewai) described an autonomous DaaS architecture for LATAM gov sites where Playwright kept breaking. The fallback: Google Dorks + Llama-3 + MCP. The pattern generalizes beyond LATAM gov; it applies to any agent that scrapes public data at scale.

Why pure-Playwright pipelines fail

Three failure modes show up at scale:

  • Cloudflare/captcha walls. Browsers look human enough to pass first checks but fail under repeated load. Captcha appears, then IP blocks.
  • JS rendering cost. Headless browsers burn paid browser-time on every page, and retries after failed renders multiply it. At scale, this dwarfs search-API per-call costs.
  • Maintenance debt. Selectors change, anti-bot scripts update, captchas evolve. Every break costs an engineer-day.

The search-first fallback architecture

Route by target type:

  • Indexed/public targets -> structured search API (Scavio, SerpAPI, Tavily). Cheap, reliable, no browser fight.
  • Auth-gated/JS-only targets -> real browser (Playwright/Stagehand/Browserbase). Expensive but necessary.

For most agents, the first bucket is 80-95% of work. The second is the remaining minority that genuinely needs a browser.
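
The split above can be sketched as a minimal router. The `AUTH_GATED` list and the `route` helper are illustrative placeholders, not part of the original post; in practice the classification would come from config or a probe request.

```python
from urllib.parse import urlparse

# Hypothetical list of hosts known to require a real browser session.
AUTH_GATED = {'portal.example.gov', 'intranet.example.com'}

def route(url: str) -> str:
    """Return which pipeline should handle this target."""
    host = urlparse(url).netloc
    if host in AUTH_GATED:
        return 'browser'   # Playwright / Stagehand / Browserbase
    return 'search'        # structured search API

print(route('https://docs.example.gov/report.pdf'))  # -> search
print(route('https://portal.example.gov/login'))     # -> browser
```

Keeping the router this dumb is deliberate: the expensive browser path is opt-in per host, so a new target defaults to the cheap bucket.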

The dorks-first pattern

Google Dorks (site:, filetype:, intitle:, inurl:) are the bridge between "I want this gov document" and "the gov portal blocks Playwright". Google has already indexed the document; the search API returns it as typed JSON.

Python
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

DORK_TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]

def search_first(domain, topic):
    """Dorked SERP via Scavio. No browser, no captcha fight."""
    urls = []
    for tpl in DORK_TEMPLATES:
        q = tpl.format(domain=domain, topic=topic)
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers=H, json={'query': q}, timeout=30).json()
        urls.extend(o['link'] for o in r.get('organic_results', [])[:5])
    return list(dict.fromkeys(urls))  # dedupe, preserving first-seen order

def extract(url):
    """Markdown ready for LLM extraction."""
    return requests.post('https://api.scavio.dev/api/v1/extract',
        headers=H, json={'url': url, 'format': 'markdown'},
        timeout=30).json().get('markdown', '')

The Llama-3 (or any LLM) extraction step

Once the document is in markdown, structured extraction is a standard LLM prompt:

Python
# JSON braces in the template are doubled so str.format only substitutes {md}
PROMPT = '''Extract from this gov document:
- title, date_published, summary (3 sentences), key entities (list)
Return JSON: {{"title": ..., "date_published": ..., "summary": ..., "entities": [...]}}

Document:
{md}'''

result = llm.complete(PROMPT.format(md=markdown))
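
LLM replies rarely come back as pure JSON; models wrap the object in prose or code fences. A defensive parse step like the following (an assumption on my part, not from the post) keeps the pipeline from crashing on wrapped output:

```python
import json
import re

def parse_extraction(raw: str) -> dict:
    """Pull the first JSON object out of an LLM reply.

    Uses a greedy brace match, which is enough for a single
    top-level object; a stricter pipeline would use a JSON-mode
    or tool-call API instead.
    """
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError('no JSON object in LLM output')
    return json.loads(match.group(0))

reply = 'Here is the result:\n{"title": "Budget 2026", "entities": ["MinFin"]}'
print(parse_extraction(reply)['title'])  # -> Budget 2026
```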

The MCP wiring

For agent stacks, expose the search-first pattern as a named MCP tool. Scavio's hosted MCP at https://mcp.scavio.dev/mcp already exposes search, reddit_search, youtube_search, extract as named tools. Attach to Claude Code with one line:

Bash
# double quotes so the shell expands $SCAVIO_API_KEY
claude mcp add scavio https://mcp.scavio.dev/mcp \
  --header "x-api-key: $SCAVIO_API_KEY"

Cost math vs pure Playwright

For a 1,000-page gov document extraction job:

  • Pure Playwright on Browserbase Developer: ~$2-5 in browser-time + 30-50% captcha failure rate (= rerun costs)
  • Search-first with Scavio: ~$4.30 in search/extract credits + 1-2% failure rate

Raw cost is comparable; variance is wildly different. The search-first pipeline runs predictably; the Playwright pipeline burns engineer time on failures.
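
The variance point can be made concrete with a first-order expected-cost model: failed pages get one rerun pass. The per-page figures below are midpoints of the ranges quoted above, and the single-rerun assumption is mine (real reruns can also fail).

```python
def expected_cost(pages: int, per_page: float, fail_rate: float) -> float:
    """Expected spend including one rerun pass for failed pages."""
    return pages * per_page * (1 + fail_rate)

# Midpoints of the ranges above: $2-5 / 1,000 pages, 30-50% failures
playwright = expected_cost(1000, 0.0035, 0.40)   # ≈ $4.90
search     = expected_cost(1000, 0.0043, 0.015)  # ≈ $4.36
```

The totals land in the same ballpark; what differs is that 40% of the Playwright run is rework that an engineer has to babysit, versus 1.5% on the search path.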

When search-first does not work

Three cases keep Playwright as the right call:

  • Auth-gated portals (login required)
  • JS-only SPAs that don't server-side render
  • Documents Google never indexed (intranets, very recent uploads, pages excluded via robots.txt)

For everything else, the search-first fallback wins on operational stability, which is what production agents actually optimize for.
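
Tying the two buckets together, the fallback itself is a small wrapper: try the cheap path, escalate only when it returns nothing or raises. The function below is a sketch with injected fetchers (`search_fetch`, `browser_fetch` are placeholders for the extract call and a Playwright runner; neither name is from the post).

```python
def fetch_with_fallback(url, search_fetch, browser_fetch):
    """Search-first with browser escalation.

    Tries the search/extract path; falls back to a real browser
    only when that path raises or returns empty content.
    """
    try:
        content = search_fetch(url)
        if content and content.strip():
            return content, 'search'
    except Exception:
        pass  # any search-path failure escalates to the browser
    return browser_fetch(url), 'browser'

# Usage with stub fetchers:
ok = lambda u: '# doc'
empty = lambda u: ''
print(fetch_with_fallback('https://x.gov/a.pdf', ok, empty))
# -> ('# doc', 'search')
print(fetch_with_fallback('https://x.gov/b', empty, lambda u: '<html>'))
# -> ('<html>', 'browser')
```

The broad `except` is intentional here: from the router's point of view, any search-path error means the same thing, "pay for the browser".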

The honest summary

The r/LangChain post pattern is real and portable. Replace the "Playwright breaks weekly" story in your stack with "dorked search via Scavio for 85% of targets, Playwright for the auth-gated 15%". The agent stops breaking; the engineer-time bill drops proportionally.