Playwright Fallback: The Search-First Pattern in 2026
An r/LangChain DaaS post showed it. Route by target type: search API for indexed (85%), browser for auth-gated (15%). 80%+ less captcha exposure.
An r/LangChain post (also cross-posted to r/crewai) described an autonomous DaaS architecture for LATAM gov sites where Playwright kept breaking. The fallback: Google Dorks + Llama-3 + MCP. The pattern generalizes beyond LATAM gov; it applies to any agent that scrapes public data at scale.
Why pure-Playwright pipelines fail
Three failure modes show up at scale:
- Cloudflare/captcha walls. Browsers look human enough to pass first checks but fail under repeated load. Captcha appears, then IP blocks.
- JS rendering cost. Headless browsers cost ~$0.50-2.00 per page in browser-time. At scale, this dwarfs search-API per-call costs.
- Maintenance debt. Selectors change, anti-bot scripts update, captchas evolve. Every break costs an engineer-day.
The search-first fallback architecture
Route by target type:
- Indexed/public targets -> structured search API (Scavio, SerpAPI, Tavily). Cheap, reliable, no browser fight.
- Auth-gated/JS-only targets -> real browser (Playwright/Stagehand/Browserbase). Expensive but necessary.
For most agents, the first bucket is 80-95% of work. The second is the remaining minority that genuinely needs a browser.
The dorks-first pattern
Google Dorks (site:, filetype:,intitle:, inurl:) are the bridge between "I want this gov document" and "the gov portal blocks Playwright". Google has indexed the document; the search API returns it as typed JSON.
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
DORK_TEMPLATES = [
'site:{domain} filetype:pdf {topic}',
'site:{domain} intitle:{topic}',
'site:{domain} inurl:reports {topic}',
'site:{domain} {topic} 2026',
]
def search_first(domain, topic):
"""Dorked SERP via Scavio. No browser, no captcha fight."""
urls = []
for tpl in DORK_TEMPLATES:
q = tpl.format(domain=domain, topic=topic)
r = requests.post('https://api.scavio.dev/api/v1/search',
headers=H, json={'query': q}).json()
urls.extend(o['link'] for o in r.get('organic_results', [])[:5])
return list(set(urls))
def extract(url):
"""Markdown ready for LLM extraction."""
return requests.post('https://api.scavio.dev/api/v1/extract',
headers=H, json={'url': url, 'format': 'markdown'}).json().get('markdown', '')The Llama-3 (or any LLM) extraction step
Once the document is in markdown, structured extraction is a standard LLM prompt:
PROMPT = '''Extract from this gov document:
- title, date_published, summary (3 sentences), key entities (list)
Return JSON: {"title": ..., "date_published": ..., "summary": ..., "entities": [...]}
Document:
{md}'''
result = llm.complete(PROMPT.format(md=markdown))The MCP wiring
For agent stacks, expose the search-first pattern as a named MCP tool. Scavio's hosted MCP at https://mcp.scavio.dev/mcp already exposes search, reddit_search, youtube_search, extract as named tools. Attach to Claude Code with one line:
claude mcp add scavio https://mcp.scavio.dev/mcp \
--header 'x-api-key: $SCAVIO_API_KEY'Cost math vs pure Playwright
For a 1,000-page gov document extraction job:
- Pure Playwright on Browserbase Developer: ~$2-5 in browser-time + 30-50% captcha failure rate (= rerun costs)
- Search-first with Scavio: ~$4.30 in search/extract credits + 1-2% failure rate
Raw cost is comparable; variance is wildly different. The search-first pipeline runs predictably; the Playwright pipeline burns engineer time on failures.
When search-first does not work
Three cases keep Playwright as the right call:
- Auth-gated portals (login required)
- JS-only SPAs that don't server-side render
- Documents not indexed by Google (intranet, recent uploads behind robots.txt)
For everything else, the search-first fallback wins on operational stability, which is what production agents actually optimize for.
The honest summary
The r/LangChain post pattern is real and portable. Replace the "Playwright breaks weekly" story in your stack with "dorked search via Scavio for 85% of targets, Playwright for the auth-gated 15%". The agent stops breaking; the engineer-time bill drops proportionally.