Why Real-Time Scraping Fails for Government Portals
Layouts change, captchas appear, PDFs break the context window. An async dawn-cron pattern replaces the live Selenium fight.
An r/LangChain post described the LATAM-government-portal pain in a single line: "Selenium and Playwright on government portals is a nightmare. Layouts change, captchas appear, PDFs break the context window." The OP migrated to a Google Dorks plus LLM extraction pipeline and never looked back. The reasons real-time scraping fails on government targets are worth unpacking.
The layout problem
Government portals change layouts on no schedule. A working selector breaks when an IT department deploys a CSS update on a Friday afternoon. Production agents downstream fail Saturday morning. The fix takes hours of investigation in a live browser to find the new selector path.
The captcha problem
Many portals (especially in LATAM and the EU) deploy reCAPTCHA or hCaptcha when traffic patterns look bot-like. The captcha breaks the headless browser. Solving it means either a third-party captcha-solving service ($) or human-in-the-loop relays ($$$). Neither scales for a daily agent run.
The PDF problem
Government bid documents are PDFs. Sometimes scanned PDFs. Often 50-200 pages. Pulling them into the agent context blows the token budget. Even after OCR and stripping boilerplate, a single bid PDF can run 20K+ tokens — too much to feed alongside the agent's reasoning prompt.
The fix is async
Move the discovery and extraction step out of the agent's live loop. Cron at 4 AM. Google Dorks for fresh PDFs. Extract endpoint converts each PDF to markdown. LLM converts markdown to typed JSON with only the fields the agent cares about (title, deadline, amount, agency). Cache in SQLite.
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']
H = {'x-api-key': API_KEY}

# Dorks target fresh 2026 bid PDFs on Brazilian and Mexican portals
DORKS = [
    'site:gov.br filetype:pdf 2026 contratos',
    'site:gob.mx filetype:pdf 2026 licitaciones'
]

def fetch_fresh_pdfs():
    out = []
    for q in DORKS:
        # Search step: let Google's index do the discovery
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers=H, json={'query': q}).json()
        for o in r.get('organic_results', []):
            if o.get('link', '').endswith('.pdf'):
                # Extract step: convert the PDF to markdown outside the agent's live loop
                e = requests.post('https://api.scavio.dev/api/v1/extract',
                                  headers=H, json={'url': o['link'], 'format': 'markdown'}).json()
                out.append({'url': o['link'], 'md': e.get('markdown', '')})
    return out

What the LLM step does
The OP used Llama-3 via Groq for typed-JSON conversion. Any model with strong JSON output works. The prompt asks for specific fields and rejects responses that don't parse. Markdown of a 50-page PDF runs 8K-15K tokens; the typed JSON output is 200-500 tokens. The agent's context budget is protected.
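A minimal sketch of that conversion step, assuming Groq's OpenAI-compatible chat completions endpoint; the model name, field list, and prompt wording are placeholders, not the OP's exact setup.

import os, json, requests

GROQ_KEY = os.environ['GROQ_API_KEY']
FIELDS = ['title', 'deadline', 'amount', 'agency']

def to_typed_json(md):
    # Ask for a strict JSON object; reject anything that doesn't parse
    prompt = (
        'Extract these fields from the government bid document below and return '
        f'ONLY a JSON object with keys {FIELDS}. Use null for missing values.\n\n' + md
    )
    r = requests.post(
        'https://api.groq.com/openai/v1/chat/completions',
        headers={'Authorization': f'Bearer {GROQ_KEY}'},
        json={
            'model': 'llama-3.1-8b-instant',  # placeholder; any model with strong JSON output works
            'messages': [{'role': 'user', 'content': prompt}],
            'temperature': 0,
        },
    ).json()
    try:
        rec = json.loads(r['choices'][0]['message']['content'])
    except (KeyError, ValueError):
        return None  # caller skips or retries; unparsed text never reaches the agent
    return {k: rec.get(k) for k in FIELDS}

Temperature 0 plus the parse-or-reject check keeps the output deterministic enough to cache.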
What the cache buys
The agent runs during business hours. Every query hits the cache. Sub-50ms response time. Predictable cost. The agent feels instant; the heavy lifting happened overnight.
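A sketch of the cache layer using the standard-library sqlite3 module; the table schema, file name, and the nightly() driver that ties it to the fetch_fresh_pdfs() and to_typed_json() sketches above are illustrative, not the OP's exact code.

import sqlite3

DB = 'bids.db'  # hypothetical path

def init_db():
    with sqlite3.connect(DB) as con:
        con.execute('''CREATE TABLE IF NOT EXISTS bids (
            url TEXT PRIMARY KEY, title TEXT, deadline TEXT,
            amount TEXT, agency TEXT,
            fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)''')

def upsert(url, rec):
    # Called by the 4 AM job after the LLM step
    with sqlite3.connect(DB) as con:
        con.execute(
            'INSERT OR REPLACE INTO bids (url, title, deadline, amount, agency) '
            'VALUES (?, ?, ?, ?, ?)',
            (url, rec.get('title'), rec.get('deadline'), rec.get('amount'), rec.get('agency')))

def lookup(agency_like):
    # Called by the agent during business hours: a local read, no network
    with sqlite3.connect(DB) as con:
        con.row_factory = sqlite3.Row
        rows = con.execute(
            'SELECT * FROM bids WHERE agency LIKE ? ORDER BY deadline',
            ('%' + agency_like + '%',)).fetchall()
        return [dict(r) for r in rows]

def nightly():
    # The cron entry runs this once; the agent only ever calls lookup()
    init_db()
    for doc in fetch_fresh_pdfs():
        rec = to_typed_json(doc['md'])
        if rec:
            upsert(doc['url'], rec)

Keying INSERT OR REPLACE on the URL makes the nightly job idempotent: re-crawling the same PDF just refreshes its row.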
What the pattern doesn't solve
Sites blocked from Google's index. Auth-gated portals. Real-time tenders that publish and close within hours. For those, you still need a real browser or a direct relationship with the publisher. The Selenium pipeline still wins for those edge cases — just don't use it for the whole job.
What changed for the OP
Maintenance dropped to zero. The agent serves CrewAI requests in 50ms instead of 30 seconds. Captcha rotation became irrelevant because Google does the indexing. The bid PDFs compress into typed JSON the agent can reason about cleanly.
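As an illustration of the CrewAI side, the cache lookup can be exposed as a tool; a sketch assuming CrewAI's @tool decorator (the import path has moved between releases) and the lookup() helper from the cache sketch above.

import json
from crewai.tools import tool  # older releases: from crewai_tools import tool

@tool('search_cached_bids')
def search_cached_bids(agency: str) -> str:
    """Return cached government bids for an agency as a JSON string."""
    # Pure SQLite read: no browser, no captcha, no live network fight
    return json.dumps(lookup(agency))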
The architectural lesson
Real-time scraping is the wrong default for indexed targets. Async discovery plus typed-JSON cache is the right default. The live agent loop should never be where the network fight happens. Move that fight outside business hours and the agent stays fast and cheap.