An r/LangChain post shared an autonomous DaaS architecture built on Google Dorks, Llama-3, and MCP. The pattern works for any structured-document discovery job. This tutorial walks through the same flow on Scavio.
Prerequisites
- Python 3.10+
- Scavio API key
- Groq or Anthropic API key
Walkthrough
Step 1: Dork patterns for the target
Each dork combines a site: operator, a filetype: operator, and a keyword.
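As a minimal sketch of that pattern, the query strings can be composed from their parts instead of typed by hand. `build_dork`, `DORK_SPECS`, and `GENERATED_DORKS` are hypothetical names introduced here for illustration; the example values are the two queries used in this tutorial.

```python
def build_dork(site: str, filetype: str, *keywords: str) -> str:
    """Compose a query from a site: operator, a filetype: operator, and keywords."""
    parts = [f"site:{site}", f"filetype:{filetype}"]
    # Skip empty keywords so the query never contains stray spaces.
    parts.extend(k.strip() for k in keywords if k.strip())
    return " ".join(parts)


# (site, filetype, keywords...) tuples; hypothetical structure for illustration.
DORK_SPECS = [
    ("gov.br", "pdf", "2026", "contratos"),
    ("europa.eu", "pdf", "AI act"),
]
GENERATED_DORKS = [build_dork(*spec) for spec in DORK_SPECS]
```

Keeping the parts separate makes it easier to sweep one keyword across several sites, or one site across several years.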
DORKS = ['site:gov.br filetype:pdf 2026 contratos', 'site:europa.eu filetype:pdf AI act']
Step 2: Run the dork via Scavio search
Returns organic results pointing to PDFs.
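Network calls such as this search can fail transiently, so a daily job may want to retry them. The following is a hedged sketch of a generic retry wrapper with exponential backoff; `with_retry` is a hypothetical helper, not part of Scavio or `requests`, and it only retries when the wrapped call raises an exception.

```python
import time
from typing import Any, Callable


def with_retry(call: Callable[[], Any], attempts: int = 3, base_delay: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> Any:
    """Run call(); on an exception wait base_delay * 2**n seconds and try again."""
    for n in range(attempts):
        try:
            return call()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** n))
```

With the `dork` function from this step it could be invoked as `with_retry(lambda: dork(q))`. The `sleep` parameter exists only so the delay can be swapped out in tests.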
import requests, os
API_KEY = os.environ['SCAVIO_API_KEY']
def dork(q):
    return requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY}, json={'query': q}).json()
Step 3: Filter for fresh PDFs
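Filter by date or screen with an LLM. The simple filter here keeps results whose snippet mentions the target year and whose link ends in .pdf.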
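A link can carry a query string or fragment after the file name, which a suffix check on the full URL would miss. The sketch below is one hedged alternative that inspects only the URL path. `is_pdf_url` and `filter_fresh` are hypothetical names, and the `organic_results`, `snippet`, and `link` fields are assumed to match the response shape used in this step.

```python
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """True when the URL path (ignoring query string and fragment) ends in .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")


def filter_fresh(results: dict, year: str = "2026") -> list:
    """Keep results whose snippet mentions the year and whose link is a PDF."""
    return [
        r for r in results.get("organic_results", [])
        if year in r.get("snippet", "") and is_pdf_url(r.get("link", ""))
    ]
```

A year string in the snippet is only a rough freshness signal; documents that mention the year without being published in it will still pass.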
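def fresh_pdfs(results, year='2026'):
    # Keep results whose snippet mentions the year and whose link is a PDF.
    # .get() avoids a KeyError on results without a link; lower() accepts .PDF.
    return [r for r in results.get('organic_results', [])
            if year in r.get('snippet', '') and r.get('link', '').lower().endswith('.pdf')]
Step 4: Extract PDF to text via Scavio extract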
PDF-aware extract returns markdown.
def pdf_to_text(url):
    r = requests.post('https://api.scavio.dev/api/v1/extract',
        headers={'x-api-key': API_KEY},
        json={'url': url, 'format': 'markdown'}).json()
    return r.get('markdown', '')
Step 5: LLM converts garbage text to typed JSON
Use a strict-schema prompt and reject the output if it doesn't parse as JSON.
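The reject step can be sketched independently of the model call. This is a minimal illustration, not a full schema validator: `parse_opportunity` and `REQUIRED_KEYS` are hypothetical names, the four keys are taken from the prompt used in this step, and value types are deliberately not checked.

```python
import json
from typing import Optional

# Keys the prompt in this step asks the model to return.
REQUIRED_KEYS = ("title", "deadline", "amount", "agency")


def parse_opportunity(raw: str) -> Optional[dict]:
    """Return the parsed record, or None if the reply is not a usable JSON object."""
    # Models sometimes wrap JSON in extra prose; take the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Reject anything that is not an object carrying every required key.
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None
    return data
```

Returning `None` instead of raising lets the caller skip a bad document and continue with the rest of the batch.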
import anthropic, json
client = anthropic.Anthropic()
def typed(md):
    msg = client.messages.create(model='claude-sonnet-4-6', max_tokens=600,
        messages=[{'role':'user','content':f'Extract opportunity details as JSON: title, deadline, amount, agency. Source: {md[:6000]}'}])
    return json.loads(msg.content[0].text)
Python Example
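The five steps can be wired into one daily run. The sketch below is an assumption about how that glue might look, not the original author's code: `run_pipeline` is a hypothetical name, and the step functions are passed in as arguments so the loop can be exercised without network access. The `source_url` key is likewise an illustrative addition.

```python
from typing import Callable, Iterable, List, Optional


def run_pipeline(
    dorks: Iterable[str],
    search: Callable[[str], dict],
    select: Callable[[dict], list],
    extract: Callable[[str], str],
    to_record: Callable[[str], Optional[dict]],
) -> List[dict]:
    """Search each dork, keep selected hits, extract text, and collect typed records."""
    records: List[dict] = []
    seen = set()  # URLs already processed, so each PDF is extracted at most once
    for query in dorks:
        for hit in select(search(query)):
            url = hit.get("link", "")
            if not url or url in seen:
                continue
            seen.add(url)
            text = extract(url)
            if not text:
                continue  # extraction returned nothing usable
            try:
                record = to_record(text)
            except ValueError:
                continue  # json.JSONDecodeError is a ValueError: reject unparseable output
            if record is not None:
                records.append({"source_url": url, **record})
    return records
```

With the functions defined in the walkthrough, a call could look like `run_pipeline(DORKS, dork, fresh_pdfs, pdf_to_text, typed)`.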
# Daily run: 5 dork searches + 5 × ~20 PDF extracts = ~105 calls = ~$0.45 on Project tier.
JavaScript Example
// TS version uses the same endpoints.
Expected Output
Government bid PDFs are converted to typed JSON daily. The cache layer keeps repeat queries under 50 ms.