Tutorial

How to Build a Google Dorks Fallback Pipeline with Scavio

An r/LangChain post described a Dorks+LLM+MCP pipeline for gov sites where Playwright kept breaking. This tutorial walks through the same pattern with Scavio.

An r/LangChain post described an autonomous data-as-a-service (DaaS) architecture for LATAM government sites: Google Dorks + Llama-3 + MCP, chosen because Playwright kept breaking on Cloudflare. This tutorial walks through the same pattern using Scavio's structured Google SERP.

Prerequisites

  • Scavio API key
  • An LLM (any)
  • A target domain (the 'site:' anchor)

Walkthrough

Step 1: Define the dork template

The templates combine standard Google search operators (site:, filetype:, intitle:, inurl:) with {domain} and {topic} placeholders.

Python
TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]
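One detail to watch when filling the templates: Google's intitle: operator binds only to the term that immediately follows it, so a multi-word topic needs quotes to be matched as a phrase. The sketch below renders the templates and quotes multi-word topics. `build_queries` is a hypothetical helper, not part of the original post, and quoting the topic everywhere is a design choice that also turns the plain {topic} slots into phrase matches.

```python
TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]

def build_queries(domain, topic, templates=TEMPLATES):
    """Render every template for one domain/topic pair (hypothetical helper)."""
    topic = topic.strip()
    # Quote multi-word topics so intitle: applies to the whole phrase,
    # not just the first word.
    if ' ' in topic and not topic.startswith('"'):
        topic = f'"{topic}"'
    return [tpl.format(domain=domain, topic=topic) for tpl in templates]

# example.gov is a placeholder domain.
for q in build_queries('example.gov', 'budget report'):
    print(q)
```

If phrase matching is too strict for the plain-keyword templates, quote only inside the intitle: template instead.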

Step 2: Run dorked queries via Scavio

Each dork is an ordinary search query; Scavio returns the SERP as JSON.

Python
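import os
import requests

# Auth header; expects SCAVIO_API_KEY in the environment.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def dork_search(domain, topic):
    """Run every template in TEMPLATES and collect the organic results."""
    results = []
    for tpl in TEMPLATES:
        q = tpl.format(domain=domain, topic=topic)
        resp = requests.post('https://api.scavio.dev/api/v1/search',
                             headers=H, json={'query': q}, timeout=30)
        resp.raise_for_status()  # fail loudly on auth or quota errors
        results.extend(resp.json().get('organic_results', []))
    return results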
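A single failed request should probably not discard the hits from the other dorks. The sketch below is one way to tolerate per-query failures and to record which dork produced each hit. `run_dorks` and `search_fn` are hypothetical names: `search_fn` is assumed to be any callable that takes a query string and returns a list of result dicts, such as a thin wrapper around the search request above.

```python
def run_dorks(queries, search_fn):
    """Run each query through search_fn, tolerating per-query failures.

    search_fn is a hypothetical callable: query string in, list of dicts out.
    Returns (hits, failed), where failed holds (query, error message) pairs.
    """
    hits, failed = [], []
    for q in queries:
        try:
            results = search_fn(q)
        except Exception as exc:  # e.g. network errors or malformed JSON
            failed.append((q, str(exc)))
            continue
        for r in results:
            # Tag each hit with the dork that surfaced it.
            hits.append({**r, 'dork': q})
    return hits, failed

# Stand-in for a real search call, used here only to exercise the loop.
def fake_search(q):
    if 'inurl' in q:
        raise RuntimeError('simulated timeout')
    return [{'link': 'https://example.gov/a.pdf', 'title': 'A'}]

hits, failed = run_dorks(
    ['site:example.gov filetype:pdf x', 'site:example.gov inurl:reports x'],
    fake_search,
)
print(len(hits), 'hits;', len(failed), 'failed')
```

Keeping the failed queries makes it straightforward to retry just those dorks later.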

Step 3: Dedupe by URL

The same URL often shows up under several templates; count it as one source.

Python
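def dedupe(results):
    """Keep the first result for each URL, preserving SERP order."""
    seen = set()
    out = []
    for r in results:
        link = r.get('link')  # skip entries that carry no URL
        if link and link not in seen:
            seen.add(link)
            out.append(r)
    return out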
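Exact string comparison treats https://Example.gov/a.pdf and https://example.gov/a.pdf#page=3 as different sources. A minimal sketch of a stricter variant, assuming it is safe to ignore host case, fragments, and trailing slashes on the target site; `normalize_url` and `dedupe_normalized` are hypothetical helpers built on the standard library's urllib.parse.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Comparison key: lowercase scheme and host, no fragment, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip('/') or '/'
    # The query string is kept because it may select a different document.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ''))

def dedupe_normalized(results):
    """Like a plain URL dedupe, but compares normalized URLs."""
    seen, out = set(), []
    for r in results:
        link = r.get('link')
        if not link:
            continue  # nothing to fetch later, so drop it
        key = normalize_url(link)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

rows = [
    {'link': 'https://example.gov/a.pdf'},
    {'link': 'https://EXAMPLE.gov/a.pdf#page=3'},
    {'link': 'https://example.gov/b.pdf'},
]
print([r['link'] for r in dedupe_normalized(rows)])
```

The original link is left untouched in each record; only the comparison key is normalized.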

Step 4: Extract clean markdown for top hits

Scavio's /extract endpoint converts each PDF or HTML page into markdown.

Python
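def extract(url):
    """Fetch one URL through /extract and return the parsed JSON response."""
    resp = requests.post('https://api.scavio.dev/api/v1/extract',
        headers=H, json={'url': url, 'format': 'markdown'}, timeout=60)
    resp.raise_for_status()
    return resp.json()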
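The step title says "top hits", so some selection has to happen before extraction, and long documents may need clipping before they reach the LLM. The sketch below shows one possible approach. Both helpers are hypothetical, and the response field name 'markdown' is an assumption: the source does not show the shape of the /extract response, so check it against a real call.

```python
def top_hits(results, k=3):
    """Keep the first k results that have a link; SERP order is treated as rank."""
    return [r for r in results if r.get('link')][:k]

def markdown_from(response, field='markdown', max_chars=20000):
    """Pull the markdown text out of an extract response and clip it.

    ASSUMPTION: 'markdown' as the field name is a guess, not a documented key.
    """
    text = response.get(field) or ''
    if len(text) > max_chars:
        # Clip so very long PDFs do not overflow the model's context window.
        text = text[:max_chars] + '\n[truncated]'
    return text

results = [{'link': f'https://example.gov/doc{i}.pdf'} for i in range(5)]
print([r['link'] for r in top_hits(results, k=2)])
print(markdown_from({'markdown': 'x' * 30}, max_chars=10))
```

A character limit is a crude proxy for tokens; a tokenizer-based limit would be more precise if the LLM client exposes one.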

Step 5: LLM extraction step

Pass the extracted markdown to the LLM together with a structured-extraction prompt.

Python
PROMPT = '''Extract from this document: title, date, summary (3 sentences), key entities (list).
Document:
{md}
---
Return JSON: {{"title": ..., "date": ..., "summary": ..., "entities": [...]}}'''
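# `llm` stands for whatever client you use and `markdown` for the text from Step 4;
# swap in your client's completion call.
result = llm.complete(PROMPT.format(md=markdown))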
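Models do not always reply with bare JSON; the object may arrive wrapped in a code fence or a sentence of chatter. A minimal sketch of a tolerant parser for the reply, assuming the four keys requested in the prompt; `parse_record` is a hypothetical helper and returns None when the reply is unusable.

```python
import json
import re

REQUIRED = ('title', 'date', 'summary', 'entities')

def parse_record(raw):
    """Parse the model's reply into a dict, or return None if it is unusable."""
    # Take the span from the first '{' to the last '}' to skip surrounding text.
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED):
        return None
    # Coerce a single entity into a list so callers can always iterate.
    if not isinstance(data['entities'], list):
        data['entities'] = [data['entities']]
    return data

reply = 'Sure: {"title": "T", "date": null, "summary": "S", "entities": "Ministry"} Done.'
print(parse_record(reply))
```

Returning None instead of raising lets the pipeline skip a bad document and move on; logging the raw reply at that point helps with prompt tuning.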

Python Example

Python
# Per gov-doc pipeline: 4 dorked searches + 1 extract + 1 LLM call = ~$0.025-0.05
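The note above counts four searches, one extract, and one LLM call per document. A minimal sketch of how the steps might be wired together so that those counts hold, with every stage injected as a callable. `run_pipeline`, `search_fn`, `extract_fn`, and `llm_fn` are hypothetical names, not Scavio APIs, and the stubs below stand in for real network calls so the flow can be checked offline.

```python
TEMPLATES = [
    'site:{domain} filetype:pdf {topic}',
    'site:{domain} intitle:{topic}',
    'site:{domain} inurl:reports {topic}',
    'site:{domain} {topic} 2026',
]

def run_pipeline(domain, topic, search_fn, extract_fn, llm_fn, top_k=1):
    """Dork search -> dedupe -> extract -> LLM, each stage passed in as a callable."""
    seen, hits = set(), []
    for tpl in TEMPLATES:
        for r in search_fn(tpl.format(domain=domain, topic=topic)):
            link = r.get('link')
            if link and link not in seen:  # dedupe across templates
                seen.add(link)
                hits.append(r)
    records = []
    for r in hits[:top_k]:  # only the top hits are extracted and summarized
        markdown = extract_fn(r['link'])
        records.append({'url': r['link'], 'llm_output': llm_fn(markdown)})
    return records

# Offline stubs that count calls instead of hitting any API.
calls = {'search': 0, 'extract': 0, 'llm': 0}

def fake_search(query):
    calls['search'] += 1
    return [{'link': 'https://example.gov/report.pdf'}]

def fake_extract(url):
    calls['extract'] += 1
    return '# Report\nPlaceholder text.'

def fake_llm(markdown):
    calls['llm'] += 1
    return '{"title": "Report"}'

records = run_pipeline('example.gov', 'budget', fake_search, fake_extract, fake_llm)
print(calls)
```

Raising top_k adds one extract and one LLM call per extra document, while the number of searches stays fixed at the number of templates.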

JavaScript Example

JavaScript
// Same pipeline in TS.

Expected Output

The pipeline yields structured records (title, date, summary, entities) for indexed government documents and skips the Playwright/Cloudflare fight entirely. Limitation: it only works on pages that are publicly indexed.

Frequently Asked Questions

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

You need a Scavio API key, an LLM (any), and a target domain (the 'site:' anchor). A Scavio API key gives you 500 free credits per month.

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt to your framework of choice.
