Tutorial

How to Build a Google Dorks + LLM Extraction Pipeline

Combine Google Dorks search with LLM extraction to turn PDFs and government portals into typed JSON. Pattern from r/LangChain's DaaS build.

A post on r/LangChain shared an autonomous DaaS architecture built on Google Dorks, Llama-3, and MCP. The pattern works for any structured-document discovery job. This tutorial walks through the same flow using Scavio.

Prerequisites

  • Python 3.10+
  • Scavio API key
  • Groq or Anthropic API key

Walkthrough

Step 1: Dork patterns for the target

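Each dork combines a site: operator, a filetype: operator, and one or more keywords, so the search is scoped to one domain and one document type.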

Python
DORKS = ['site:gov.br filetype:pdf 2026 contratos', 'site:europa.eu filetype:pdf AI act']
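The three-part pattern can also be generated instead of hand-written. The following is a minimal sketch, assuming a hypothetical helper named `build_dork` (it is not part of Scavio or any library) that simply joins the operators with spaces:

```python
def build_dork(site, filetype, keywords):
    """Compose a 'site:... filetype:... keywords' query string.

    build_dork is a hypothetical helper used here for illustration only.
    """
    parts = [f"site:{site}", f"filetype:{filetype}"]
    parts.extend(keywords)
    return " ".join(parts)


# Reproduces the two hand-written entries in DORKS.
GENERATED = [
    build_dork("gov.br", "pdf", ["2026", "contratos"]),
    build_dork("europa.eu", "pdf", ["AI", "act"]),
]
```

Keeping the site, file type, and keywords as separate arguments makes it easier to sweep one dimension (for example, several domains) without retyping the whole query.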

Step 2: Run the dork via Scavio search

The search call returns organic results, some of which link directly to PDFs.

Python
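import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']

def dork(q):
    # POST the dork query to the search endpoint and return the parsed JSON body.
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY}, json={'query': q}, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json()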
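Several dorks often surface the same document, so it can help to merge their results before filtering. This is a rough sketch, assuming the response carries an `organic_results` list whose items have a `link` field (the shape the filter in Step 3 relies on). The name `collect` and its `search_fn` parameter are hypothetical:

```python
def collect(queries, search_fn):
    """Run each query through search_fn and merge the hits.

    Keeps the first occurrence of every link and returns the same
    {'organic_results': [...]} shape so later steps can consume it.
    """
    seen = set()
    merged = []
    for q in queries:
        response = search_fn(q)
        for item in response.get("organic_results", []):
            link = item.get("link")
            if link and link not in seen:
                seen.add(link)
                merged.append(item)
    return {"organic_results": merged}
```

With the functions from this tutorial, the call would look like `collect(DORKS, dork)`. Passing the search function in as an argument also lets you try the merge logic against a stub without spending credits.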

Step 3: Filter for fresh PDFs

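Keep only results that link to a PDF and mention the target year in the snippet. An LLM screening pass is an alternative to this simple date filter.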

Python
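def fresh_pdfs(results, year='2026'):
    # Keep hits that mention the year and link to a .pdf; .get() avoids a KeyError on a missing link.
    return [r for r in results.get('organic_results', [])
            if year in r.get('snippet', '') and r.get('link', '').lower().endswith('.pdf')]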
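A plain substring check has two blind spots: PDF links that carry a query string, and year digits buried inside a longer number. The sketch below is one possible stricter date filter. It assumes results may also carry a `title` field, which the tutorial does not confirm, and the function names are hypothetical:

```python
import re
from urllib.parse import urlparse


def is_pdf_url(url):
    """True when the URL path (ignoring query string and fragment) ends in .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")


def mentions_year(text, year):
    """True when year appears as a standalone number, not inside a longer digit run."""
    return re.search(rf"(?<!\d){re.escape(str(year))}(?!\d)", text) is not None


def fresh_pdfs_strict(results, year="2026"):
    """Keep results that link to a PDF and mention the year in title or snippet."""
    kept = []
    for r in results.get("organic_results", []):
        # 'title' is an assumed field; a missing one falls back to an empty string.
        text = f"{r.get('title', '')} {r.get('snippet', '')}"
        if is_pdf_url(r.get("link", "")) and mentions_year(text, year):
            kept.append(r)
    return kept
```

A snippet that mentions the year is still only a heuristic for freshness, so an LLM screening pass may remain worthwhile for borderline hits.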

Step 4: Extract PDF to text via Scavio extract

The extract endpoint is PDF-aware and returns the document as markdown.

Python
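def pdf_to_text(url):
    # Ask the extract endpoint for a markdown rendering of the PDF.
    resp = requests.post('https://api.scavio.dev/api/v1/extract',
        headers={'x-api-key': API_KEY},
        json={'url': url, 'format': 'markdown'}, timeout=60)
    resp.raise_for_status()
    # An empty string signals that no markdown came back.
    return resp.json().get('markdown', '')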

Step 5: LLM converts garbage text to typed JSON

Use a strict-schema prompt and reject any reply that does not parse as JSON.

Python
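import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def typed(md):
    msg = client.messages.create(model='claude-sonnet-4-6', max_tokens=600,
        messages=[{'role':'user','content':'Extract opportunity details as a JSON object with keys title, deadline, amount, agency. Reply with JSON only. '
            f'Source: {md[:6000]}'}])
    try:
        return json.loads(msg.content[0].text)
    except json.JSONDecodeError:
        return None  # reject replies that do not parse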

Python Example

Python
# Daily run: 5 dork searches + ~20 PDF extracts per dork (~100) = ~105 calls, about $0.45 on the Project tier.
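One reading of the estimate is one search call per dork plus one extract call per PDF, which gives 5 + 5 × 20 = 105. The sketch below encodes that arithmetic and chains the five steps. `daily_calls` and `run_daily` are hypothetical names, and the step functions are passed in as parameters so the loop can be exercised with stubs:

```python
def daily_calls(num_dorks, pdfs_per_dork):
    """One search call per dork plus one extract call per PDF."""
    return num_dorks + num_dorks * pdfs_per_dork


def run_daily(dorks, search_fn, filter_fn, extract_fn, parse_fn):
    """Chain search, filter, extract, and parse; skip PDFs that yield nothing usable."""
    records = []
    for q in dorks:
        for hit in filter_fn(search_fn(q)):
            text = extract_fn(hit["link"])
            if not text:
                continue  # extraction returned no text
            try:
                record = parse_fn(text)
            except ValueError:
                # json.JSONDecodeError is a ValueError subclass: treat as a rejection.
                continue
            if record is not None:
                records.append({"source": hit["link"], "data": record})
    return records
```

With the functions defined in the walkthrough, a run would be `run_daily(DORKS, dork, fresh_pdfs, pdf_to_text, typed)`. The call count covers Scavio requests only, so LLM usage is billed separately by the model provider.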

JavaScript Example

JavaScript
// A JavaScript or TypeScript version calls the same endpoints.

Expected Output

Government bid PDFs are converted to typed JSON daily. The cache layer serves repeat queries in under 50 ms.

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (the free tier works) and a working Python or JavaScript environment.

What do I need before starting?

Python 3.10+, a Scavio API key, and a Groq or Anthropic API key. A Scavio API key gives you 500 free credits per month.

Can I complete it on the free tier?

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does this work with my framework?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.
