Google Dorks + LLM Extraction Pipeline (2026)

An r/LangChain post shared an autonomous DaaS architecture built on Google Dorks, Llama-3, and MCP. The pattern works for any structured-document discovery job. This tutorial walks through the same flow on Scavio.

Prerequisites

Python 3.10+
Scavio API key
Groq or Anthropic API key

Walkthrough

Step 1: Dork patterns for the target

Each dork combines a site: operator, a filetype: operator, and one or more keywords.

Python

DORKS = ['site:gov.br filetype:pdf 2026 contratos', 'site:europa.eu filetype:pdf AI act']
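When the list of targets grows, it can help to compose dorks from parts instead of hand-writing each string. The helper below is a minimal sketch; `build_dork` is a hypothetical name, not part of Scavio or any library, and quoting multi-word keywords as exact phrases is an assumption about what you want to match.

```python
def build_dork(site, filetype, *keywords):
    """Compose a dork string from a site, a file type, and free keywords."""
    parts = [f"site:{site}", f"filetype:{filetype}"]
    # Quote multi-word keywords so the search engine treats them as a phrase.
    parts += [f'"{k}"' if " " in k else k for k in keywords]
    return " ".join(parts)


DORKS = [
    build_dork("gov.br", "pdf", "2026", "contratos"),
    build_dork("europa.eu", "pdf", "AI act"),
]
```

Drop the quoting line if you prefer loose keyword matching, which is what the hand-written list above does.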

Step 2: Run the dork via Scavio search

Send each dork to the Scavio search endpoint. The response contains organic results that point to PDFs.

Python

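import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']

def dork(q):
    # POST the dork as a regular search query; raise on HTTP errors instead of parsing an error body.
    r = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY}, json={'query': q}, timeout=30)
    r.raise_for_status()
    return r.json()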
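Several dorks often surface the same PDF, so it is worth merging results before paying for extraction. The sketch below assumes the response shape used elsewhere in this tutorial (an `organic_results` list whose items carry a `link`). `collect_results` is a hypothetical helper, and the search function is passed in so the merge logic can be tried without network access.

```python
def collect_results(dorks, search):
    """Run each dork through `search` and merge organic results, dropping repeat links."""
    seen, merged = set(), []
    for q in dorks:
        for r in search(q).get("organic_results", []):
            link = r.get("link")
            if link and link not in seen:
                seen.add(link)
                merged.append(r)
    # Return the same shape as a single search response so later steps work unchanged.
    return {"organic_results": merged}
```

In the pipeline this would be called as `collect_results(DORKS, dork)`.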

Step 3: Filter for fresh PDFs

Keep only recent PDFs, using either a date filter or LLM screening. The filter below checks each snippet for the target year.

Python

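def fresh_pdfs(results, year='2026'):
    # Keep results whose snippet mentions the year and whose link ends in .pdf (any case).
    return [r for r in results.get('organic_results', [])
            if year in r.get('snippet', '') and r.get('link', '').lower().endswith('.pdf')]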
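A plain substring and suffix check can misfire: a link such as `.../edital.pdf?download=1` does not end in `.pdf`, and `2026` also matches inside a longer number. The helpers below are a stricter sketch using only the standard library; `is_pdf_link` and `mentions_year` are hypothetical names that you could swap into `fresh_pdfs`.

```python
import re
from urllib.parse import urlparse


def is_pdf_link(url):
    """True when the URL path ends in .pdf, ignoring case, query strings, and fragments."""
    return urlparse(url).path.lower().endswith(".pdf")


def mentions_year(text, year="2026"):
    """True when `year` appears as a standalone number, not inside a longer digit run."""
    return re.search(rf"(?<!\d){re.escape(year)}(?!\d)", text) is not None
```

Neither check confirms that the server actually returns a PDF or that the document was published in that year; they only reduce obvious false matches before extraction.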

Step 4: Extract PDF to text via Scavio extract

Scavio's extract endpoint is PDF-aware and returns the document as markdown.

Python

def pdf_to_text(url):
    # Ask the extract endpoint for a markdown rendering of the PDF.
    r = requests.post('https://api.scavio.dev/api/v1/extract',
        headers={'x-api-key': API_KEY},
        json={'url': url, 'format': 'markdown'}, timeout=60)
    r.raise_for_status()
    return r.json().get('markdown', '')
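Extracted PDF text tends to carry runs of spaces and blank lines, and the next step only sends the first 6,000 characters to the model. A small clean-up pass can make that window go further. This is a sketch under those assumptions; `clean_markdown` is a hypothetical helper, and the 6,000-character default simply mirrors the slice used in Step 5.

```python
import re


def clean_markdown(md, limit=6000):
    """Collapse whitespace noise and cut at a paragraph boundary under `limit` characters."""
    md = re.sub(r"[ \t]+", " ", md)             # runs of spaces and tabs
    md = re.sub(r"\n{3,}", "\n\n", md).strip()  # runs of blank lines
    if len(md) <= limit:
        return md
    # Prefer to cut at the last paragraph break inside the window.
    cut = md.rfind("\n\n", 0, limit)
    return md[:cut] if cut > 0 else md[:limit]
```

Cutting at a paragraph break avoids handing the model a sentence that stops mid-word, at the cost of sometimes sending slightly less text.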

Step 5: LLM converts garbage text to typed JSON

Use a strict-schema prompt and reject any reply that does not parse as JSON.

Python

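import anthropic, json
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def typed(md):
    # Ask for JSON only, so the reply can be parsed directly.
    msg = client.messages.create(model='claude-sonnet-4-6', max_tokens=600,
        messages=[{'role':'user','content':f'Extract opportunity details as a JSON object with keys title, deadline, amount, agency. Reply with JSON only. Source: {md[:6000]}'}])
    try:
        return json.loads(msg.content[0].text)
    except json.JSONDecodeError:
        return None  # reject replies that do not parse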
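Model replies sometimes wrap the JSON in extra prose or a Markdown code fence, and a reply that parses may still be missing fields. The sketch below shows one way to enforce the four-field schema named in the prompt. `parse_opportunity` and `REQUIRED` are hypothetical names, and the rule of returning `None` for anything incomplete is an assumption about how strict you want to be.

```python
import json
import re

REQUIRED = ("title", "deadline", "amount", "agency")


def parse_opportunity(text):
    """Return a dict with the four expected fields, or None if the reply is unusable."""
    # Take everything from the first "{" to the last "}" to skip surrounding prose.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED):
        return None
    # Drop any extra keys the model added.
    return {k: data[k] for k in REQUIRED}
```

You could call this on `msg.content[0].text` in place of the bare `json.loads` call.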

Python Example

Python

# Daily run: 5 search calls + ~100 extract calls (5 dorks × ~20 PDFs each) = ~105 Scavio calls, about $0.45 on the Project tier.
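The five steps can be chained into one daily job. The function below is a sketch of that glue, not a definitive implementation: `daily_run` is a hypothetical name, and each stage is passed in as a callable so the flow can be exercised with stubs before spending credits. It treats both a `None` return and a `ValueError` (which includes `json.JSONDecodeError`) from the LLM step as a rejected document.

```python
def daily_run(dorks, search, keep, extract, to_json):
    """Run every dork through search -> filter -> extract -> LLM and collect parsed records."""
    records = []
    for q in dorks:
        for hit in keep(search(q)):
            text = extract(hit["link"])
            if not text:
                continue  # extraction returned nothing usable
            try:
                record = to_json(text)
            except ValueError:
                continue  # reply did not parse as JSON
            if isinstance(record, dict):
                # Keep the source URL next to the extracted fields.
                records.append({"source": hit["link"], **record})
    return records
```

With the functions from the walkthrough, the call would be `daily_run(DORKS, dork, fresh_pdfs, pdf_to_text, typed)`.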

JavaScript Example

JavaScript

// TS version uses the same endpoints.

Expected Output

JSON

Government bid PDFs converted to typed JSON daily. Cache layer keeps repeat queries at sub-50ms.
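The source does not say where the cache layer lives. If you want a similar effect in your own process, a small in-memory cache with a time-to-live is one option. The class below is a sketch under that assumption; `TTLCache` is a hypothetical name, entries are lost on restart, and nothing here reflects how Scavio caches on its side.

```python
import time


class TTLCache:
    """Tiny in-memory cache: entries expire `ttl` seconds after they are stored."""

    def __init__(self, ttl=86400, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def get_or_call(self, key, fn):
        """Return the cached value for `key`, calling `fn(key)` only on a miss or expiry."""
        hit = self._store.get(key)
        now = self.clock()
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fn(key)
        self._store[key] = (now, value)
        return value
```

For example, `cache.get_or_call(url, pdf_to_text)` would extract each PDF at most once per day with the default `ttl`.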

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (the free tier works) and a working Python or JavaScript environment.

What do I need before starting?

Python 3.10+, a Scavio API key, and a Groq or Anthropic API key. A Scavio API key comes with 50 free credits on signup.

Can I run this tutorial with the free tier?

Yes. The free tier includes 50 credits on signup, which is more than enough to complete this tutorial and prototype a working solution.

What frameworks does this work with?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.
