An r/LangChain post documented an autonomous DaaS architecture: discovery with Google Dorks, transformation with Llama-3, and serving over MCP backed by a SQLite cache. This tutorial walks through the same architecture on Scavio.
Prerequisites
- Python 3.10+
- LangChain
- Scavio API key
- SQLite (built-in)
Walkthrough
Step 1: Dorks list
Define the discovery queries.
DORKS = [
    'site:gov.br filetype:pdf 2026 contratos',
    'site:europa.eu filetype:pdf AI Act',
    'site:sec.gov filetype:pdf 10-K 2026',
]
Step 2: Discovery via Scavio /search
Run each dork.
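A minimal sketch of that loop, assuming the search call is passed in as a function (for example the discover function defined in this step). The names discover_all and pause_s are hypothetical, and the one-second pause is an arbitrary courtesy delay, not a documented Scavio rate limit.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def discover_all(
    dorks: Iterable[str],
    search: Callable[[str], dict],
    pause_s: float = 1.0,  # assumption: arbitrary delay between calls
) -> Dict[str, Optional[dict]]:
    """Run every dork through `search` and return {dork: response or None}."""
    results: Dict[str, Optional[dict]] = {}
    for i, query in enumerate(dorks):
        if i and pause_s > 0:
            time.sleep(pause_s)  # space out consecutive requests
        try:
            results[query] = search(query)
        except Exception as exc:  # one failed dork should not stop the run
            print(f"discovery failed for {query!r}: {exc}")
            results[query] = None
    return results
```

Keeping failed dorks in the result as None makes it possible to retry only those queries on the next run.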
import os, requests
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
def discover(q):
    return requests.post('https://api.scavio.dev/api/v1/search', headers=H, json={'query': q}).json()
Step 3: PDF extraction via /extract
Call /extract once per discovered URL.
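Discovery responses have to be reduced to a list of URLs before extraction. The shape of the /search response is not shown in this tutorial, so the sketch below assumes a hypothetical `results` list whose entries carry a `url` key; adjust the key names to the real payload.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlparse


def unique_urls(responses: Iterable[Optional[dict]]) -> List[str]:
    """Collect http(s) URLs from search responses, first occurrence wins."""
    seen = set()
    ordered: List[str] = []
    for resp in responses:
        # ASSUMPTION: each response looks like {"results": [{"url": "..."}]}
        for item in (resp or {}).get("results", []):
            url = item.get("url", "")
            if urlparse(url).scheme not in ("http", "https"):
                continue  # skip empty or malformed entries
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered
```

Deduplicating here means the same PDF surfaced by two different dorks is only extracted once.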
def fetch(url):
    return requests.post('https://api.scavio.dev/api/v1/extract', headers=H, json={'url': url, 'format': 'markdown'}).json()
Step 4: LLM transformation
Llama-3 (or any LLM) converts markdown to typed JSON.
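Whatever model is used, its reply should be checked before it is cached, since models sometimes wrap the JSON in extra prose or drop a field. The following is a minimal validation sketch rather than the original post's implementation. The helper name parse_item is hypothetical, and the five required keys are taken from the prompt used in this step.

```python
import json

# Field names taken from the tutorial's prompt.
REQUIRED = ("title", "jurisdiction", "deadline", "summary", "risk_level")


def parse_item(raw: str) -> dict:
    """Parse a model reply into the five-field record or raise ValueError."""
    # Tolerate prose around the object by slicing from the first "{"
    # to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in model reply: {exc}") from exc
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Drop any extra keys so cached records share one shape.
    return {key: data[key] for key in REQUIRED}
```

Raising on a bad reply, instead of caching a partial record, lets the caller retry the transformation or skip the document.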
# Prompt: 'Extract a strict JSON: {title, jurisdiction, deadline, summary, risk_level}.'
# Use Groq for cheap Llama-3, or Anthropic Sonnet for quality.
Step 5: SQLite cache layer
Cache each transformed item by URL so that repeat lookups return in under 50 ms.
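A cache also needs a read path and some rule for when an entry is too old to trust. The sketch below is self-contained and uses an in-memory database with the same three-column table as this step. The function names, the max_age_s parameter, and the one-day default are assumptions for illustration, not part of the original architecture.

```python
import json
import sqlite3
import time
from typing import Optional


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache with the url / payload / ts schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items("
        "url TEXT PRIMARY KEY, payload TEXT, ts REAL)"
    )
    return conn


def cache_put(conn, url: str, payload: dict, now: Optional[float] = None) -> None:
    """Insert or overwrite one record, stamping it with the write time."""
    ts = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO items VALUES (?, ?, ?)",
        (url, json.dumps(payload), ts),
    )
    conn.commit()


def cache_get(conn, url: str, max_age_s: float = 86400.0,
              now: Optional[float] = None) -> Optional[dict]:
    """Return the cached record, or None if it is absent or older than max_age_s."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT payload, ts FROM items WHERE url = ?", (url,)
    ).fetchone()
    if row is None or now - row[1] > max_age_s:
        return None
    return json.loads(row[0])
```

The optional now argument exists only to make the freshness rule easy to test; production callers can leave it out.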
import sqlite3, json, time
conn = sqlite3.connect('daas.db')
conn.execute('CREATE TABLE IF NOT EXISTS items(url TEXT PRIMARY KEY, payload TEXT, ts REAL)')
def cache_set(url, payload):
    conn.execute('INSERT OR REPLACE INTO items VALUES (?, ?, ?)', (url, json.dumps(payload), time.time()))
    conn.commit()
Step 6: Serve via MCP for downstream agents
Wrap the cache in a FastMCP server.
# from fastmcp import FastMCP
# mcp = FastMCP('daas')
# @mcp.tool()
# def get_item(url: str) -> dict:
#     row = conn.execute('SELECT payload FROM items WHERE url=?', (url,)).fetchone()
#     return json.loads(row[0]) if row else {}
Python Example
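One way to wire the steps together is a single function that takes each stage as a callable, so the same loop can run against the real API calls or against test doubles. This is a sketch under that assumption; run_pipeline, urls_from, and the stats dictionary are hypothetical names, not part of Scavio or the original post.

```python
from typing import Callable, Dict, Iterable, Optional


def run_pipeline(
    dorks: Iterable[str],
    discover: Callable[[str], dict],
    urls_from: Callable[[dict], Iterable[str]],
    fetch: Callable[[str], object],
    transform: Callable[[object], dict],
    cache_get: Callable[[str], Optional[dict]],
    cache_set: Callable[[str, dict], None],
) -> Dict[str, int]:
    """Discover, fetch, transform, and cache; return simple run counters."""
    stats = {"cached": 0, "stored": 0, "failed": 0}
    for query in dorks:
        for url in urls_from(discover(query)):
            if cache_get(url) is not None:
                stats["cached"] += 1  # already processed, skip the paid calls
                continue
            try:
                cache_set(url, transform(fetch(url)))
                stats["stored"] += 1
            except Exception as exc:  # keep going if one document fails
                print(f"skipping {url}: {exc}")
                stats["failed"] += 1
    return stats
```

Checking the cache before fetching keeps a daily run from paying for extraction and LLM calls on documents it has already stored. For the 4 AM schedule described in this tutorial, the corresponding cron expression is `0 4 * * *`.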
# Wrap discover + fetch + transform + cache in a daily cron.
# Downstream CrewAI / LangChain agents query the MCP for sub-50ms typed JSON.
JavaScript Example
// Same architecture in TS with better-sqlite3 and the MCP TS SDK.
Expected Output
A daily 4 AM cron job runs the dorks, fetches the PDFs, transforms them to typed JSON, and caches the results in SQLite. Downstream agents read from the cache in under 50 ms instead of running real-time scrapers.