Tutorial

How to Use Common Crawl with Live Search in a Hybrid Pipeline

Combine Common Crawl historical data with live search API results. Get massive scale from Common Crawl and freshness from live search.

Common Crawl provides petabytes of free web archive data, but each crawl is a snapshot that is weeks to months old and requires heavy processing. Live search APIs return fresh results instantly but charge per query. This tutorial combines both into a hybrid pipeline: Common Crawl for bulk historical analysis, Scavio for real-time freshness. The hybrid approach gives you the scale of Common Crawl with the currency of live search at minimal cost.

Prerequisites

  • Python 3.9+ installed
  • The requests library installed
  • A Scavio API key from scavio.dev
  • A basic understanding of the Common Crawl index
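
Before starting, you can verify the environment with a quick check. This sketch assumes the API key is exported as the SCAVIO_API_KEY environment variable, which is how every later example reads it:

Python
import os
import sys

# requests is the only third-party dependency; install it with `pip install requests`
try:
    import requests
except ImportError:
    sys.exit('requests is not installed; run: pip install requests')

# All examples read the key from this environment variable
if 'SCAVIO_API_KEY' not in os.environ:
    sys.exit('Set SCAVIO_API_KEY before running the examples')

print(f'Ready: Python {sys.version_info.major}.{sys.version_info.minor}, requests {requests.__version__}')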

Walkthrough

Step 1: Query the Common Crawl Index API

Search Common Crawl's CDX index to find historical web pages matching your query. The index is free to query and returns URLs of archived pages.

Python
import json
import requests

CC_INDEX_URL = 'https://index.commoncrawl.org/CC-MAIN-2026-04-index'

def search_common_crawl(domain: str, limit: int = 20) -> list:
    """Search the Common Crawl CDX index for pages from a domain."""
    resp = requests.get(CC_INDEX_URL,
        params={'url': f'{domain}/*', 'output': 'json', 'limit': limit},
        timeout=30)
    if resp.status_code != 200:
        print(f'Common Crawl returned {resp.status_code}')
        return []
    results = []
    # The CDX index returns one JSON object per line, not a JSON array
    for line in resp.text.strip().split('\n'):
        try:
            record = json.loads(line)
            results.append({
                'url': record.get('url', ''),
                'timestamp': record.get('timestamp', ''),
                'status': record.get('status', ''),
                'mime': record.get('mime', ''),
                'length': record.get('length', 0),
            })
        except json.JSONDecodeError:
            continue
    return results

# Search for pages from a domain
pages = search_common_crawl('docs.python.org')
print(f'Found {len(pages)} archived pages from docs.python.org')
for p in pages[:5]:
    print(f'  [{p["timestamp"]}] {p["url"][:60]}')
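
One caveat: CC_INDEX_URL pins the examples to the CC-MAIN-2026-04 crawl, and each crawl is a separate index. Common Crawl publishes the list of available indexes at index.commoncrawl.org/collinfo.json, so you can resolve the newest one at runtime instead of hardcoding it. A minimal sketch (crawl IDs like CC-MAIN-2026-04 sort lexicographically by year and week, which is what the max() relies on):

Python
def latest_cc_index() -> str:
    """Look up the CDX endpoint of the newest Common Crawl index."""
    resp = requests.get('https://index.commoncrawl.org/collinfo.json', timeout=30)
    resp.raise_for_status()
    crawls = resp.json()  # one entry per crawl, each with an 'id' and a 'cdx-api' URL
    return max(crawls, key=lambda c: c['id'])['cdx-api']

# Swap the hardcoded constant for the freshest index
CC_INDEX_URL = latest_cc_index()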

Step 2: Build the hybrid search function

Combine Common Crawl data with live search results. Use Common Crawl for historical coverage and live search for fresh results.

Python
import os
from typing import Optional

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']

def live_search(query: str, count: int = 5) -> list:
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
        json={'query': query, 'country_code': 'us', 'num_results': count},
        timeout=30)
    resp.raise_for_status()
    return [{'url': r['link'], 'title': r['title'],
             'snippet': r.get('snippet', ''), 'source': 'live_search'}
            for r in resp.json().get('organic_results', [])]

def hybrid_search(query: str, domain: Optional[str] = None) -> dict:
    """Combine Common Crawl historical data with live results."""
    results = {'historical': [], 'live': [], 'cost': 0}
    # Step 1: Check Common Crawl for historical data (free)
    if domain:
        cc_results = search_common_crawl(domain, limit=10)
        results['historical'] = [{
            'url': r['url'], 'timestamp': r['timestamp'],
            'source': 'common_crawl'
        } for r in cc_results]
    # Step 2: Get fresh live results ($0.005 per query)
    results['live'] = live_search(query, count=5)
    results['cost'] = 0.005
    results['total'] = len(results['historical']) + len(results['live'])
    return results

result = hybrid_search('python 3.14 new features', domain='docs.python.org')
print(f'Historical (Common Crawl): {len(result["historical"])} pages')
print(f'Live (Scavio): {len(result["live"])} results')
print(f'Total: {result["total"]} | Cost: ${result["cost"]}')
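
The two result sets can overlap: a stable page often shows up in both the archive and the live results. A small merge step that deduplicates by URL, letting live hits win, keeps downstream processing simple. This is a minimal sketch; the only normalization is trailing-slash stripping, and a production pipeline might also lowercase hosts and strip tracking parameters:

Python
def merge_results(results: dict) -> list:
    """Deduplicate hybrid results, preferring fresh live hits over archived ones."""
    seen, merged = set(), []
    # Live results come first so they win ties against historical URLs
    for item in results['live'] + results['historical']:
        key = item['url'].rstrip('/')
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged

merged = merge_results(result)
print(f'{len(merged)} unique URLs after merging')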

Step 3: Analyze coverage gaps between historical and live data

Compare what Common Crawl has versus what live search returns. This reveals content that exists but is not in the archive, and archived content that may have been removed.

Python
def analyze_coverage(domain: str, queries: list) -> dict:
    """Compare Common Crawl's inventory of a domain against live search results."""
    # Get Common Crawl inventory (free)
    cc_urls = {r['url'] for r in search_common_crawl(domain, limit=50)}
    # Get live search URLs for the domain
    live_urls = set()
    for query in queries:
        live = live_search(f'site:{domain} {query}', count=5)
        live_urls.update(r['url'] for r in live)
    # Analyze overlap
    both = cc_urls & live_urls
    cc_only = cc_urls - live_urls
    live_only = live_urls - cc_urls
    print(f'Domain: {domain}')
    print(f'Common Crawl pages: {len(cc_urls)}')
    print(f'Live search pages: {len(live_urls)}')
    print(f'In both: {len(both)}')
    print(f'CC only (may be removed): {len(cc_only)}')
    print(f'Live only (new content): {len(live_only)}')
    print(f'Cost: {len(queries)} live searches = ${len(queries) * 0.005:.3f}')
    return {'cc_only': cc_only, 'live_only': live_only, 'both': both}

coverage = analyze_coverage('docs.python.org', ['asyncio', 'pathlib', 'dataclasses'])
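
To see what a cc_only page actually contained, you can retrieve the archived record itself. Each raw CDX row includes filename, offset, and length fields that point into a WARC file hosted at data.commoncrawl.org, and an HTTP Range request fetches just that slice. A sketch of the idea; note that search_common_crawl above would need to keep those three fields instead of dropping them:

Python
import gzip

def fetch_archived_page(record: dict) -> str:
    """Fetch one archived page body from Common Crawl's WARC storage."""
    # 'filename', 'offset', and 'length' come straight from the raw CDX record
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    resp = requests.get(f'https://data.commoncrawl.org/{record["filename"]}',
        headers={'Range': f'bytes={start}-{end}'}, timeout=60)
    resp.raise_for_status()
    # The slice is an independently gzipped WARC record: WARC headers,
    # then HTTP headers, then the page body, separated by blank lines
    warc = gzip.decompress(resp.content).decode('utf-8', errors='replace')
    return warc.split('\r\n\r\n', 2)[-1]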

Python Example

Python
import requests, os, json

SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
CC_INDEX = 'https://index.commoncrawl.org/CC-MAIN-2026-04-index'

def hybrid_search(query, domain=None):
    results = []
    # Common Crawl (free)
    if domain:
        cc = requests.get(CC_INDEX, params={'url': f'{domain}/*', 'output': 'json', 'limit': 10}, timeout=30)
        for line in cc.text.strip().split('\n'):
            try:
                r = json.loads(line)
                results.append({'url': r['url'], 'source': 'common_crawl'})
            except json.JSONDecodeError:
                continue
    # Live search ($0.005)
    live = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
        json={'query': query, 'country_code': 'us', 'num_results': 5},
        timeout=30)
    for r in live.json().get('organic_results', []):
        results.append({'url': r['link'], 'title': r['title'], 'source': 'live'})
    print(f'{len(results)} total results (cost: $0.005)')
    return results

hybrid_search('python asyncio', 'docs.python.org')

JavaScript Example

JavaScript
const SCAVIO_KEY = process.env.SCAVIO_API_KEY;

async function hybridSearch(query, domain) {
  const results = [];
  if (domain) {
    // Common Crawl CDX index (free); returns one JSON object per line
    const cc = await fetch(`https://index.commoncrawl.org/CC-MAIN-2026-04-index?url=${encodeURIComponent(domain + '/*')}&output=json&limit=10`);
    const lines = (await cc.text()).trim().split('\n');
    lines.forEach(line => { try { results.push({ ...JSON.parse(line), source: 'cc' }); } catch {} });
  }
  // Live search ($0.005)
  const live = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: { 'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, country_code: 'us', num_results: 5 })
  });
  for (const r of (await live.json()).organic_results || []) {
    results.push({ url: r.link, title: r.title, source: 'live' });
  }
  console.log(`${results.length} results (cost: $0.005)`);
  return results;
}

hybridSearch('python asyncio', 'docs.python.org');

Expected Output

Text
Found 15 archived pages from docs.python.org
  [20260115] https://docs.python.org/3/library/asyncio.html
  [20260115] https://docs.python.org/3/whatsnew/3.14.html

Historical (Common Crawl): 15 pages
Live (Scavio): 5 results
Total: 20 | Cost: $0.005

Domain: docs.python.org
Common Crawl pages: 50
Live search pages: 12
In both: 8
CC only (may be removed): 42
Live only (new content): 4
Cost: 3 live searches = $0.015

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What are the prerequisites?

Python 3.9+, the requests library, a Scavio API key from scavio.dev, and basic familiarity with the Common Crawl index. A Scavio API key gives you 250 free credits per month.

Can I complete this tutorial on the free tier?

Yes. The free tier includes 250 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio integrate with frameworks and other tools?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.

Start Building

Grab a free Scavio API key at scavio.dev, point the examples at the newest Common Crawl index, and you have archive-scale history plus live-search freshness for half a cent per query.