Tutorial

How to Benchmark Scrapers by Success Rate Across 500 Sites

Benchmark your scraper's success rate across 500 real sites with Scavio acting as the ground-truth fallback.

Success rate, the percentage of target sites that return clean, structured data, is the metric that matters most for a scraper. This tutorial runs a 500-site benchmark that tries a candidate scraper on each URL first, then Scavio as the ground-truth reference, and reports each tool's success rate along with Cloudflare blocks and empty responses.

Prerequisites

  • Python 3.10+
  • A Scavio API key
  • A candidate scraper (your own or a competitor's)
  • A 500-URL test list

Walkthrough

Step 1: Build the URL panel

Assemble 500 URLs spread across Cloudflare-protected, JS-heavy, and simple static sites. The loader expects panel.csv to hold one URL per row in the first column.

Python
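import csv

# Read the first column of panel.csv, skipping blank rows.
with open('panel.csv', newline='') as f:
    URLS = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]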
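
A panel assembled by hand often contains duplicates or malformed entries, which would skew the denominator of every rate reported later. The following is a minimal sketch of a cleanup pass, assuming the panel is a plain list of URL strings; `clean_panel` is a hypothetical helper name, not part of Scavio or of the original tutorial.

```python
from urllib.parse import urlsplit

def clean_panel(urls):
    """Drop blanks, malformed entries, and duplicates while preserving order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # e.g. a malformed IPv6 host
        # Keep only absolute http(s) URLs that have a hostname.
        if parts.scheme not in ('http', 'https') or not parts.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned

# Made-up sample entries, for illustration only.
sample = ['https://example.com', ' https://example.com ', 'not-a-url', '',
          'http://example.org/page']
print(clean_panel(sample))
```

In the tutorial's flow this would be applied as `URLS = clean_panel(URLS)` right after loading the CSV.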

Step 2: Define the benchmark loop

Hit the candidate first, then Scavio, and record the outcome of each.

Python
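import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']

def candidate_scrape(url):
    # Stand-in candidate: a plain GET. Swap in the scraper under test.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return ''  # network errors count as an empty response

def scavio_scrape(url):
    # Endpoint and payload are the ones given in this tutorial.
    try:
        r = requests.post('https://api.scavio.dev/api/v1/search',
            headers={'x-api-key': API_KEY},
            json={'query': url, 'platform': 'extract'},
            timeout=60)
        return r.json().get('html', '') or ''
    except (requests.RequestException, ValueError):
        return ''  # transport or JSON decode errors count as a failure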

Step 3: Score each outcome

A response counts as a success when it looks like real HTML with content: more than 500 characters long and containing a <body> tag.

Python
def is_success(html):
    return len(html) > 500 and '<body' in html.lower()
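
The pass/fail check above cannot tell a Cloudflare block from an empty response, although both are named as reported outcomes. The following is a rough sketch of a finer-grained classifier. The marker strings are heuristic assumptions based on the titles and script paths commonly seen on Cloudflare challenge pages, not an official or exhaustive list, and `classify` is a hypothetical helper name.

```python
# Heuristic markers (assumptions for illustration); match against lowercased HTML.
CLOUDFLARE_MARKERS = (
    'just a moment...',                 # common challenge-page title
    'attention required! | cloudflare', # common block-page title
    'cf-chl',                           # prefix seen in challenge tokens
)

def classify(html):
    """Return 'empty', 'cloudflare_block', 'success', or 'other_failure'."""
    text = html.lower()
    if not text.strip():
        return 'empty'
    if any(marker in text for marker in CLOUDFLARE_MARKERS):
        return 'cloudflare_block'
    # Same success rule as is_success: enough content and a <body> tag.
    if len(html) > 500 and '<body' in text:
        return 'success'
    return 'other_failure'

# Made-up inputs, for illustration only.
print(classify(''))
print(classify('<html><head><title>Just a moment...</title></head></html>'))
print(classify('<html><body>' + 'x' * 600 + '</body></html>'))
```

String matching like this can misfire on a legitimate page that happens to quote one of the markers, so the counts it produces are best treated as estimates.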

Step 4: Run the benchmark

Collect per-URL pass/fail for each tool.

Python
results = []
for u in URLS:
    cand = is_success(candidate_scrape(u))
    scav = is_success(scavio_scrape(u))
    results.append({'url': u, 'candidate': cand, 'scavio': scav})
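
Because a full run makes two network calls per URL, it may be worth writing the results to disk so they can be re-analyzed without re-scraping. The following is a minimal sketch using the standard csv module; `save_results`, `load_results`, and the results.csv file name are hypothetical, and the sample rows are made up.

```python
import csv
import os
import tempfile

FIELDS = ['url', 'candidate', 'scavio']

def save_results(results, path):
    """Write one row per URL with the pass/fail flag for each tool."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(results)

def load_results(path):
    """Read the rows back, turning 'True'/'False' text into booleans."""
    with open(path, newline='', encoding='utf-8') as f:
        return [
            {'url': row['url'],
             'candidate': row['candidate'] == 'True',
             'scavio': row['scavio'] == 'True'}
            for row in csv.DictReader(f)
        ]

# Round-trip demo with made-up rows in a temporary directory.
sample = [
    {'url': 'https://example.com', 'candidate': True, 'scavio': True},
    {'url': 'https://example.org', 'candidate': False, 'scavio': True},
]
path = os.path.join(tempfile.mkdtemp(), 'results.csv')
save_results(sample, path)
reloaded = load_results(path)
```

In the benchmark itself the call would be `save_results(results, 'results.csv')` once the loop finishes.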

Step 5: Publish the results

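Report each tool's success rate. The gap is the difference between the two rates; a Cloudflare block rate needs a per-URL failure reason, which the pass/fail loop in Step 4 does not record.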

Python
def summarize(r):
    n = len(r) or 1  # avoid dividing by zero on an empty run
    return {
        'candidate_rate': sum(1 for x in r if x['candidate']) / n,
        'scavio_rate': sum(1 for x in r if x['scavio']) / n,
    }

print(summarize(results))
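
The gap between the tools and the per-URL diff can be derived from the same list of result dictionaries. The following is a minimal sketch, assuming each row has the `url`, `candidate`, and `scavio` keys built in Step 4; `gap_report` is a hypothetical helper name and the sample rows are made up.

```python
def gap_report(results):
    """Return the rate gap plus the URLs where the two tools disagree."""
    n = len(results)
    if n == 0:
        return {'gap': 0.0, 'scavio_only': [], 'candidate_only': []}
    cand_rate = sum(1 for x in results if x['candidate']) / n
    scav_rate = sum(1 for x in results if x['scavio']) / n
    return {
        # Positive gap: Scavio succeeded on a larger share of the panel.
        'gap': scav_rate - cand_rate,
        'scavio_only': [x['url'] for x in results
                        if x['scavio'] and not x['candidate']],
        'candidate_only': [x['url'] for x in results
                           if x['candidate'] and not x['scavio']],
    }

# Made-up rows, for illustration only.
sample = [
    {'url': 'https://a.example', 'candidate': True, 'scavio': True},
    {'url': 'https://b.example', 'candidate': False, 'scavio': True},
    {'url': 'https://c.example', 'candidate': True, 'scavio': False},
    {'url': 'https://d.example', 'candidate': False, 'scavio': True},
]
report = gap_report(sample)
print(report)
```

The `scavio_only` list is usually the most actionable part: those are the sites where the candidate failed but the reference succeeded.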

Python Example

Python
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']
URLS = ['https://example.com', 'https://cloudflare-protected.com']

def scavio_extract(url):
    r = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY},
        json={'query': url, 'platform': 'extract'})
    return r.json().get('html', '')

wins = sum(1 for u in URLS if len(scavio_extract(u)) > 500)
print(f'Scavio success: {wins}/{len(URLS)}')

JavaScript Example

JavaScript
const API_KEY = process.env.SCAVIO_API_KEY;
const URLS = ['https://example.com', 'https://cloudflare-protected.com'];

async function scavioExtract(url) {
  const r = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: url, platform: 'extract' })
  });
  const d = await r.json();
  return d.html || '';
}

// Top-level await needs an ES module (a .mjs file or "type": "module").
let wins = 0;
for (const u of URLS) {
  if ((await scavioExtract(u)).length > 500) wins++;
}
console.log(`Scavio success: ${wins}/${URLS.length}`);

Expected Output

The run reports a per-tool success rate (e.g., candidate 62%, Scavio 94%), a Cloudflare block rate breakdown, and a per-URL diff. A typical benchmark run takes 20-30 minutes for 500 URLs.

Frequently Asked Questions

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

You need Python 3.10+, a Scavio API key (the free tier gives you 500 credits per month), a candidate scraper (your own or a competitor's), and a 500-URL test list.

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt to your framework of choice.
