Scraper success rate is the only metric that matters: the percent of target sites where you get clean, structured data. This tutorial runs a 500-site benchmark that hits a candidate scraper first, then Scavio as ground truth, and reports win rate, Cloudflare blocks, and empty responses.
Prerequisites
- Python 3.10+
- A Scavio API key
- A candidate scraper (your own or a competitor's)
- A 500-URL test list
Walkthrough
Step 1: Build the URL panel
500 URLs across Cloudflare-protected, JS-heavy, and simple static sites.
import csv

# Read the first column of panel.csv as the URL list, skipping blank rows.
with open('panel.csv', newline='') as f:
    URLS = [row[0] for row in csv.reader(f) if row]

Step 2: Define the benchmark loop
Hit candidate first, then Scavio, record outcomes.
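import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']

def candidate_scrape(url):
    # A plain GET stands in for the candidate scraper; any request failure counts as empty.
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return ''

def scavio_scrape(url):
    # Same failure handling as the candidate, so one bad URL cannot abort the run.
    try:
        r = requests.post('https://api.scavio.dev/api/v1/search',
                          headers={'x-api-key': API_KEY},
                          json={'query': url, 'platform': 'extract'},
                          timeout=60)  # timeout value is a guess; tune as needed
        return r.json().get('html', '')
    except (requests.RequestException, ValueError):
        return ''

Step 3: Score each outcome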
A response counts as a success when it is longer than 500 characters and contains a <body> tag.
def is_success(html):
    return len(html) > 500 and '<body' in html.lower()

Step 4: Run the benchmark
Collect per-URL pass/fail for each tool.
results = []
for u in URLS:
    # Candidate first, then Scavio, as described in Step 2.
    cand = is_success(candidate_scrape(u))
    scav = is_success(scavio_scrape(u))
    results.append({'url': u, 'candidate': cand, 'scavio': scav})

Step 5: Publish the results
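Compute each tool's success rate and the gap between them. The pass/fail records from Step 4 do not distinguish Cloudflare blocks from other failures, so a block rate requires classifying the raw responses.
def summarize(r):
    n = len(r)
    if n == 0:
        return {}  # nothing to summarize for an empty panel
    candidate_rate = sum(1 for x in r if x['candidate']) / n
    scavio_rate = sum(1 for x in r if x['scavio']) / n
    return {
        'candidate_rate': candidate_rate,
        'scavio_rate': scavio_rate,
        'gap': scavio_rate - candidate_rate,
    }

print(summarize(results))

Python Example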
import os
import requests

API_KEY = os.environ['SCAVIO_API_KEY']
URLS = ['https://example.com', 'https://cloudflare-protected.com']

def scavio_extract(url):
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': url, 'platform': 'extract'})
    return r.json().get('html', '')

# Count URLs whose extracted HTML exceeds the 500-character threshold.
wins = sum(1 for u in URLS if len(scavio_extract(u)) > 500)
print(f'Scavio success: {wins}/{len(URLS)}')

JavaScript Example
// Uses the global fetch (Node 18+) and top-level await, so run it as an ES module (for example, a .mjs file).
const API_KEY = process.env.SCAVIO_API_KEY;
const URLS = ['https://example.com', 'https://cloudflare-protected.com'];

async function scavioExtract(url) {
  const r = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: url, platform: 'extract' })
  });
  const d = await r.json();
  return d.html || '';
}

// Count URLs whose extracted HTML exceeds the 500-character threshold.
let wins = 0;
for (const u of URLS) {
  if ((await scavioExtract(u)).length > 500) wins++;
}
console.log(`Scavio success: ${wins}/${URLS.length}`);

Expected Output
Per-tool success rate (e.g., candidate 62%, Scavio 94%), Cloudflare block rate breakdown, and per-URL diff. Typical benchmark run: 20-30 minutes for 500 URLs.
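The Cloudflare block rate and per-URL diff listed above need more than a pass/fail flag per URL. The following is a minimal sketch of one way to get them, assuming you keep the raw HTML from each tool. The names `classify`, `report`, and `CLOUDFLARE_MARKERS`, the row shape, and the marker strings are all hypothetical illustrations, not part of any Scavio API; the markers are heuristics that will misfire on some pages and should be tuned against your own failures.

```python
from collections import Counter

# Heuristic strings often seen on Cloudflare challenge or block pages.
# ASSUMPTION: these markers are illustrative, not an exhaustive or official list.
CLOUDFLARE_MARKERS = ('just a moment', 'cf-chl', 'attention required! | cloudflare')

def classify(html):
    """Label one response: 'success', 'empty', 'cloudflare_block', or 'other_failure'."""
    text = (html or '').lower()
    if not text.strip():
        return 'empty'
    # Check for a block page before the success test: challenge pages can be long
    # and contain a <body> tag, so they would otherwise pass as successes.
    if any(marker in text for marker in CLOUDFLARE_MARKERS):
        return 'cloudflare_block'
    if len(text) > 500 and '<body' in text:
        return 'success'
    return 'other_failure'

def report(rows):
    """Summarize rows shaped like {'url', 'candidate_html', 'scavio_html'} (hypothetical shape)."""
    n = len(rows)
    if n == 0:
        return {}
    cand = [classify(r['candidate_html']) for r in rows]
    scav = [classify(r['scavio_html']) for r in rows]
    cand_counts, scav_counts = Counter(cand), Counter(scav)
    return {
        'candidate_rate': cand_counts['success'] / n,
        'scavio_rate': scav_counts['success'] / n,
        'gap': (scav_counts['success'] - cand_counts['success']) / n,
        'candidate_cloudflare_block_rate': cand_counts['cloudflare_block'] / n,
        'candidate_empty_rate': cand_counts['empty'] / n,
        # Per-URL diff: URLs where the two tools ended up with different labels.
        'diff_urls': [r['url'] for r, c, s in zip(rows, cand, scav) if c != s],
    }
```

To use it, store the HTML returned by each scraper in the benchmark loop instead of only the boolean, then pass the collected rows to `report`. With the example figures above (candidate 62%, Scavio 94%), the gap would come out as 0.32, or 32 percentage points.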