Firecrawl works fine for 10-page crawls, but teams running weekly refreshes of 10K+ pages hit rate limits and pricing walls. This tutorial migrates a large-crawl workload to Scavio's /crawl endpoint, which offers higher concurrency and per-page pricing.
Prerequisites
- An existing Firecrawl workload to migrate
- A Scavio API key (paid tier recommended for concurrency)
- Python 3.10+
Walkthrough
Step 1: Inventory your current crawl
Export your Firecrawl seed URLs and frequency.
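If the seed list has grown organically, it is worth deduplicating and normalizing it before migrating, so you don't pay per page for redundant crawls. A minimal sketch (the helper name and normalization rules are illustrative, not part of either API):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_seeds(seeds):
    """Lowercase scheme and host, strip fragments and trailing slashes, drop duplicates."""
    seen, out = set(), []
    for url in seeds:
        parts = urlsplit(url.strip())
        norm = urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path.rstrip('/'), parts.query, ''))
        if norm not in seen:
            seen.add(norm)
            out.append(norm)
    return out

print(normalize_seeds(['https://Docs.Site.com/', 'https://docs.site.com#intro']))
# → ['https://docs.site.com']  (both seeds normalize to the same URL)
```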
```python
# From the Firecrawl dashboard, export the crawl job config:
SEEDS = ['https://docs.site.com']
DEPTH = 3
```

Step 2: Queue the crawl in Scavio
Scavio returns a job_id for async polling.
```python
import requests, os

API_KEY = os.environ['SCAVIO_API_KEY']

def start_crawl(seed, depth):
    # Queue an async crawl; Scavio responds immediately with a job_id.
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': seed, 'platform': 'crawl', 'depth': depth, 'format': 'markdown'})
    r.raise_for_status()
    return r.json()['job_id']
```

Step 3: Poll for completion
Scavio streams pages as they complete.
```python
def poll(job_id):
    # Check crawl status; the response includes pages completed so far.
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': job_id, 'platform': 'crawl_status'})
    r.raise_for_status()
    return r.json()
```

Step 4: Save pages as markdown
Same output format as Firecrawl, so downstream ingestion stays the same.
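The index-based page_{i}.md names below change whenever crawl order changes. If each page dict also carries its source URL (an assumption — check the actual response shape), deriving a slug from it keeps filenames stable across weekly refreshes:

```python
import re
from urllib.parse import urlsplit

def slug_for(url):
    """Turn a page URL into a filesystem-safe markdown filename."""
    parts = urlsplit(url)
    raw = f"{parts.netloc}{parts.path}".strip('/')
    return re.sub(r'[^a-z0-9]+', '-', raw.lower()).strip('-') + '.md'

print(slug_for('https://docs.example.com/guides/Quick Start/'))
# → docs-example-com-guides-quick-start.md
```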
```python
import os

def save(pages, outdir):
    os.makedirs(outdir, exist_ok=True)
    for i, p in enumerate(pages):
        with open(f'{outdir}/page_{i}.md', 'w') as f:
            f.write(p['markdown'])
```

Step 5: Schedule weekly refresh
Cron or GitHub Actions kicks off the weekly crawl.
```yaml
# .github/workflows/crawl.yml
on:
  schedule:
    - cron: '0 4 * * 1'  # every Monday at 04:00 UTC
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # check out the repo so crawl.py is present
      - run: python crawl.py
```

Python Example
```python
import os, requests, time

API_KEY = os.environ['SCAVIO_API_KEY']

def crawl(seed, depth=2):
    # Queue the crawl, then poll every 5 seconds until it completes.
    start = requests.post('https://api.scavio.dev/api/v1/search',
                          headers={'x-api-key': API_KEY},
                          json={'query': seed, 'platform': 'crawl', 'depth': depth, 'format': 'markdown'})
    start.raise_for_status()
    job = start.json()['job_id']
    while True:
        s = requests.post('https://api.scavio.dev/api/v1/search',
                          headers={'x-api-key': API_KEY},
                          json={'query': job, 'platform': 'crawl_status'}).json()
        if s['status'] == 'done':
            return s['pages']
        time.sleep(5)

print(len(crawl('https://docs.example.com', depth=2)))
```

JavaScript Example
```javascript
const API_KEY = process.env.SCAVIO_API_KEY;

async function crawl(seed, depth = 2) {
  // Queue the crawl, then poll every 5 seconds until it completes.
  const start = await (await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: seed, platform: 'crawl', depth, format: 'markdown' })
  })).json();
  const job = start.job_id;
  while (true) {
    const s = await (await fetch('https://api.scavio.dev/api/v1/search', {
      method: 'POST',
      headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: job, platform: 'crawl_status' })
    })).json();
    if (s.status === 'done') return s.pages;
    await new Promise(r => setTimeout(r, 5000));
  }
}
```

Expected Output
A weekly 10K-page crawl completes in 20-40 minutes. The markdown output is identical to Firecrawl's, so downstream ingestion is unchanged. Per-page cost: 1 credit.
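The examples poll on a fixed 5-second interval; at 10K-page scale, exponential backoff with a cap is gentler on your own rate limits. A sketch of the delay schedule as a pure helper (illustrative, not a Scavio requirement):

```python
def backoff_delays(base=1.0, factor=2.0, cap=30.0, attempts=6):
    """Yield capped exponential poll delays in seconds."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor

print(list(backoff_delays()))
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

In the polling loop, replace the fixed time.sleep(5) with sleeping over these delays, and treat exhausting the schedule as a timeout to retry or report.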