Firecrawl 适合 10 页爬行,但每周运行 10K+ 页面刷新的团队会达到速率限制和定价墙。本教程将大型爬网工作负载迁移到 Scavio 的 /crawl 端点,具有更高的并发性和每页定价。
前置条件
- 要迁移的现有 Firecrawl 工作负载
- Scavio API 密钥(建议使用付费层以实现并发)
- Python 3.10+
操作指南
步骤 1: 盘点当前的抓取
导出您的 Firecrawl 种子 URL 和频率。
Python
# From Firecrawl dashboard, export crawl job config:
SEEDS = ['https://docs.site.com']
DEPTH = 3步骤 2: 在 Scavio 中对爬网进行排队
Scavio 返回用于异步轮询的 job_id。
Python
import requests, os
API_KEY = os.environ['SCAVIO_API_KEY']
def start_crawl(seed, depth):
r = requests.post('https://api.scavio.dev/api/v1/search',
headers={'x-api-key': API_KEY},
json={'query': seed, 'platform': 'crawl', 'depth': depth, 'format': 'markdown'})
return r.json()['job_id']步骤 3: 投票完成
Scavio 在页面完成时对页面进行流式传输。
Python
def poll(job_id):
r = requests.post('https://api.scavio.dev/api/v1/search',
headers={'x-api-key': API_KEY},
json={'query': job_id, 'platform': 'crawl_status'})
return r.json()步骤 4: 将页面另存为 markdown
输出格式与 Firecrawl 相同,因此下游摄取保持不变。
Python
import os
def save(pages, outdir):
os.makedirs(outdir, exist_ok=True)
for i, p in enumerate(pages):
with open(f'{outdir}/page_{i}.md', 'w') as f:
f.write(p['markdown'])步骤 5: 安排每周刷新
Cron 或 GitHub Actions 启动每周爬行。
# .github/workflows/crawl.yml
on:
schedule: [{cron: '0 4 * * 1'}]
jobs:
crawl:
runs-on: ubuntu-latest
steps: [{run: python crawl.py}]Python 示例
Python
import os, requests, time
API_KEY = os.environ['SCAVIO_API_KEY']
def crawl(seed, depth=2):
start = requests.post('https://api.scavio.dev/api/v1/search',
headers={'x-api-key': API_KEY},
json={'query': seed, 'platform': 'crawl', 'depth': depth, 'format': 'markdown'})
job = start.json()['job_id']
while True:
s = requests.post('https://api.scavio.dev/api/v1/search',
headers={'x-api-key': API_KEY},
json={'query': job, 'platform': 'crawl_status'}).json()
if s['status'] == 'done': return s['pages']
time.sleep(5)
print(len(crawl('https://docs.example.com', depth=2)))JavaScript 示例
JavaScript
const API_KEY = process.env.SCAVIO_API_KEY;
async function crawl(seed, depth = 2) {
const start = await (await fetch('https://api.scavio.dev/api/v1/search', {
method: 'POST',
headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
body: JSON.stringify({ query: seed, platform: 'crawl', depth, format: 'markdown' })
})).json();
const job = start.job_id;
while (true) {
const s = await (await fetch('https://api.scavio.dev/api/v1/search', {
method: 'POST',
headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
body: JSON.stringify({ query: job, platform: 'crawl_status' })
})).json();
if (s.status === 'done') return s.pages;
await new Promise(r => setTimeout(r, 5000));
}
}预期输出
JSON
Weekly 10K-page crawl completes in 20-40 minutes. Markdown output identical to Firecrawl. Per-page cost: 1 credit.