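Web scrapers that parse the HTML of Google, Reddit, or Amazon are the most fragile part of any data pipeline. When the target site changes its layout, your scraper breaks. When the site detects your traffic, you get blocked. As you scale up, proxy costs climb. A structured search API returns the same data as clean JSON, with no parsing, no proxies, and no maintenance. This tutorial walks through replacing a typical scraper with Scavio's API, step by step.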
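Prerequisites
- Python 3.8+ installed
- An existing scraper you want to migrate (BeautifulSoup, Playwright, or Selenium)
- A Scavio API key from scavio.dev
How-to guide
Step 1: Audit your scraper's data output
Identify the fields your scraper currently extracts. Most Google scrapers extract four: title, URL, snippet, and position.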
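One way to run this audit is to count the field names across a sample of records your scraper has already produced. The following is a minimal sketch; `audit_fields` and the `sample` records are hypothetical and stand in for your own data.

```python
from collections import Counter

def audit_fields(records):
    """Count how often each field name appears across scraper output records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return dict(counts)

# Hypothetical sample of existing scraper output (illustrative, not real data).
sample = [
    {'title': 'A', 'url': 'https://example.com/a', 'snippet': 'x', 'position': 1},
    {'title': 'B', 'url': 'https://example.com/b', 'snippet': 'y'},
]

print(audit_fields(sample))
```

Fields that appear in fewer records than others (here `position`) are the ones downstream code may already treat as optional, so they deserve a second look during the migration.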
# Typical scraper output:
# [
# {'title': '...', 'url': '...', 'snippet': '...', 'position': 1},
# {'title': '...', 'url': '...', 'snippet': '...', 'position': 2},
# ]
#
# Scavio's 'organic' array returns the same fields:
# [
# {'title': '...', 'link': '...', 'snippet': '...', 'position': 1},
# ]
# Only difference: 'url' -> 'link'
Step 2: Replace the scraping function
Replace your scraping code with a single API call.
import os
import requests
# BEFORE: 150 lines of scraping code
# from bs4 import BeautifulSoup
# import random
# PROXIES = [...]
# def scrape_google(query):
#     proxy = random.choice(PROXIES)
#     resp = requests.get(f'https://www.google.com/search?q={query}',
#                         proxies={'https': proxy}, headers={'User-Agent': ...})
#     soup = BeautifulSoup(resp.text, 'html.parser')
#     results = []
#     for div in soup.select('div.g'):
#         ...  # 100 lines of parsing
# AFTER: 10 lines
def search_google(query: str) -> list:
    resp = requests.post('https://api.scavio.dev/api/v1/search',
                         headers={'x-api-key': os.environ['SCAVIO_API_KEY']},
                         json={'platform': 'google', 'query': query}, timeout=10)
    # Map 'link' back to 'url' so the output matches the old scraper's format.
    return [{'title': r['title'], 'url': r['link'], 'snippet': r['snippet'], 'position': r.get('position', i+1)}
            for i, r in enumerate(resp.json().get('organic', []))]
Step 3: Update downstream field references
If your code references scraper-specific field names, update them.
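If renaming every downstream reference at once is impractical, one option is a small adapter that maps the API's field names back to the legacy ones. The sketch below assumes the legacy names `url`, `desc`, and `rank`; `SCAVIO_TO_LEGACY` and `to_legacy` are hypothetical names introduced for illustration and are not part of any Scavio SDK.

```python
# Hypothetical mapping from API field names to legacy scraper field names.
# Adjust the right-hand side to whatever your old scraper actually emitted.
SCAVIO_TO_LEGACY = {'link': 'url', 'snippet': 'desc', 'position': 'rank'}

def to_legacy(result: dict) -> dict:
    """Rename mapped keys; keys without a mapping pass through unchanged."""
    return {SCAVIO_TO_LEGACY.get(key, key): value for key, value in result.items()}

api_result = {'title': 'T', 'link': 'https://example.com', 'snippet': 'S', 'position': 1}
print(to_legacy(api_result))
```

An adapter like this lets downstream code keep working while references are migrated gradually. It can be deleted once nothing reads the legacy names.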
# Find all references to the old scraper output format:
# grep -r 'scrape_google\|from scraper\|import scraper' .
# Common field mapping:
# Old scraper -> Scavio API
# result.url -> result.link
# result.desc -> result.snippet
# result.rank -> result.position
Step 4: Remove proxy and parser dependencies
Clean up your requirements file and remove the scraping infrastructure.
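To check that nothing was missed, a short script can flag scraping-related packages still listed in a requirements file. This is a rough sketch: `leftover_scraping_packages` is a hypothetical helper, the package set is only a starting point, and the name parsing covers common `requirements.txt` lines rather than the full format.

```python
import re

# Packages commonly pulled in by scraping setups; extend as needed.
SCRAPING_PACKAGES = {
    'beautifulsoup4', 'lxml', 'playwright', 'selenium',
    'webdriver-manager', 'fake-useragent', 'rotating-proxies',
}

def leftover_scraping_packages(requirements_text: str) -> list:
    """Return scraping-related package names still present in the given text."""
    found = []
    for line in requirements_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The package name ends at the first version specifier, extra, or marker.
        name = re.split(r'[<>=!~\[;\s]', line, maxsplit=1)[0].lower()
        if name in SCRAPING_PACKAGES:
            found.append(name)
    return found

example = "requests==2.31.0\nbeautifulsoup4>=4.12\n# selenium\nlxml\n"
print(leftover_scraping_packages(example))
```

To run it against a real project, read the file first, for example `leftover_scraping_packages(open('requirements.txt').read())`.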
# Remove from requirements.txt:
# beautifulsoup4
# lxml
# playwright
# selenium
# webdriver-manager
# fake-useragent
# rotating-proxies
# Remove proxy configuration files
# Cancel proxy subscription (saves $50-200/month)
# Your requirements.txt now just needs:
# requests
Python example
# Migration summary:
# Before: 150 lines + proxy subscription + maintenance
# After: 10 lines + $0.003/query + zero maintenance
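import os
import requests
# Auth header; the API key is read from the environment.
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
def search(query, platform='google'):
    # Returns the 'organic' result list, or [] if the response has no such key.
    return requests.post('https://api.scavio.dev/api/v1/search',
                         headers=H, json={'platform': platform, 'query': query},
                         timeout=10).json().get('organic', [])
JavaScript example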
// Before: Playwright + proxy rotation + HTML parsing
// After:
async function search(query, platform = 'google') {
  const resp = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'},
    body: JSON.stringify({platform, query})
  });
  return (await resp.json()).organic || [];
}
Expected output
A clean search function replacing hundreds of lines of scraping code. No proxies, no parsing, no maintenance.