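Building a web scraper yourself means managing a rotating proxy pool, solving CAPTCHAs, handling JavaScript rendering, and parsing raw HTML, all of which take significant engineering effort and ongoing maintenance. The Scavio API is a managed search-data service that handles all of that infrastructure server-side. You send one authenticated HTTP POST and receive structured JSON. This tutorial compares the traditional proxy-based approach with the Scavio approach and shows how to migrate from a scraper to the API.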
Prerequisites
- Python 3.8 or later
- The requests library installed
- A Scavio API key
- Basic familiarity with HTTP requests
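The checklist above can be verified from a script before you start. The following is a minimal sketch, not an official setup tool: check_prerequisites is a hypothetical helper name, and it assumes the key is supplied through the SCAVIO_API_KEY environment variable that the full Python example later in this tutorial reads.

```python
import importlib.util
import os
import sys


def check_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or later is required")
    # find_spec looks the package up without importing it
    if importlib.util.find_spec("requests") is None:
        problems.append("requests is not installed (pip install requests)")
    # Assumption: the API key is provided via this environment variable
    if not env.get("SCAVIO_API_KEY"):
        problems.append("SCAVIO_API_KEY is not set")
    return problems


for problem in check_prerequisites():
    print("missing:", problem)
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.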
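Walkthrough
Step 1: The traditional scraping approach (before)
A typical scraper needs proxy configuration, User-Agent rotation, and HTML parsing, and every one of these can break when the target site changes.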
# Traditional approach: fragile, and it requires proxy infrastructure
import requests
from bs4 import BeautifulSoup

proxies = {"http": "http://user:pass@proxy:8080", "https": "http://user:pass@proxy:8080"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."}
response = requests.get("https://www.google.com/search?q=python+tutorial",
                        proxies=proxies, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# Fragile: class names change without notice
results = soup.find_all("div", class_="tF2Cxc")

Step 2: The Scavio approach (after)
Replace the scraper with a single API call: no proxies, no HTML parsing, and no scraper maintenance.
# Scavio approach: stable, structured, no infrastructure to run
import requests

response = requests.post(
    "https://api.scavio.dev/api/v1/search",
    headers={"x-api-key": "your_scavio_api_key"},
    json={"query": "python tutorial", "country_code": "us"},
)
results = response.json()["organic_results"]

Step 3: Handle retries with exponential backoff
The Scavio API is reliable, but it is still worth adding a simple retry wrapper to absorb transient network errors.
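The wrapper in this step sleeps 2 ** attempt seconds after each failed attempt except the last. As a rough sketch of that schedule, the hypothetical helper below lists the waits for a given retry budget. The jitter parameter is an assumption added for illustration (it spreads out retries from many clients) and is not part of the tutorial's wrapper.

```python
import random


def backoff_delays(max_retries=3, base=2.0, jitter=0.0, rng=random.random):
    """Seconds to sleep after each failed attempt except the last one."""
    delays = []
    for attempt in range(max_retries - 1):
        delay = base ** attempt
        # Optional jitter: add up to `jitter` times the delay (illustrative only)
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


print(backoff_delays())   # [1.0, 2.0]
print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0]
```

With the default of three attempts, the worst case adds about three seconds of waiting on top of the request timeouts.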
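import os
import time
import requests

API_KEY = os.environ.get("SCAVIO_API_KEY", "your_scavio_api_key")
ENDPOINT = "https://api.scavio.dev/api/v1/search"

def search_with_retry(query: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            r = requests.post(ENDPOINT, headers={"x-api-key": API_KEY},
                              json={"query": query, "country_code": "us"}, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(2 ** attempt)  # wait 1 s, 2 s, 4 s, ...
    return {}  # only reached when max_retries is 0

Step 4: Validate the response schema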
Add a simple schema check so you know the response contains the expected fields before you process it.
def validate_response(data: dict) -> bool:
    # Top-level keys that the rest of the code relies on
    required = ["organic_results"]
    return all(k in data for k in required)

data = search_with_retry("python tutorial")
if validate_response(data):
    for r in data["organic_results"][:5]:
        print(r["title"], r["link"])

Python example
import os
import time
import requests

API_KEY = os.environ.get("SCAVIO_API_KEY", "your_scavio_api_key")
ENDPOINT = "https://api.scavio.dev/api/v1/search"

def search(query: str, retries: int = 3) -> dict:
    for i in range(retries):
        try:
            r = requests.post(ENDPOINT, headers={"x-api-key": API_KEY},
                              json={"query": query, "country_code": "us"}, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            if i < retries - 1:
                time.sleep(2 ** i)  # exponential backoff between attempts
            else:
                raise
    return {}

if __name__ == "__main__":
    # No proxies, no HTML parsing, no CAPTCHA solving
    data = search("python tutorial")
    for r in data.get("organic_results", [])[:5]:
        print(f"{r['title']}\n{r['link']}\n")

JavaScript example
const API_KEY = process.env.SCAVIO_API_KEY || "your_scavio_api_key";
const ENDPOINT = "https://api.scavio.dev/api/v1/search";

async function search(query, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const res = await fetch(ENDPOINT, {
        method: "POST",
        headers: { "x-api-key": API_KEY, "Content-Type": "application/json" },
        body: JSON.stringify({ query, country_code: "us" })
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      // await here so a failed body parse is caught and retried too
      return await res.json();
    } catch (e) {
      if (i === retries - 1) throw e;
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
    }
  }
}

// No proxies, no HTML parsing
search("python tutorial").then(data => {
  (data.organic_results || []).slice(0, 5).forEach(r => console.log(`${r.title}\n${r.link}\n`));
}).catch(console.error);

Expected output
Traditional approach: 47 lines, 3 dependencies, breaks monthly
Scavio approach: 8 lines, 1 dependency, stable
Sample output:
{
"organic_results": [
{ "position": 1, "title": "Python Tutorial — W3Schools", "link": "https://w3schools.com/python/" },
{ "position": 2, "title": "The Python Tutorial — Python Docs", "link": "https://docs.python.org/tutorial/" }
]
}
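The sample response suggests that each result carries position, title, and link. As a minimal sketch, assuming those three fields are the ones your code depends on, a per-item filter can drop malformed entries before they reach downstream code. clean_results is a hypothetical helper name, the titles are shortened from the sample, and the third entry is invented here to show the filter at work.

```python
def clean_results(data, required=("position", "title", "link")):
    """Keep only organic results that are dicts carrying every required field."""
    results = data.get("organic_results")
    if not isinstance(results, list):
        return []
    return [r for r in results
            if isinstance(r, dict) and all(k in r for k in required)]


sample = {
    "organic_results": [
        {"position": 1, "title": "Python Tutorial", "link": "https://w3schools.com/python/"},
        {"position": 2, "title": "The Python Tutorial", "link": "https://docs.python.org/tutorial/"},
        {"position": 3, "title": "Entry with no link"},  # hypothetical malformed item
    ]
}

for r in clean_results(sample):
    print(f"{r['position']}. {r['title']} <{r['link']}>")
```

Filtering per item complements the top-level check from Step 4: that check confirms organic_results exists, while this one guards the fields read from each entry.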