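Playwright and Puppeteer are powerful, but they are slow, expensive, and fragile when used to extract data from well-known platforms. A structured API returns the same data in well under a second (about 0.5 s per request in this tutorial's comparison), with no browser overhead, proxy costs, or CAPTCHA handling. This tutorial shows which use cases can be replaced right away and which still need browser automation, with an honest look at the trade-offs.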
Prerequisites
- Python 3.8+
- The requests library
- A Scavio API key from scavio.dev
- Existing Playwright/Puppeteer code (optional)
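The prerequisites above can be checked before running any of the steps. The following is a minimal sketch, not part of the original tutorial: `check_prerequisites` is a hypothetical helper name, and the only thing it takes from the tutorial is the `SCAVIO_API_KEY` environment variable used in the later code.

```python
import importlib.util
import sys


def check_prerequisites(env):
    """Return a list of human-readable problems; an empty list means ready.

    `env` is any mapping of environment variables (for example os.environ).
    This helper is illustrative and not part of any Scavio tooling.
    """
    problems = []
    if sys.version_info < (3, 8):
        problems.append(f'Python 3.8+ required, found {sys.version.split()[0]}')
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec('requests') is None:
        problems.append('requests is not installed (pip install requests)')
    if not env.get('SCAVIO_API_KEY', '').strip():
        problems.append('SCAVIO_API_KEY is not set')
    return problems


if __name__ == '__main__':
    import os
    issues = check_prerequisites(os.environ)
    print('Ready' if not issues else '\n'.join(issues))
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the check easy to exercise with a plain dictionary.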
Walkthrough
Step 1: Identify the Browser Automation to Replace
Classify your browser automation tasks by whether they can move to an API or not.
Python
import os
import requests

# The API key is read from the environment; SH is reused by every later step.
API_KEY = os.environ['SCAVIO_API_KEY']
SH = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

# CAN REPLACE with API:
replaceable = {
    'Google search scraping': 'Scavio search API (google platform)',
    'Amazon product scraping': 'Scavio search API (amazon platform)',
    'Reddit thread scraping': 'Scavio search API (reddit platform)',
    'YouTube search scraping': 'Scavio search API (youtube platform)',
    'Walmart product scraping': 'Scavio search API (walmart platform)',
    'TikTok profile scraping': 'Scavio TikTok API (profile endpoint)',
    'TikTok video data': 'Scavio TikTok API (user/posts endpoint)',
    'Google Maps data': 'Scavio search API (local_results field)',
}

# STILL NEED BROWSER:
need_browser = {
    'Custom web apps': 'No structured API for proprietary sites',
    'Login-required pages': 'API cannot authenticate to private accounts',
    'Interactive forms': 'Form submissions need browser context',
    'Screenshot capture': 'Visual rendering requires a browser',
    'Cookie-dependent flows': 'Session state needs browser persistence',
}

print('Replaceable with API:')
for task, api in replaceable.items():
    print(f' {task:35} -> {api}')
print(f'\nStill needs browser ({len(need_browser)} cases):')
for task, reason in need_browser.items():
    print(f' {task:35} | {reason}')
Step 2: Side-by-Side Code Comparison
Compare Playwright browser code with the equivalent API call for a common task.
Python
# BEFORE: Playwright Google scraping (~20 lines, 3-5 seconds)
# from playwright.async_api import async_playwright
# async def scrape_google(query):
#     async with async_playwright() as p:
#         browser = await p.chromium.launch(headless=True)
#         page = await browser.new_page()
#         await page.goto(f'https://www.google.com/search?q={query}')
#         await page.wait_for_selector('div.g')
#         results = await page.query_selector_all('div.g')
#         data = []
#         for r in results[:10]:
#             title = await r.query_selector('h3')
#             link = await r.query_selector('a')
#             data.append({'title': await title.inner_text() if title else '',
#                          'link': await link.get_attribute('href') if link else ''})
#         await browser.close()
#         return data  # Takes 3-5 seconds, breaks on layout changes

# AFTER: API call (~3 lines, <1 second)
# Reuses requests and the SH headers defined in Step 1.
import time

def search_google(query):
    data = requests.post('https://api.scavio.dev/api/v1/search',
                         headers=SH, json={'query': query, 'country_code': 'us'}).json()
    return data.get('organic_results', [])

# Time a single API call for comparison with the browser version.
start = time.time()
results = search_google('python web framework 2026')
elapsed = time.time() - start
print(f'API: {len(results)} results in {elapsed:.2f}s')
print(f'vs Playwright: ~3-5 seconds + browser memory + proxy cost')
Step 3: Migrate a Real Scraping Pipeline
Migrate a multi-page scraper to API calls one stage at a time.
Python
def migrate_pipeline():
    """Migrate a typical multi-page scraping pipeline to API."""
    # Step 1: Replace search scraping
    google_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'country_code': 'us'}).json()
    print(f'Google: {len(google_results.get("organic_results", []))} results')

    # Step 2: Replace Amazon scraping
    amazon_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'platform': 'amazon', 'country_code': 'us'}).json()
    print(f'Amazon: {len(amazon_results.get("organic_results", []))} products')

    # Step 3: Replace Reddit scraping
    reddit_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds review', 'platform': 'reddit', 'country_code': 'us'}).json()
    print(f'Reddit: {len(reddit_results.get("organic_results", []))} discussions')

    # Step 4: Replace page content extraction (uses the first Google result, if any)
    if google_results.get('organic_results'):
        url = google_results['organic_results'][0].get('link', '')
        if url:
            extract = requests.post('https://api.scavio.dev/api/v1/extract',
                headers=SH, json={'url': url}).json()
            print(f'Extract: {len(str(extract.get("content", "")))} chars from {url[:40]}')

    # The figures below are the tutorial's stated estimates (4 calls x $0.005),
    # not values measured by this function.
    print(f'\nTotal cost: $0.020 (4 API calls)')
    print(f'Total time: <2 seconds')
    print(f'Browser instances: 0')
    print(f'Proxy cost: $0')
    print(f'CAPTCHA blocks: 0')

migrate_pipeline()
Step 4: Compare Cost and Performance
Calculate the total cost of ownership for the browser approach and the API approach.
Python
def tco_comparison(monthly_pages):
    print(f'\n=== Total Cost of Ownership ({monthly_pages:,} pages/month) ===')

    # Playwright/Puppeteer costs (all in $/month)
    browser_server = 50  # Cloud server for browsers
    proxy = 30  # Proxy service
    captcha = monthly_pages * 0.05 * 0.002  # 5% CAPTCHA rate, $0.002/solve
    maintenance = 8 * 50  # 8 hours/month @ $50/hr fixing selectors
    browser_total = browser_server + proxy + captcha + maintenance
    print(f'\n BROWSER AUTOMATION:')
    print(f' Server (headless Chrome): ${browser_server}/mo')
    print(f' Proxy service: ${proxy}/mo')
    print(f' CAPTCHA solving (~5%): ${captcha:.2f}/mo')
    print(f' Maintenance (selector fixes): ${maintenance}/mo')
    print(f' Total: ${browser_total:.2f}/mo')

    # API costs: one request per page at $0.005
    api_cost = monthly_pages * 0.005
    print(f'\n STRUCTURED API:')
    print(f' Scavio API: ${api_cost:.2f}/mo ({monthly_pages:,} x $0.005)')
    print(f' Server: $0 (runs anywhere)')
    print(f' Proxy: $0 (not needed)')
    print(f' CAPTCHA: $0 (not needed)')
    print(f' Maintenance: ~$0 (stable JSON)')
    print(f' Total: ${api_cost:.2f}/mo')

    savings = browser_total - api_cost
    print(f'\n SAVINGS: ${savings:.2f}/mo ({savings/browser_total*100:.0f}%)')
    print(f' SPEED: ~0.5s/request (API) vs ~3-5s/page (browser)')
    print(f' RELIABILITY: 99%+ (API) vs 85-95% (browser)')

tco_comparison(5000)
tco_comparison(20000)
Python Example
Python
import os
import time
import requests

SH = {'x-api-key': os.environ['SCAVIO_API_KEY'], 'Content-Type': 'application/json'}

# Replace Playwright/Puppeteer with:
start = time.time()
for platform in [None, 'amazon', 'reddit']:
    body = {'query': 'wireless earbuds', 'country_code': 'us'}
    if platform:
        body['platform'] = platform  # omitted for the default Google search
    data = requests.post('https://api.scavio.dev/api/v1/search', headers=SH, json=body).json()
    print(f'{platform or "google"}: {len(data.get("organic_results", []))} results')
print(f'Time: {time.time()-start:.2f}s | Cost: $0.015 | Browser: none')
JavaScript Example
JavaScript
// Run as an ES module (for example a .mjs file) so top-level await works;
// the global fetch used here is available in Node.js 18+.
const SH = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };

// Replace Puppeteer with:
const start = Date.now();
for (const platform of [null, 'amazon', 'reddit']) {
  const body = { query: 'wireless earbuds', country_code: 'us' };
  if (platform) body.platform = platform;
  const data = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: SH, body: JSON.stringify(body)
  }).then(r => r.json());
  console.log(`${platform || 'google'}: ${(data.organic_results || []).length} results`);
}
console.log(`Time: ${(Date.now()-start)/1000}s | Cost: $0.015 | Browser: none`);
Expected Output
Text
Replaceable with API:
Google search scraping -> Scavio search API (google platform)
Amazon product scraping -> Scavio search API (amazon platform)
Reddit thread scraping -> Scavio search API (reddit platform)
Still needs browser (5 cases):
Custom web apps | No structured API for proprietary sites
Login-required pages | API cannot authenticate to private accounts
API: 10 results in 0.45s
vs Playwright: ~3-5 seconds + browser memory + proxy cost
=== Total Cost of Ownership (5,000 pages/month) ===
BROWSER AUTOMATION: $480.50/mo
STRUCTURED API: $25.00/mo
SAVINGS: $455.50/mo (95%)
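The savings figure depends on volume, because the API is billed per request while most of the browser costs in Step 4 are fixed. As a rough sketch of that trade-off, the snippet below reuses the Step 4 constants to estimate the monthly volume at which the two approaches cost the same. The names `break_even_pages` and `monthly_costs` are illustrative, and the sketch assumes server, proxy, and maintenance costs stay flat as volume grows, which is a simplification.

```python
# Constants copied from tco_comparison() in Step 4; they are the tutorial's
# estimates, not measured prices.
FIXED_BROWSER_COST = 50 + 30 + 8 * 50   # server + proxy + maintenance, $/month
CAPTCHA_COST_PER_PAGE = 0.05 * 0.002    # 5% CAPTCHA rate x $0.002 per solve
API_COST_PER_PAGE = 0.005               # $ per API request


def break_even_pages():
    """Monthly page volume at which API spend equals browser spend."""
    return FIXED_BROWSER_COST / (API_COST_PER_PAGE - CAPTCHA_COST_PER_PAGE)


def monthly_costs(pages):
    """Return (browser_cost, api_cost) in dollars for a monthly page volume."""
    browser = FIXED_BROWSER_COST + pages * CAPTCHA_COST_PER_PAGE
    api = pages * API_COST_PER_PAGE
    return browser, api


for pages in (5_000, 20_000, 100_000):
    browser, api = monthly_costs(pages)
    cheaper = 'API' if api < browser else 'browser'
    print(f'{pages:>7,} pages: browser ${browser:,.2f} vs API ${api:,.2f} -> {cheaper}')
print(f'Break-even: about {break_even_pages():,.0f} pages/month')
```

Under these figures the crossover falls near 98,000 pages per month. In practice a browser fleet would likely need more servers and proxy capacity at that volume, which would push the crossover higher, so the number is best read as a lower bound under the tutorial's assumptions.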