
How to Build a Data Pipeline Without CAPTCHA Problems

Replace web scrapers that keep hitting CAPTCHAs with structured API calls. Includes before/after comparisons of data extraction from Google, Amazon, and Reddit.


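Web scrapers break when they run into CAPTCHAs, IP bans, and rate limits. A structured API returns parsed JSON and keeps these failure modes out of your code, because the API provider handles browser rendering, proxy rotation, and CAPTCHA solving on its side. This tutorial migrates a CAPTCHA-plagued scraping pipeline to plain API calls and shows the before and after for Google, Amazon, and Reddit.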

Prerequisites

  • Python 3.8+
  • The requests library
  • A Scavio API key from scavio.dev
  • An existing scraping pipeline to migrate (optional)
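A quick environment check can catch a missing prerequisite before the first request fails. The sketch below is illustrative only: `check_prereqs` is a hypothetical helper name, and it assumes the key is stored in the `SCAVIO_API_KEY` environment variable, which is where the code in this tutorial reads it from.

```python
import importlib.util
import os
import sys


def check_prereqs(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `version` default to the real environment and interpreter,
    and can be overridden to try the function out.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < (3, 8):
        problems.append(f"Python 3.8+ required, found {version[0]}.{version[1]}")
    # find_spec returns None when the package cannot be imported
    if importlib.util.find_spec("requests") is None:
        problems.append("requests is not installed (pip install requests)")
    if not env.get("SCAVIO_API_KEY"):
        problems.append("SCAVIO_API_KEY is not set")
    return problems


for line in check_prereqs() or ["All prerequisites satisfied"]:
    print(line)
```

Running it once before Step 1 prints either the list of missing items or a single confirmation line.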

Walkthrough

Step 1: Compare the scraper and API approaches

This step shows the typical scraping failure modes and how an API call avoids them.

Python
import os
import time  # used by the reliability test in Step 4

import requests

API_KEY = os.environ['SCAVIO_API_KEY']
SH = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

# --- BEFORE: Scraper approach (common failure modes) ---
# def scrape_google(query):
#     try:
#         r = requests.get(f'https://www.google.com/search?q={query}',
#             headers={'User-Agent': 'Mozilla/5.0'})
#         if r.status_code == 429: raise Exception('Rate limited')
#         if 'captcha' in r.text.lower(): raise Exception('CAPTCHA triggered')
#         # Parse HTML... breaks when Google changes layout
#     except Exception as e:
#         print(f'Scraper failed: {e}')  # This happens constantly

# --- AFTER: API approach (no CAPTCHAs, no parsing) ---
def search(query, platform=None):
    """Return the organic results for a query; platform=None means Google."""
    body = {'query': query, 'country_code': 'us'}
    if platform:
        body['platform'] = platform
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json=body, timeout=30)
    resp.raise_for_status()  # fail loudly on auth or quota errors
    return resp.json().get('organic_results', [])

results = search('best python framework 2026')
print(f'API returned {len(results)} results. No CAPTCHA. No IP ban. No HTML parsing.')
for r in results[:3]:
    print(f'  {r["position"]}. {r["title"][:50]}')
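An API call can still fail for ordinary reasons such as timeouts or throttling, so a thin retry wrapper is a common companion to the `search` function above. This is a minimal sketch, not part of the Scavio documentation: `post_with_retry` is a hypothetical helper, and treating HTTP 429 and 5xx as transient is an assumption about how this API reports temporary failures.

```python
import time


def post_with_retry(post, url, *, headers, body, retries=3, backoff=1.0,
                    sleep=time.sleep):
    """POST `body` as JSON and return the decoded response.

    `post` is any callable with the calling convention of requests.post,
    which keeps the helper usable without network access in tests.
    Retries on HTTP 429 and 5xx (assumed transient), doubling the wait
    each time; the last failure is raised via raise_for_status().
    """
    for attempt in range(retries + 1):
        resp = post(url, headers=headers, json=body, timeout=30)
        transient = resp.status_code == 429 or resp.status_code >= 500
        if transient and attempt < retries:
            sleep(backoff * 2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()  # surface non-retryable or final errors
        return resp.json()


# Usage with the names defined in Step 1 (requires network access):
# data = post_with_retry(requests.post, 'https://api.scavio.dev/api/v1/search',
#                        headers=SH, body={'query': 'serp api', 'country_code': 'us'})
```

Passing the `post` callable in, rather than importing `requests` inside the helper, is a design choice that makes the retry logic easy to exercise with a stub.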

Step 2: Migrate Google data extraction

Replace Google scraping with structured API calls.

Python
def migrate_google_pipeline(queries):
    """Before: 50+ lines of scraping code, proxy rotation, CAPTCHA handling.
    After: 5 lines per query."""
    if not queries:  # avoid dividing by zero in the success rate below
        return []
    results = []
    failures = 0
    for query in queries:
        data = requests.post('https://api.scavio.dev/api/v1/search',
            headers=SH, json={'query': query, 'country_code': 'us'},
            timeout=30).json()
        organic = data.get('organic_results', [])
        if organic:
            results.append({'query': query, 'results': len(organic),
                'top': organic[0].get('title', '')[:50]})
        else:
            failures += 1  # an empty result set counts as a failure
    success_rate = (len(queries) - failures) / len(queries) * 100
    print('Migrated Google pipeline:')
    print(f'  Queries: {len(queries)} | Success: {success_rate:.0f}% | Failures: {failures}')
    print(f'  Cost: ${len(queries) * 0.005:.3f}')  # assumes $0.005 per request
    # The next two lines are fixed reference figures, not measurements.
    print('  CAPTCHA blocks: 0 (vs typical 5-15% with scrapers)')
    print('  Lines of code: ~5 per query (vs ~50 with scraping + parsing)')
    return results

queries = ['python web framework', 'serp api 2026', 'best code editor',
           'react vs vue', 'machine learning tutorial']
migrate_google_pipeline(queries)
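The success rate and cost printed by the pipeline above are simple arithmetic, and pulling them into a pure function makes them easy to check. The sketch below uses a hypothetical helper name, `pipeline_summary`; the $0.005 per-request price is the figure hard-coded in this tutorial's code and should be treated as an assumption to verify against current pricing.

```python
def pipeline_summary(total_queries, failures, price_per_request=0.005):
    """Return the success rate (percent) and total cost (USD) for a run.

    price_per_request defaults to the $0.005 used in the tutorial code,
    which is an assumption rather than a quoted price list.
    """
    if total_queries == 0:
        return {"success_rate": 0.0, "cost": 0.0}
    success_rate = (total_queries - failures) / total_queries * 100
    return {"success_rate": success_rate,
            "cost": total_queries * price_per_request}


summary = pipeline_summary(total_queries=5, failures=0)
print(f"Success: {summary['success_rate']:.0f}% | Cost: ${summary['cost']:.3f}")
```

For the five queries in this step with no failures, the inputs are 5 requests at $0.005 each, which is where the $0.025 in the sample output comes from.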

Step 3: Migrate Amazon and Reddit extraction

Replace platform-specific scrapers by passing the platform parameter to the same API.

Python
def migrate_amazon_pipeline(products):
    """Before: Selenium + CAPTCHA solver + proxy rotation for Amazon.
    After: Same API, platform='amazon'."""
    for product in products:
        data = requests.post('https://api.scavio.dev/api/v1/search',
            headers=SH, timeout=30,
            json={'query': product, 'platform': 'amazon', 'country_code': 'us'}).json()
        results = data.get('organic_results', [])[:3]
        # Price of the first result, if the response includes one
        top_price = results[0].get('price', 'N/A') if results else 'N/A'
        print(f'  Amazon: {product[:30]:30} | {len(results)} results | Top: {top_price}')

def migrate_reddit_pipeline(queries):
    """Before: Reddit rate limits + auth + JSON parsing.
    After: Same API, platform='reddit'."""
    for query in queries:
        data = requests.post('https://api.scavio.dev/api/v1/search',
            headers=SH, timeout=30,
            json={'query': query, 'platform': 'reddit', 'country_code': 'us'}).json()
        results = data.get('organic_results', [])[:3]  # keep the top 3 only
        print(f'  Reddit: {query[:30]:30} | {len(results)} discussions')

print('Amazon migration:')
migrate_amazon_pipeline(['wireless earbuds', 'laptop stand', 'usb hub'])
print('\nReddit migration:')
migrate_reddit_pipeline(['best serp api', 'python web scraping', 'api recommendation'])
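All three pipelines send the same request body and differ only in the `platform` field, so the body can be built in one place. This is a sketch under stated assumptions: `build_search_body` is a hypothetical helper, and the allow-list contains only the three platforms used in this tutorial, not a complete list of what the API accepts.

```python
# Platforms exercised in this tutorial; the API may accept others.
TUTORIAL_PLATFORMS = {"google", "amazon", "reddit"}


def build_search_body(query, platform="google", country_code="us"):
    """Build the JSON body for the search endpoint.

    Google is treated as the default and sent without a 'platform' field,
    mirroring how the tutorial code calls the API.
    """
    if platform not in TUTORIAL_PLATFORMS:
        raise ValueError(f"unsupported platform in this sketch: {platform!r}")
    body = {"query": query, "country_code": country_code}
    if platform != "google":
        body["platform"] = platform
    return body


for name in ("google", "amazon", "reddit"):
    print(name, build_search_body("wireless earbuds", name))
```

Keeping the body construction separate from the HTTP call means a new platform becomes a one-line change instead of a new pipeline function.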

Step 4: Compare reliability and cost

Run a reliability test and estimate the cost savings.

Python
def reliability_test(queries, platforms):
    total = 0
    success = 0
    start = time.time()
    for query in queries:
        for platform in platforms:
            total += 1
            body = {'query': query, 'country_code': 'us'}
            if platform != 'google': body['platform'] = platform
            try:
                r = requests.post('https://api.scavio.dev/api/v1/search',
                    headers=SH, json=body, timeout=30)
                if r.status_code == 200:
                    success += 1
            except requests.RequestException:
                pass  # network errors and timeouts count as failures
    if total == 0:  # nothing to report, and avoids dividing by zero
        print('No requests made.')
        return
    elapsed = time.time() - start
    cost = total * 0.005  # assumes $0.005 per request
    print('\n=== Pipeline Reliability Report ===')
    print(f'  Requests: {total} | Success: {success} ({success/total*100:.0f}%)')
    print(f'  Time: {elapsed:.1f}s | Avg: {elapsed/total:.2f}s per request')
    print(f'  Cost: ${cost:.3f}')
    # The lines below are fixed reference figures; only HTTP 200 is measured above.
    print('  CAPTCHA blocks: 0')
    print('  IP bans: 0')
    print('  HTML parsing errors: 0')
    print('\n  vs Scraper estimate:')
    print('  Typical scraper success rate: 80-90%')
    print('  Proxy cost: $10-50/month')
    print('  CAPTCHA solver: $1-3/1000 solves')
    print('  Maintenance: 2-5 hours/month fixing broken selectors')

reliability_test(['serp api', 'web scraping'], ['google', 'amazon', 'reddit'])
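The report above lists scraper cost ranges without combining them, so a rough monthly comparison can be worked out from those same figures. The sketch below is only an estimate under stated assumptions: the function names are hypothetical, every CAPTCHA-blocked request is assumed to need exactly one paid solve, the 5-15% block rate is the range quoted in Step 2, and the $50 hourly rate for maintenance is a placeholder you should replace with your own.

```python
def monthly_api_cost(requests_per_month, price_per_request=0.005):
    """API cost at a flat per-request price (assumed $0.005, as in the tutorial code)."""
    return requests_per_month * price_per_request


def monthly_scraper_cost(requests_per_month, proxy_cost, captcha_rate,
                         solver_price_per_1000, maintenance_hours=0.0,
                         hourly_rate=0.0):
    """Proxy fee + CAPTCHA solves + maintenance time, in USD per month.

    Assumes every CAPTCHA-blocked request needs exactly one paid solve.
    """
    solves = requests_per_month * captcha_rate
    solver_cost = solves / 1000 * solver_price_per_1000
    return proxy_cost + solver_cost + maintenance_hours * hourly_rate


# Example: 10,000 requests per month at the low and high ends of the quoted ranges.
# The $50 hourly rate is a placeholder, not a figure from this tutorial.
volume = 10_000
api = monthly_api_cost(volume)
low = monthly_scraper_cost(volume, 10, 0.05, 1, maintenance_hours=2, hourly_rate=50)
high = monthly_scraper_cost(volume, 50, 0.15, 3, maintenance_hours=5, hourly_rate=50)
print(f"API: ${api:.2f} | Scraper: ${low:.2f} to ${high:.2f}")
```

With maintenance time set to zero, the proxy and solver fees alone can come out close to or below the API cost at this volume, so the comparison depends heavily on how you value the hours spent fixing broken selectors.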

Python example

Python
import os
import requests

SH = {'x-api-key': os.environ['SCAVIO_API_KEY'], 'Content-Type': 'application/json'}

def pipeline(query, platform=None):
    """Run one search; platform=None means Google."""
    body = {'query': query, 'country_code': 'us'}
    if platform: body['platform'] = platform
    data = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json=body, timeout=30).json()
    results = data.get('organic_results', [])
    print(f'{platform or "google"}: {len(results)} results, 0 CAPTCHAs. Cost: $0.005')

for p in [None, 'amazon', 'reddit']:
    pipeline('wireless earbuds', p)

JavaScript example

JavaScript
// Run as an ES module (for example a .mjs file) on Node 18+:
// top-level await and the built-in fetch both depend on it.
const SH = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };
async function pipeline(query, platform) {
  const body = { query, country_code: 'us' };
  if (platform) body.platform = platform;
  const data = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: SH, body: JSON.stringify(body)
  }).then(r => r.json());
  console.log(`${platform || 'google'}: ${(data.organic_results || []).length} results, 0 CAPTCHAs`);
}
for (const p of [null, 'amazon', 'reddit']) await pipeline('wireless earbuds', p);

Expected output (abridged)

Text
API returned 10 results. No CAPTCHA. No IP ban. No HTML parsing.
  1. FastAPI - Modern Python Web Framework
  2. Django - The web framework for perfectionists

Migrated Google pipeline:
  Queries: 5 | Success: 100% | Failures: 0
  Cost: $0.025
  CAPTCHA blocks: 0
  Lines of code: ~5 per query (vs ~50 with scraping + parsing)

=== Pipeline Reliability Report ===
  Requests: 6 | Success: 6 (100%)
  CAPTCHA blocks: 0
  IP bans: 0


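FAQ

Most developers complete this tutorial in 15 to 30 minutes. You need a Scavio API key (the free tier is enough) and a working Python or JavaScript environment.

Prerequisites: Python 3.8+, the requests library, a Scavio API key from scavio.dev, and optionally an existing scraping pipeline to migrate. A Scavio API key comes with 50 free credits on signup.

The free tier is sufficient: the 50 signup credits are enough to complete this tutorial and build a working prototype.

Scavio provides a native LangChain package (langchain-scavio), an MCP server, and a REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to the framework of your choice.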

© 2026 Scavio. All rights reserved.
