Tutorial

How to Replace Browser Automation with a Structured API

Replace Playwright and Puppeteer with structured API calls for data extraction, and learn when an API works and when you still need browser automation.

Playwright and Puppeteer are powerful, but for data extraction from known platforms they are slow, expensive, and brittle. A structured API returns the same data, typically in under a second, without browser overhead, proxy costs, or CAPTCHA handling. This tutorial shows which use cases you can replace immediately and which still need browser automation, with honest tradeoffs.

Prerequisites

  • Python 3.8+
  • requests library
  • A Scavio API key from scavio.dev
  • Existing Playwright/Puppeteer code (optional)
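
Before starting the walkthrough, it can help to confirm that the API key is actually visible to Python. The following is a minimal sketch, assuming the key lives in the SCAVIO_API_KEY environment variable that the walkthrough code reads; the helper name missing_settings is hypothetical.

```python
import os

# Environment variable read by the walkthrough code.
REQUIRED_ENV_VARS = ("SCAVIO_API_KEY",)


def missing_settings(env=None):
    """Return the required environment variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Set these before continuing: " + ", ".join(missing))
    else:
        print("Environment looks ready.")
```

Passing a plain dict instead of os.environ makes the check easy to exercise without touching the real environment.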

Walkthrough

Step 1: Identify which browser automation to replace

Categorize your browser automation by what can move to API and what cannot.

Python
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']
SH = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

# CAN REPLACE with API:
replaceable = {
    'Google search scraping': 'Scavio search API (google platform)',
    'Amazon product scraping': 'Scavio search API (amazon platform)',
    'Reddit thread scraping': 'Scavio search API (reddit platform)',
    'YouTube search scraping': 'Scavio search API (youtube platform)',
    'Walmart product scraping': 'Scavio search API (walmart platform)',
    'TikTok profile scraping': 'Scavio TikTok API (profile endpoint)',
    'TikTok video data': 'Scavio TikTok API (user/posts endpoint)',
    'Google Maps data': 'Scavio search API (local_results field)',
}

# STILL NEED BROWSER:
need_browser = {
    'Custom web apps': 'No structured API for proprietary sites',
    'Login-required pages': 'API cannot authenticate to private accounts',
    'Interactive forms': 'Form submissions need browser context',
    'Screenshot capture': 'Visual rendering requires a browser',
    'Cookie-dependent flows': 'Session state needs browser persistence',
}

print('Replaceable with API:')
for task, api in replaceable.items():
    print(f'  {task:35} -> {api}')
print(f'\nStill needs browser ({len(need_browser)} cases):')
for task, reason in need_browser.items():
    print(f'  {task:35} | {reason}')
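
The two lists above can also be turned into a per-job check. This is a rough sketch, not part of any Scavio tooling: the ScrapeJob fields and the browser_reasons helper are hypothetical names, and the platform set simply mirrors the platforms named in the replaceable list.

```python
from dataclasses import dataclass

# Platforms the replaceable list above names as covered by an API.
API_PLATFORMS = {"google", "amazon", "reddit", "youtube", "walmart", "tiktok"}


@dataclass
class ScrapeJob:
    platform: str                    # e.g. "amazon", or any other name for a proprietary site
    requires_login: bool = False     # page sits behind a private account
    submits_forms: bool = False      # job fills in and submits interactive forms
    needs_screenshot: bool = False   # job needs rendered pixels, not data
    depends_on_cookies: bool = False # job relies on persistent session state


def browser_reasons(job):
    """Return the reasons a job must stay on a browser; an empty list means API candidate."""
    reasons = []
    if job.platform.lower() not in API_PLATFORMS:
        reasons.append("no structured API for this site")
    if job.requires_login:
        reasons.append("login-required pages")
    if job.submits_forms:
        reasons.append("interactive forms")
    if job.needs_screenshot:
        reasons.append("screenshot capture")
    if job.depends_on_cookies:
        reasons.append("cookie-dependent flows")
    return reasons
```

For example, browser_reasons(ScrapeJob("amazon")) returns an empty list, while a Reddit job with requires_login=True keeps one reason and stays on the browser.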

Step 2: Side-by-side code comparison

Compare Playwright browser code vs API calls for common tasks.

Python
# BEFORE: Playwright Google scraping (~20 lines, 3-5 seconds)
# from playwright.async_api import async_playwright
# async def scrape_google(query):
#     async with async_playwright() as p:
#         browser = await p.chromium.launch(headless=True)
#         page = await browser.new_page()
#         await page.goto(f'https://www.google.com/search?q={query}')
#         await page.wait_for_selector('div.g')
#         results = await page.query_selector_all('div.g')
#         data = []
#         for r in results[:10]:
#             title = await r.query_selector('h3')
#             link = await r.query_selector('a')
#             data.append({'title': await title.inner_text() if title else '',
#                          'link': await link.get_attribute('href') if link else ''})
#         await browser.close()
#         return data  # Takes 3-5 seconds, breaks on layout changes

# AFTER: API call (a few lines, <1 second)
def search_google(query):
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': query, 'country_code': 'us'}, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return resp.json().get('organic_results', [])

import time
start = time.perf_counter()  # monotonic clock, better suited to measuring elapsed time
results = search_google('python web framework 2026')
elapsed = time.perf_counter() - start
print(f'API: {len(results)} results in {elapsed:.2f}s')
print('vs Playwright: ~3-5 seconds + browser memory + proxy cost')
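
A network call can still fail or time out, so production code usually wraps it. The sketch below adds a timeout and a simple exponential backoff around the same endpoint and request body used above. The retry count, the backoff values, and the choice to retry on any exception are assumptions, not documented Scavio behavior; the post parameter is a hypothetical hook that lets a stand-in replace requests.post in tests.

```python
import os
import time

SEARCH_URL = "https://api.scavio.dev/api/v1/search"  # endpoint used in this tutorial


def search(query, platform=None, post=None, retries=2, backoff=0.5, timeout=10):
    """Return organic_results for a query, retrying failed attempts with backoff."""
    if post is None:
        import requests  # imported lazily so a stand-in can be injected in tests
        post = requests.post
    headers = {
        "x-api-key": os.environ.get("SCAVIO_API_KEY", ""),
        "Content-Type": "application/json",
    }
    body = {"query": query, "country_code": "us"}
    if platform:
        body["platform"] = platform

    last_error = None
    for attempt in range(retries + 1):
        try:
            resp = post(SEARCH_URL, headers=headers, json=body, timeout=timeout)
            resp.raise_for_status()
            return resp.json().get("organic_results", [])
        except Exception as exc:  # deliberately broad in this sketch
            last_error = exc
            if attempt < retries:
                time.sleep(backoff * (2 ** attempt))  # 0.5s, 1s, ... by default
    raise RuntimeError(f"search failed after {retries + 1} attempts") from last_error
```

In real use you would likely narrow the except clause to network and HTTP errors and avoid retrying client errors such as an invalid key.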

Step 3: Migrate a real scraping pipeline

Step-by-step migration of a multi-page scraper to API calls.

Python
def migrate_pipeline():
    """Migrate a typical multi-page scraping pipeline to API."""
    # Step 1: Replace search scraping
    google_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'country_code': 'us'}).json()
    print(f'Google: {len(google_results.get("organic_results", []))} results')

    # Step 2: Replace Amazon scraping
    amazon_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'platform': 'amazon', 'country_code': 'us'}).json()
    print(f'Amazon: {len(amazon_results.get("organic_results", []))} products')

    # Step 3: Replace Reddit scraping
    reddit_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds review', 'platform': 'reddit', 'country_code': 'us'}).json()
    print(f'Reddit: {len(reddit_results.get("organic_results", []))} discussions')

    # Step 4: Replace page content extraction
    if google_results.get('organic_results'):
        url = google_results['organic_results'][0].get('link', '')
        if url:
            extract = requests.post('https://api.scavio.dev/api/v1/extract',
                headers=SH, json={'url': url}).json()
            print(f'Extract: {len(str(extract.get("content", "")))} chars from {url[:40]}')

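    # Figures below are estimates, not measurements: they assume $0.005 per call
    # (the rate used in Step 4) and that all four calls above actually ran.
    print(f'\nTotal cost: $0.020 (4 API calls)')
    print(f'Total time: <2 seconds')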
    print(f'Browser instances: 0')
    print(f'Proxy cost: $0')
    print(f'CAPTCHA blocks: 0')

migrate_pipeline()
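
During a migration it is often safer to keep the old scraper as a fallback rather than delete it on day one. The following sketch shows one way to do that, under the assumption that both the API path and the browser path can be wrapped as functions taking a query and returning a list. The name fetch_with_fallback and the min_results threshold are hypothetical.

```python
import logging

logger = logging.getLogger("scrape_fallback")


def fetch_with_fallback(query, api_fetch, browser_fetch, min_results=1):
    """Try the API first; use the browser scraper if the API fails or returns too little.

    Returns a (source, results) tuple where source is "api" or "browser".
    """
    try:
        results = api_fetch(query)
    except Exception as exc:
        logger.warning("API fetch failed for %r: %s", query, exc)
    else:
        if len(results) >= min_results:
            return "api", results
        logger.info("API returned %d results for %r; falling back", len(results), query)
    return "browser", browser_fetch(query)
```

Logging which source served each request gives you a running count of how often the browser is still needed, which is useful evidence when deciding whether to retire it.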

Step 4: Compare cost and performance

Calculate total cost of ownership for browser vs API approaches.

Python
def tco_comparison(monthly_pages):
    print(f'\n=== Total Cost of Ownership ({monthly_pages:,} pages/month) ===')
    # Playwright/Puppeteer costs (illustrative assumptions; substitute your own figures)
    browser_server = 50  # Cloud server for browsers
    proxy = 30  # Proxy service
    captcha = monthly_pages * 0.05 * 0.002  # 5% CAPTCHA rate, $0.002/solve
    maintenance = 8 * 50  # 8 hours/month @ $50/hr fixing selectors
    browser_total = browser_server + proxy + captcha + maintenance
    print(f'\n  BROWSER AUTOMATION:')
    print(f'    Server (headless Chrome): ${browser_server}/mo')
    print(f'    Proxy service: ${proxy}/mo')
    print(f'    CAPTCHA solving (~5%): ${captcha:.2f}/mo')
    print(f'    Maintenance (selector fixes): ${maintenance}/mo')
    print(f'    Total: ${browser_total:.2f}/mo')
    # API costs
    api_cost = monthly_pages * 0.005
    print(f'\n  STRUCTURED API:')
    print(f'    Scavio API: ${api_cost:.2f}/mo ({monthly_pages:,} x $0.005)')
    print(f'    Server: $0 (runs anywhere)')
    print(f'    Proxy: $0 (not needed)')
    print(f'    CAPTCHA: $0 (not needed)')
    print(f'    Maintenance: ~$0 (stable JSON)')
    print(f'    Total: ${api_cost:.2f}/mo')
    savings = browser_total - api_cost
    print(f'\n  SAVINGS: ${savings:.2f}/mo ({savings/browser_total*100:.0f}%)')
    print(f'  SPEED: ~0.5s/request (API) vs ~3-5s/page (browser)')
    print(f'  RELIABILITY: 99%+ (API) vs 85-95% (browser)')

tco_comparison(5000)
tco_comparison(20000)
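
In this cost model the browser setup is mostly fixed cost ($480 for server, proxy, and maintenance) while the API cost grows linearly with volume, so the savings shrink as volume rises. The sketch below solves for the monthly volume at which the two totals are equal, using only the figures assumed in tco_comparison; the function name break_even_pages is hypothetical. With those figures the break-even is 480 / (0.005 - 0.0001), roughly 98,000 pages per month.

```python
def break_even_pages(fixed_browser_cost=480.0,
                     captcha_cost_per_page=0.05 * 0.002,
                     api_cost_per_page=0.005):
    """Monthly page volume at which API spend equals browser spend.

    Defaults mirror tco_comparison: $50 server + $30 proxy + $400 maintenance,
    a 5% CAPTCHA rate at $0.002 per solve, and $0.005 per API call.
    Returns None when the API is never more expensive under the given inputs.
    """
    marginal_gap = api_cost_per_page - captcha_cost_per_page
    if marginal_gap <= 0:
        return None
    return fixed_browser_cost / marginal_gap


if __name__ == "__main__":
    pages = break_even_pages()
    print(f"Break-even at about {pages:,.0f} pages/month")
```

Above that volume the browser approach is cheaper on paper under these assumptions, although the model keeps server, proxy, and maintenance costs flat, which is unlikely to hold as volume grows.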

Python Example

Python
import os, requests, time
SH = {'x-api-key': os.environ['SCAVIO_API_KEY'], 'Content-Type': 'application/json'}

# Replace Playwright/Puppeteer with:
start = time.time()
for platform in [None, 'amazon', 'reddit']:
    body = {'query': 'wireless earbuds', 'country_code': 'us'}
    if platform: body['platform'] = platform
    data = requests.post('https://api.scavio.dev/api/v1/search', headers=SH, json=body).json()
    print(f'{platform or "google"}: {len(data.get("organic_results", []))} results')
print(f'Time: {time.time()-start:.2f}s | Cost: $0.015 | Browser: none')
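
The loop above runs the three searches one after another. Because each call spends most of its time waiting on the network, they can also run concurrently. This is a sketch using the standard library's ThreadPoolExecutor; search_all and its fetch parameter are hypothetical names, and whether the API's rate limits allow concurrent requests on your plan is an assumption you should verify.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(query, platforms, fetch, max_workers=3):
    """Call fetch(query, platform) for each platform concurrently.

    Returns {label: results}; a platform of None is labeled "google",
    matching the default platform in the loop above.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {p: pool.submit(fetch, query, p) for p in platforms}
        # result() re-raises any exception raised inside fetch
        return {(p or "google"): f.result() for p, f in futures.items()}
```

Here fetch would be a small function that posts the same body as the loop above and returns the organic_results list.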

JavaScript Example

JavaScript
// Run as an ES module on Node 18+ (e.g. a .mjs file): uses top-level await and built-in fetch
const SH = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };
// Replace Puppeteer with:
const start = Date.now();
for (const platform of [null, 'amazon', 'reddit']) {
  const body = { query: 'wireless earbuds', country_code: 'us' };
  if (platform) body.platform = platform;
  const data = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: SH, body: JSON.stringify(body)
  }).then(r => r.json());
  console.log(`${platform || 'google'}: ${(data.organic_results || []).length} results`);
}
console.log(`Time: ${(Date.now()-start)/1000}s | Cost: $0.015 | Browser: none`);

Expected Output

Text (abridged)
Replaceable with API:
  Google search scraping              -> Scavio search API (google platform)
  Amazon product scraping             -> Scavio search API (amazon platform)
  Reddit thread scraping              -> Scavio search API (reddit platform)

Still needs browser (5 cases):
  Custom web apps                     | No structured API for proprietary sites
  Login-required pages                | API cannot authenticate to private accounts

API: 10 results in 0.45s
vs Playwright: ~3-5 seconds + browser memory + proxy cost

=== Total Cost of Ownership (5,000 pages/month) ===
  BROWSER AUTOMATION: $480.50/mo
  STRUCTURED API: $25.00/mo
  SAVINGS: $455.50/mo (95%)

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What do I need before starting?

You need Python 3.8+, the requests library, and a Scavio API key from scavio.dev. Existing Playwright/Puppeteer code is optional. A Scavio API key gives you 250 free credits per month.

Can I complete this tutorial on the free tier?

Yes. The free tier includes 250 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio work with my framework?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.
