Playwright and Puppeteer are powerful, but for extracting data from known platforms they are slow, expensive, and brittle. A structured API returns the same data in well under a second, with no browser overhead, proxy costs, or CAPTCHA handling. This tutorial shows which use cases you can move to an API immediately and which still need browser automation, along with the tradeoffs of each.
Prerequisites
- Python 3.8+
- requests library
- A Scavio API key from scavio.dev
- Existing Playwright/Puppeteer code (optional)
Walkthrough
Step 1: Identify which browser automation to replace
Categorize your browser automation by what can move to API and what cannot.
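As a starting point for that categorization, an existing scraper's target URLs can be sorted by hostname. The sketch below is a minimal illustration, not part of the Scavio API: the `API_PLATFORMS` mapping and the `classify_target` helper are hypothetical names, and the domain list is an assumption drawn from the platforms this tutorial names as replaceable.

```python
from urllib.parse import urlparse

# Hypothetical hostname -> platform mapping, based on the platforms this
# tutorial lists as replaceable. Adjust it to the sites your scrapers target.
API_PLATFORMS = {
    'google.com': 'google',
    'amazon.com': 'amazon',
    'reddit.com': 'reddit',
    'youtube.com': 'youtube',
    'walmart.com': 'walmart',
    'tiktok.com': 'tiktok',
}

def classify_target(url):
    """Return the platform name if the URL's host is covered, else 'browser'."""
    host = (urlparse(url).hostname or '').lower()
    for domain, platform in API_PLATFORMS.items():
        # Match the bare domain and any subdomain (www., old., m., ...)
        if host == domain or host.endswith('.' + domain):
            return platform
    return 'browser'

targets = [
    'https://www.amazon.com/s?k=wireless+earbuds',
    'https://old.reddit.com/r/headphones/',
    'https://intranet.example.com/dashboard',
]
for url in targets:
    print(f'{classify_target(url):8} <- {url}')
```

A hostname match is only a first pass. Pages on these hosts that require a login, a form submission, or a screenshot still belong in the browser bucket.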
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']
SH = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

# CAN REPLACE with API:
replaceable = {
    'Google search scraping': 'Scavio search API (google platform)',
    'Amazon product scraping': 'Scavio search API (amazon platform)',
    'Reddit thread scraping': 'Scavio search API (reddit platform)',
    'YouTube search scraping': 'Scavio search API (youtube platform)',
    'Walmart product scraping': 'Scavio search API (walmart platform)',
    'TikTok profile scraping': 'Scavio TikTok API (profile endpoint)',
    'TikTok video data': 'Scavio TikTok API (user/posts endpoint)',
    'Google Maps data': 'Scavio search API (local_results field)',
}

# STILL NEED BROWSER:
need_browser = {
    'Custom web apps': 'No structured API for proprietary sites',
    'Login-required pages': 'API cannot authenticate to private accounts',
    'Interactive forms': 'Form submissions need browser context',
    'Screenshot capture': 'Visual rendering requires a browser',
    'Cookie-dependent flows': 'Session state needs browser persistence',
}

print('Replaceable with API:')
for task, api in replaceable.items():
    print(f' {task:35} -> {api}')
print(f'\nStill needs browser ({len(need_browser)} cases):')
for task, reason in need_browser.items():
    print(f' {task:35} | {reason}')

Step 2: Side-by-side code comparison
Compare Playwright browser code vs API calls for common tasks.
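A single timed run can be skewed by a cold connection or a slow DNS lookup, so a fair comparison times each approach several times and reports the median. The helper below is a minimal sketch of that idea. `time_call` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in workload so the example runs without network access.

```python
import statistics
import time

def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn repeatedly and return (median_seconds, last_result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result

# Stand-in workload so the sketch runs without network access
def fake_fetch(query):
    time.sleep(0.01)
    return [query.upper()]

median_s, result = time_call(fake_fetch, 'python web framework', repeats=3)
print(f'median: {median_s:.3f}s, result: {result}')
```

To time a real implementation, pass it in place of `fake_fetch`. Each repeat of an API call is a billable request, so keep `repeats` small when measuring against the live endpoint.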
# BEFORE: Playwright Google scraping (~20 lines, 3-5 seconds)
# from playwright.async_api import async_playwright
# async def scrape_google(query):
#     async with async_playwright() as p:
#         browser = await p.chromium.launch(headless=True)
#         page = await browser.new_page()
#         await page.goto(f'https://www.google.com/search?q={query}')
#         await page.wait_for_selector('div.g')
#         results = await page.query_selector_all('div.g')
#         data = []
#         for r in results[:10]:
#             title = await r.query_selector('h3')
#             link = await r.query_selector('a')
#             data.append({'title': await title.inner_text() if title else '',
#                          'link': await link.get_attribute('href') if link else ''})
#         await browser.close()
#         return data  # Takes 3-5 seconds, breaks on layout changes

# AFTER: API call (~3 lines, <1 second)
def search_google(query):
    data = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': query, 'country_code': 'us'}).json()
    return data.get('organic_results', [])

import time
start = time.time()
results = search_google('python web framework 2026')
elapsed = time.time() - start
print(f'API: {len(results)} results in {elapsed:.2f}s')
print(f'vs Playwright: ~3-5 seconds + browser memory + proxy cost')

Step 3: Migrate a real scraping pipeline
Step-by-step migration of a multi-page scraper to API calls.
def migrate_pipeline():
    """Migrate a typical multi-page scraping pipeline to API."""
    # Step 1: Replace search scraping
    google_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'country_code': 'us'}).json()
    print(f'Google: {len(google_results.get("organic_results", []))} results')
    # Step 2: Replace Amazon scraping
    amazon_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds', 'platform': 'amazon', 'country_code': 'us'}).json()
    print(f'Amazon: {len(amazon_results.get("organic_results", []))} products')
    # Step 3: Replace Reddit scraping
    reddit_results = requests.post('https://api.scavio.dev/api/v1/search',
        headers=SH, json={'query': 'wireless earbuds review', 'platform': 'reddit', 'country_code': 'us'}).json()
    print(f'Reddit: {len(reddit_results.get("organic_results", []))} discussions')
    # Step 4: Replace page content extraction
    if google_results.get('organic_results'):
        url = google_results['organic_results'][0].get('link', '')
        if url:
            extract = requests.post('https://api.scavio.dev/api/v1/extract',
                headers=SH, json={'url': url}).json()
            print(f'Extract: {len(str(extract.get("content", "")))} chars from {url[:40]}')
    print(f'\nTotal cost: $0.020 (4 API calls)')
    print(f'Total time: <2 seconds')
    print(f'Browser instances: 0')
    print(f'Proxy cost: $0')
    print(f'CAPTCHA blocks: 0')

migrate_pipeline()

Step 4: Compare cost and performance
Calculate total cost of ownership for browser vs API approaches.
def tco_comparison(monthly_pages):
    print(f'\n=== Total Cost of Ownership ({monthly_pages:,} pages/month) ===')
    # Playwright/Puppeteer costs
    browser_server = 50  # Cloud server for browsers
    proxy = 30  # Proxy service
    captcha = monthly_pages * 0.05 * 0.002  # 5% CAPTCHA rate, $0.002/solve
    maintenance = 8 * 50  # 8 hours/month @ $50/hr fixing selectors
    browser_total = browser_server + proxy + captcha + maintenance
    print(f'\n BROWSER AUTOMATION:')
    print(f' Server (headless Chrome): ${browser_server}/mo')
    print(f' Proxy service: ${proxy}/mo')
    print(f' CAPTCHA solving (~5%): ${captcha:.2f}/mo')
    print(f' Maintenance (selector fixes): ${maintenance}/mo')
    print(f' Total: ${browser_total:.2f}/mo')
    # API costs
    api_cost = monthly_pages * 0.005
    print(f'\n STRUCTURED API:')
    print(f' Scavio API: ${api_cost:.2f}/mo ({monthly_pages:,} x $0.005)')
    print(f' Server: $0 (runs anywhere)')
    print(f' Proxy: $0 (not needed)')
    print(f' CAPTCHA: $0 (not needed)')
    print(f' Maintenance: ~$0 (stable JSON)')
    print(f' Total: ${api_cost:.2f}/mo')
    savings = browser_total - api_cost
    print(f'\n SAVINGS: ${savings:.2f}/mo ({savings/browser_total*100:.0f}%)')
    print(f' SPEED: ~0.5s/request (API) vs ~3-5s/page (browser)')
    print(f' RELIABILITY: 99%+ (API) vs 85-95% (browser)')

tco_comparison(5000)
tco_comparison(20000)

Python Example
import os, requests, time

SH = {'x-api-key': os.environ['SCAVIO_API_KEY'], 'Content-Type': 'application/json'}

# Replace Playwright/Puppeteer with:
start = time.time()
for platform in [None, 'amazon', 'reddit']:
    body = {'query': 'wireless earbuds', 'country_code': 'us'}
    if platform: body['platform'] = platform
    data = requests.post('https://api.scavio.dev/api/v1/search', headers=SH, json=body).json()
    print(f'{platform or "google"}: {len(data.get("organic_results", []))} results')
print(f'Time: {time.time()-start:.2f}s | Cost: $0.015 | Browser: none')

JavaScript Example
// Uses top-level await and the global fetch: run as an ES module (e.g. a .mjs file) on Node 18+
const SH = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };

// Replace Puppeteer with:
const start = Date.now();
for (const platform of [null, 'amazon', 'reddit']) {
  const body = { query: 'wireless earbuds', country_code: 'us' };
  if (platform) body.platform = platform;
  const data = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: SH, body: JSON.stringify(body)
  }).then(r => r.json());
  console.log(`${platform || 'google'}: ${(data.organic_results || []).length} results`);
}
console.log(`Time: ${(Date.now()-start)/1000}s | Cost: $0.015 | Browser: none`);

Expected Output
Replaceable with API:
Google search scraping -> Scavio search API (google platform)
Amazon product scraping -> Scavio search API (amazon platform)
Reddit thread scraping -> Scavio search API (reddit platform)
Still needs browser (5 cases):
Custom web apps | No structured API for proprietary sites
Login-required pages | API cannot authenticate to private accounts
API: 10 results in 0.45s
vs Playwright: ~3-5 seconds + browser memory + proxy cost
=== Total Cost of Ownership (5,000 pages/month) ===
BROWSER AUTOMATION: $480.50/mo
STRUCTURED API: $25.00/mo
SAVINGS: $455.50/mo (95%)
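The savings above depend on volume: browser automation is mostly a fixed monthly cost, while the API is billed per request. A rough break-even point can be sketched from the same figures used in Step 4 ($50 server + $30 proxy + $400 maintenance, a 5% CAPTCHA rate at $0.002 per solve, and $0.005 per API call). The `break_even_pages` helper is a hypothetical name, and the model assumes the fixed costs do not grow with volume and the per-call price has no tiers, neither of which is guaranteed in practice.

```python
def break_even_pages(fixed_browser_cost=480.0,
                     captcha_cost_per_page=0.05 * 0.002,
                     api_cost_per_page=0.005):
    """Monthly page volume at which API spend equals browser spend.

    Defaults are the Step 4 figures: $50 + $30 + $400 fixed,
    5% CAPTCHA rate at $0.002/solve, $0.005 per API call.
    """
    per_page_gap = api_cost_per_page - captcha_cost_per_page
    if per_page_gap <= 0:
        # The API is not more expensive per page, so it stays cheaper at any volume
        return float('inf')
    return fixed_browser_cost / per_page_gap

pages = break_even_pages()
print(f'Break-even: about {pages:,.0f} pages/month')
```

Under these assumptions the two approaches cost the same at roughly 98,000 pages per month. Below that volume the API is cheaper. Above it, the comparison depends on how server, proxy, and maintenance costs scale for your workload.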