Overview
This pipeline implements an API-first approach to data collection: try Scavio structured API for every query first, and only fall back to browser automation (Selenium, Playwright) for the rare cases where structured data is insufficient. Most teams find that 90%+ of their scraping jobs can be replaced by API calls, dramatically reducing infrastructure costs and maintenance. The fallback path ensures complete coverage while minimizing browser automation usage.
Trigger
On-demand or scheduled per data collection batch
Schedule
On-demand per data collection batch
Workflow Steps
Load data collection tasks
Read the batch of URLs or queries that need data collection from the task queue.
Attempt API-first collection
For each task, try to fulfill it via Scavio structured API (Google, Amazon, etc.).
Evaluate API result completeness
Check if the API result contains all required fields. Mark tasks as complete or needing fallback.
Run browser fallback for remaining tasks
For tasks the API could not fully satisfy, fall back to browser automation as a last resort.
Log API vs browser ratio
Record how many tasks used API vs browser to track migration progress over time.
Python Implementation
import requests
import json
from pathlib import Path
from datetime import datetime
API_KEY = "your_scavio_api_key"
TASKS = [
{"id": "t1", "query": "best CRM software 2026", "platform": "google", "required_fields": ["title", "link", "snippet"]},
{"id": "t2", "query": "wireless mouse", "platform": "amazon", "required_fields": ["title", "price", "rating"]},
{"id": "t3", "query": "python tutorial 2026", "platform": "google", "required_fields": ["title", "link"]},
]
def try_api(query: str, platform: str, required: list[str]) -> dict | None:
res = requests.post(
"https://api.scavio.dev/api/v1/search",
headers={"x-api-key": API_KEY},
json={"platform": platform, "query": query},
timeout=15,
)
res.raise_for_status()
results = res.json().get("organic", [])
if not results:
return None
top = results[0]
# Check all required fields are present
if all(top.get(f) for f in required):
return {"method": "api", "data": {f: top[f] for f in required}}
return None
def browser_fallback(query: str) -> dict:
"""Placeholder for Selenium/Playwright fallback."""
return {"method": "browser", "data": {"note": "browser automation would run here"}}
def run():
date = datetime.utcnow().strftime("%Y-%m-%d")
results = []
api_count = 0
browser_count = 0
for task in TASKS:
api_result = try_api(task["query"], task["platform"], task["required_fields"])
if api_result:
results.append({"task_id": task["id"], **api_result})
api_count += 1
else:
fallback = browser_fallback(task["query"])
results.append({"task_id": task["id"], **fallback})
browser_count += 1
total = api_count + browser_count
api_pct = round(api_count / total * 100) if total else 0
output = {"date": date, "total_tasks": total, "api_fulfilled": api_count, "browser_fallback": browser_count, "api_percentage": api_pct, "results": results}
Path(f"collection_batch_{date}.json").write_text(json.dumps(output, indent=2))
print(f"Batch {date}: {api_count}/{total} via API ({api_pct}%), {browser_count} fallback to browser")
if __name__ == "__main__":
run()JavaScript Implementation
const API_KEY = "your_scavio_api_key";
const TASKS = [
{ id: "t1", query: "best CRM software 2026", platform: "google", required: ["title", "link"] },
{ id: "t2", query: "wireless mouse", platform: "amazon", required: ["title", "price"] },
];
async function tryApi(query, platform, required) {
const res = await fetch("https://api.scavio.dev/api/v1/search", {
method: "POST",
headers: { "x-api-key": API_KEY, "content-type": "application/json" },
body: JSON.stringify({ platform, query }),
});
if (!res.ok) return null;
const top = ((await res.json()).organic ?? [])[0];
if (!top || !required.every((f) => top[f])) return null;
return { method: "api", data: Object.fromEntries(required.map((f) => [f, top[f]])) };
}
let apiCount = 0;
for (const task of TASKS) {
const result = await tryApi(task.query, task.platform, task.required);
if (result) apiCount++;
}
console.log(`${apiCount}/${TASKS.length} tasks fulfilled via API (${Math.round(apiCount / TASKS.length * 100)}%)`);Platforms Used
Web search with knowledge graph, PAA, and AI overviews