Tutorial

How to Fetch Web Search Data Without Managing Proxies

Learn why the Scavio managed API eliminates proxy management, CAPTCHA solving, and IP rotation for web search data. Simple HTTP POST, structured JSON output.

Building a direct web scraper requires managing a rotating proxy pool, solving CAPTCHAs, handling JavaScript rendering, and parsing raw HTML — all of which require significant engineering effort and ongoing maintenance. The Scavio API is a managed search data service that handles all of this infrastructure server-side. You make a single authenticated HTTP POST and receive structured JSON. This tutorial compares the traditional proxy-based approach to the Scavio approach and shows how to migrate from a scraper to the API.

Prerequisites

  • Python 3.8 or higher
  • requests library installed
  • A Scavio API key
  • Basic understanding of HTTP requests

Walkthrough

Step 1: The traditional scraping approach (before)

A typical scraper requires proxy configuration, user-agent rotation, and HTML parsing — all prone to breaking when sites change.

Python
# Traditional approach — fragile and requires proxy infrastructure
import requests
from bs4 import BeautifulSoup

proxies = {"http": "http://user:pass@proxy:8080", "https": "http://user:pass@proxy:8080"}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)..."}
response = requests.get("https://www.google.com/search?q=python+tutorial",
                        proxies=proxies, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# Fragile: class names change without notice
results = soup.find_all("div", class_="tF2Cxc")

Step 2: The Scavio approach (after)

Replace the scraper with a single API call. No proxies, no HTML parsing, no maintenance.

Python
# Scavio approach — stable, structured, no infrastructure
import requests

response = requests.post(
    "https://api.scavio.dev/api/v1/search",
    headers={"x-api-key": "your_scavio_api_key"},
    json={"query": "python tutorial", "country_code": "us"}
)
results = response.json()["organic_results"]

Step 3: Handle retries with exponential backoff

The Scavio API is reliable, but transient network errors can still occur on the client side, so wrap the call in a simple retry with exponential backoff.

Python
import time
import requests

API_KEY = "your_scavio_api_key"
ENDPOINT = "https://api.scavio.dev/api/v1/search"

def search_with_retry(query: str, max_retries: int = 3) -> dict:
    for attempt in range(max_retries):
        try:
            r = requests.post(ENDPOINT, headers={"x-api-key": API_KEY},
                              json={"query": query, "country_code": "us"}, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)
    return {}
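As an alternative to a hand-rolled loop, the same backoff behavior can be delegated to the transport layer via urllib3's Retry class. This is a sketch, not the tutorial's canonical pattern — in particular, the list of retried status codes is an assumption about which Scavio errors are transient:

```python
# Alternative retry strategy: let requests' transport adapter handle
# backoff. The status_forcelist values are an assumption about which
# Scavio errors are worth retrying.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(total: int = 3) -> requests.Session:
    retry = Retry(
        total=total,
        backoff_factor=1,                        # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["POST"],                # POST is not retried by default
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def search(session: requests.Session, query: str) -> dict:
    r = session.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": "your_scavio_api_key"},
        json={"query": query, "country_code": "us"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()
```

Note that `allowed_methods` must include "POST" explicitly: urllib3 only retries idempotent methods by default, and only retry POST when the request is safe to repeat, as a search query is here.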

Step 4: Validate the response schema

Add a simple schema check to ensure the response contains expected fields before processing.

Python
def validate_response(data: dict) -> bool:
    required = ["organic_results"]
    return all(k in data for k in required)

data = search_with_retry("python tutorial")
if validate_response(data):
    for r in data["organic_results"][:5]:
        print(r["title"], r["link"])
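The check above only confirms the top-level key exists. A slightly stricter sketch also verifies that each result carries the fields the print loop relies on (field names follow the sample output shown later in this tutorial):

```python
# Stricter validation: confirm each organic result is a dict with the
# fields the processing loop uses ("title" and "link").
def validate_results(data: dict) -> bool:
    results = data.get("organic_results")
    if not isinstance(results, list):
        return False
    return all(
        isinstance(r, dict) and "title" in r and "link" in r
        for r in results
    )
```

An empty organic_results list still validates here; whether zero results counts as an error is left to the caller.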

Python Example

Python
import os
import time
import requests

API_KEY = os.environ.get("SCAVIO_API_KEY", "your_scavio_api_key")
ENDPOINT = "https://api.scavio.dev/api/v1/search"

def search(query: str, retries: int = 3) -> dict:
    for i in range(retries):
        try:
            r = requests.post(ENDPOINT, headers={"x-api-key": API_KEY},
                              json={"query": query, "country_code": "us"}, timeout=30)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            if i < retries - 1:
                time.sleep(2 ** i)
            else:
                raise
    return {}

if __name__ == "__main__":
    # No proxies, no HTML parsing, no CAPTCHA solving
    data = search("python tutorial")
    for r in data.get("organic_results", [])[:5]:
        print(f"{r['title']}\n{r['link']}\n")
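The retry loop above backs off blindly on every failure. If the API rate-limits a request with HTTP 429, honoring a Retry-After header wastes less time — the header is standard HTTP, though whether Scavio sends it is an assumption, so the sketch falls back to exponential backoff when it is absent:

```python
# Sleep calculation that prefers the server's Retry-After header (if
# present) over exponential backoff. Assumes Scavio may send the
# header on 429; falls back to 2**attempt seconds otherwise.
import requests

def wait_time(response: requests.Response, attempt: int) -> float:
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(retry_after)
        except ValueError:
            pass  # HTTP-date form of Retry-After not handled in this sketch
    return float(2 ** attempt)
```

Inside the except branch of search(), you would call time.sleep(wait_time(r, i)) instead of time.sleep(2 ** i) whenever a response object is available.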

JavaScript Example

JavaScript
const API_KEY = process.env.SCAVIO_API_KEY || "your_scavio_api_key";
const ENDPOINT = "https://api.scavio.dev/api/v1/search";

async function search(query, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      const res = await fetch(ENDPOINT, {
        method: "POST",
        headers: { "x-api-key": API_KEY, "Content-Type": "application/json" },
        body: JSON.stringify({ query, country_code: "us" })
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.json();  // await so a failed body parse is also retried
    } catch (e) {
      if (i === retries - 1) throw e;
      await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
    }
  }
}

// No proxies, no HTML parsing
search("python tutorial").then(data => {
  (data.organic_results || []).slice(0, 5).forEach(r => console.log(`${r.title}\n${r.link}\n`));
}).catch(console.error);

Expected Output

Traditional approach: 47 lines, 3 dependencies, breaks whenever the target site's markup changes
Scavio approach: 8 lines, 1 dependency, stable

Sample output:

JSON
{
  "organic_results": [
    { "position": 1, "title": "Python Tutorial — W3Schools", "link": "https://w3schools.com/python/" },
    { "position": 2, "title": "The Python Tutorial — Python Docs", "link": "https://docs.python.org/tutorial/" }
  ]
}

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (the free tier works) and a working Python or JavaScript environment.

What are the prerequisites?

Python 3.8 or higher, the requests library, a Scavio API key, and a basic understanding of HTTP requests. A Scavio API key gives you 500 free credits per month.

Can I complete this tutorial on the free tier?

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio integrate with frameworks like LangChain?

Yes. Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.

Start Building

Get your Scavio API key (the free tier includes 500 credits per month), send your first POST to https://api.scavio.dev/api/v1/search, and retire your proxy pool.