Tutorial

How to Build a RAG Pipeline Without Scraping

Build a RAG pipeline using search APIs instead of web scrapers. Structured JSON from Scavio replaces Crawl4AI, SearXNG, or Firecrawl.

An r/Rag post asked which scraper to use for building a large RAG corpus. The reframe: for public, indexed content, search APIs replace scrapers entirely. No proxy management, no anti-bot fights, and structured JSON from the start.

Prerequisites

  • Scavio API key
  • Vector database (Chroma, Pinecone, or Weaviate)
  • LLM API key

Walkthrough

Step 1: Generate seed queries

Create 50-200 seed queries for your knowledge domain.

Python
seed_queries = [
    'AI agent architecture patterns 2026',
    'multi-agent orchestration frameworks',
    'LLM tool calling best practices',
    # ... 50-200 queries covering your domain
]
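
If hand-writing dozens of queries is tedious, they can be generated combinatorially from domain topics and facets. A minimal sketch; the topic and facet lists are illustrative placeholders for your own domain vocabulary:

Python
from itertools import product

topics = ['AI agent architecture', 'multi-agent orchestration', 'LLM tool calling']
facets = ['patterns', 'frameworks', 'best practices', 'benchmarks', 'pitfalls']

# 3 topics x 5 facets = 15 queries; grow the lists to reach 50-200
seed_queries = [f'{t} {f}' for t, f in product(topics, facets)]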

Step 2: Fetch structured results from Scavio

Search Google + Reddit for each query.

Python
import os
import requests

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
SEARCH_URL = 'https://api.scavio.dev/api/v1/search'

def fetch_sources(query):
    """Search Google and Reddit for one seed query; return both raw payloads."""
    google = requests.post(SEARCH_URL, headers=H,
        json={'platform': 'google', 'query': query}, timeout=30).json()
    reddit = requests.post(SEARCH_URL, headers=H,
        json={'platform': 'reddit', 'query': query}, timeout=30).json()
    return {'google': google, 'reddit': reddit}
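
With fetch_sources defined, the whole corpus comes from one loop over the seed list. The one-second pause below is illustrative pacing, not a documented Scavio rate limit:

Python
import time

raw_results = []
for q in seed_queries:
    raw_results.append(fetch_sources(q))
    time.sleep(1)  # illustrative politeness delay; check your plan's rate limits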

Step 3: Extract and deduplicate content

Pull unique URLs from each result payload; call /extract for full page content where a snippet is not enough (a sketch follows the code below).

Python
seen_urls = set()

def extract_unique(results):
    """Deduplicate one platform's payload (e.g. the 'google' dict) by URL."""
    docs = []
    for r in results.get('organic_results', []):
        if r['link'] not in seen_urls:
            seen_urls.add(r['link'])
            docs.append({'url': r['link'], 'title': r['title'],
                         'snippet': r.get('snippet', '')})
    return docs
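
Where a snippet is too thin, fetch the full page through /extract. A minimal sketch, assuming the endpoint lives at api/v1/extract alongside the search endpoint and accepts a url field; check the Scavio docs for the exact path and parameters:

Python
def extract_full_content(url):
    # Assumed endpoint path and payload shape; verify against the Scavio docs
    resp = requests.post('https://api.scavio.dev/api/v1/extract',
                         headers=H, json={'url': url}, timeout=60)
    resp.raise_for_status()
    return resp.json()

# Enrich the top documents (docs: the deduplicated list from extract_unique)
for doc in docs[:2000]:
    doc['content'] = extract_full_content(doc['url'])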

Step 4: Chunk and embed

Split content into chunks and generate embeddings.

Python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embeddings = OpenAIEmbeddings()

def process_doc(doc):
    # Prefer full /extract content when present; fall back to the snippet
    text = doc.get('content') or doc['snippet']
    chunks = splitter.split_text(text)
    vectors = embeddings.embed_documents(chunks)  # batch-embed in one call
    return list(zip(chunks, vectors))
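
To persist the vectors, here is a minimal sketch using Chroma, one of the databases from the prerequisites; the collection name and chunk ID scheme are illustrative:

Python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path='...') to keep data
collection = client.create_collection('rag_corpus')

def store_doc(doc):
    pairs = process_doc(doc)
    collection.add(
        ids=[f"{doc['url']}#{i}" for i in range(len(pairs))],
        documents=[chunk for chunk, _ in pairs],
        embeddings=[vec for _, vec in pairs],
        metadatas=[{'url': doc['url'], 'title': doc['title']}] * len(pairs),
    )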

Step 5: Query the RAG pipeline

Embed the query, retrieve relevant chunks, generate answer.

Python
def rag_query(question):
    q_emb = embeddings.embed_query(question)
    # Retrieve top-5 chunks from vector DB
    # Feed to LLM with: 'Answer based on these sources: {chunks}'
    # Return answer with source URLs
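
Filling in the stub with the Chroma collection from Step 4 and the OpenAI client, a minimal sketch; the model name and prompt wording are illustrative choices, not requirements:

Python
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def rag_query(question):
    q_emb = embeddings.embed_query(question)
    hits = collection.query(query_embeddings=[q_emb], n_results=5)  # top-5 chunks
    chunks, metas = hits['documents'][0], hits['metadatas'][0]
    prompt = ('Answer based on these sources:\n\n' + '\n\n'.join(chunks)
              + f'\n\nQuestion: {question}')
    resp = llm.chat.completions.create(
        model='gpt-4o-mini',  # illustrative model choice
        messages=[{'role': 'user', 'content': prompt}],
    )
    return {'answer': resp.choices[0].message.content,
            'sources': [m['url'] for m in metas]}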

Cost Math

200 seed queries × 2 platforms = 400 API calls, roughly $2. Each call returns 10 results, for up to 4,000 unique sources. Running the top 2,000 through /extract adds about $10, so the full corpus build costs roughly $12 for 2,000 high-quality documents.

JavaScript Example

JavaScript
const resp = await fetch('https://api.scavio.dev/api/v1/search', {
  method: 'POST',
  headers: {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'},
  body: JSON.stringify({platform: 'google', query: seedQuery})
});
const results = await resp.json();  // same structured payload as the Python example

Expected Output

A RAG pipeline that sources documents from Google + Reddit via Scavio, with no scraping infrastructure, no proxy costs, and structured JSON throughout.
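
For reference, an illustrative search response shaped after the fields the Step 3 code reads (organic_results, link, title, snippet); the real payload may carry additional fields:

JSON
{
  "organic_results": [
    {
      "title": "Multi-agent orchestration frameworks compared",
      "link": "https://example.com/multi-agent-frameworks",
      "snippet": "A comparison of orchestration patterns for LLM agents..."
    }
  ]
}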

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What do I need before starting?

A Scavio API key, a vector database (Chroma, Pinecone, or Weaviate), and an LLM API key. A Scavio API key gives you 500 free credits per month.

Can I complete this on the free tier?

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio integrate with frameworks like LangChain?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.
