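An r/Rag post asked which scraper to use for high-volume RAG data. The reframe: for publicly indexed content, a search API replaces the scraper. No proxy management, no anti-bot battles, and structured JSON from the start.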
Prerequisites
- Scavio API key
- Vector database (Chroma, Pinecone, or Weaviate)
- LLM API key
Walkthrough
Step 1: Generate Seed Queries
Create 50-200 seed queries for your knowledge domain.
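Writing 50-200 queries by hand can be tedious. One possible shortcut, sketched here under the assumption that your domain breaks down into a few topics and a few recurring angles, is to cross the two lists. The names `TOPICS`, `MODIFIERS`, and `build_seed_queries` are hypothetical and not part of any library.

```python
from itertools import product

# Hypothetical topic and modifier lists; replace with terms from your own domain.
TOPICS = [
    'AI agent architecture',
    'multi-agent orchestration',
    'LLM tool calling',
]
MODIFIERS = ['patterns', 'frameworks', 'best practices', 'pitfalls']

def build_seed_queries(topics, modifiers, limit=200):
    """Cross every topic with every modifier, drop duplicates, keep order."""
    seen = set()
    queries = []
    for topic, modifier in product(topics, modifiers):
        query = f'{topic} {modifier}'
        key = query.lower()  # case-insensitive duplicate check
        if key not in seen:
            seen.add(key)
            queries.append(query)
    return queries[:limit]

generated_queries = build_seed_queries(TOPICS, MODIFIERS)
print(len(generated_queries))  # 3 topics x 4 modifiers = 12 queries
```

Generated combinations tend to be uneven in quality, so it is worth reviewing them before spending API calls. A hand-curated list, as in this step's own example, works equally well for a small domain.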
Python
seed_queries = [
    'AI agent architecture patterns 2026',
    'multi-agent orchestration frameworks',
    'LLM tool calling best practices',
    # ... 50-200 queries covering your domain
]

Step 2: Fetch Structured Results from Scavio
Search Google and Reddit for each query.
Python
import os
import requests

SEARCH_URL = 'https://api.scavio.dev/api/v1/search'
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def fetch_sources(query):
    # Run the same query against both platforms; the timeout avoids hanging on a stalled request
    google = requests.post(SEARCH_URL, headers=H, timeout=30,
                           json={'platform': 'google', 'query': query}).json()
    reddit = requests.post(SEARCH_URL, headers=H, timeout=30,
                           json={'platform': 'reddit', 'query': query}).json()
    return {'google': google, 'reddit': reddit}

Step 3: Extract and Deduplicate
Extract the unique URLs; if needed, use /extract to fetch the full content.
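Deduplicating on the raw URL string can miss near-duplicates that differ only in tracking parameters, a trailing slash, or a fragment. A minimal normalization sketch using only the standard library follows; the helper name `normalize_url` and the choice of which parameters count as tracking noise are assumptions, not something the Scavio API prescribes.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters treated as tracking noise (an assumption; extend as needed).
TRACKING_PREFIXES = ('utm_',)
TRACKING_KEYS = {'gclid', 'fbclid'}

def normalize_url(url):
    """Return a canonical form of `url` suitable for use as a dedup key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith('www.'):
        host = host[4:]
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip('/') or '/'
    # Drop the fragment: it never changes which document is fetched.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ''))

print(normalize_url('https://www.Example.com/post/?utm_source=x&id=7#top'))
# prints https://example.com/post?id=7
```

Passing each result's link through a helper like this before the set-membership check would collapse such variants into a single entry.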
Python
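seen_urls = set()  # shared across calls so duplicates are dropped corpus-wide

def extract_unique(results):
    docs = []
    for r in results.get('organic_results', []):
        if r['link'] not in seen_urls:
            seen_urls.add(r['link'])
            # .get() tolerates results that lack a title or snippet
            docs.append({'url': r['link'], 'title': r.get('title', ''),
                         'snippet': r.get('snippet', '')})
    return docs

Step 4: Chunk and Embed
Split the content into chunks and generate embeddings.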
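To make chunk size and overlap concrete, here is a minimal fixed-window sketch in plain Python. It is not LangChain's recursive algorithm, only an illustration of what the two numbers control; `chunk_text` is a hypothetical helper, and the defaults mirror the 500/50 values used in this step's code.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = 'x' * 1200
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # prints [500, 500, 300]
```

Each window starts 450 characters after the previous one, so neighbouring chunks share 50 characters and a sentence cut at a boundary still appears whole in one of them. `RecursiveCharacterTextSplitter` additionally prefers to break on paragraph and sentence separators, so its boundaries will differ from this sketch.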
Python
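# Newer LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embeddings = OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment

def process_doc(doc):
    chunks = splitter.split_text(doc['snippet'])
    # Embed all chunks in one batched call rather than one request per chunk
    vectors = embeddings.embed_documents(chunks)
    return list(zip(chunks, vectors))

Step 5: Query the RAG Pipeline
Embed the query, retrieve the relevant chunks, and generate an answer.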
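The retrieval half of this step can be illustrated without any external service. The sketch below ranks stored chunks by cosine similarity and formats the winners into a prompt; the toy two-dimensional vectors stand in for real embeddings, and `cosine`, `top_k`, and `build_prompt` are hypothetical helpers. In practice a vector database performs the ranking.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """Return the k entries of `index` most similar to `query_vec`.

    `index` is a list of (chunk_text, vector, source_url) tuples.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]

def build_prompt(question, hits):
    """Format retrieved chunks into a prompt that keeps source URLs visible."""
    sources = '\n'.join(
        f'[{i + 1}] {text} ({url})' for i, (text, _, url) in enumerate(hits)
    )
    return f'Answer based on these sources:\n{sources}\n\nQuestion: {question}'

# Toy 2-dimensional vectors stand in for real embeddings.
toy_index = [
    ('Agents call tools via JSON schemas.', [1.0, 0.0], 'https://example.com/a'),
    ('Orchestrators route tasks between agents.', [0.0, 1.0], 'https://example.com/b'),
    ('Tool calls should validate arguments.', [0.9, 0.1], 'https://example.com/c'),
]
hits = top_k([1.0, 0.0], toy_index, k=2)
print(build_prompt('How do agents call tools?', hits))
```

Carrying the source URL alongside each chunk is what lets the final answer cite where its information came from.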
Python
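def rag_query(question, collection, llm):
    # Assumptions (unspecified in the original): `collection` is a Chroma collection whose
    # metadata stores each chunk's 'url'; `llm` is a LangChain chat model
    q_emb = embeddings.embed_query(question)
    # Retrieve top-5 chunks from vector DB
    hits = collection.query(query_embeddings=[q_emb], n_results=5)
    chunks, metas = hits['documents'][0], hits['metadatas'][0]
    # Feed to LLM, then return the answer with its source URLs
    answer = llm.invoke(f'Answer based on these sources: {chunks}\n\nQuestion: {question}')
    return answer.content, [m['url'] for m in metas]

Python Example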
Python
# Cost math: 200 seed queries × 2 platforms = 400 API calls = $2
# Each call returns 10 results = up to 4,000 sources before deduplication
# Top 2,000 via /extract = ~$10 additional
# Total corpus build: ~$12 for 2,000 high-quality documents

JavaScript Example
JavaScript
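// seedQuery: one entry from your seed query list
const resp = await fetch('https://api.scavio.dev/api/v1/search', {
  method: 'POST',
  headers: {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'},
  body: JSON.stringify({platform: 'google', query: seedQuery})
});
const results = await resp.json();  // parse the structured JSON response

Expected Output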
A RAG pipeline that sources its documents from Google and Reddit via Scavio, with no scraping infrastructure, no proxy costs, and structured JSON throughout.
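The cost math in the Python example can be turned into a small calculator for other corpus sizes. This is a sketch only: the default `price_per_call` of $0.005 is inferred from the figures quoted there ($2 for 400 search calls, about $10 for 2,000 extractions) and is not a published price, and `corpus_cost` is a hypothetical helper.

```python
def corpus_cost(seed_queries=200, platforms=2, results_per_call=10,
                extracted_docs=2000, price_per_call=0.005):
    """Estimate API spend for a corpus build.

    price_per_call is an assumption inferred from the article's figures.
    """
    search_calls = seed_queries * platforms
    raw_results = search_calls * results_per_call  # before deduplication
    search_cost = search_calls * price_per_call
    extract_cost = extracted_docs * price_per_call
    return {
        'search_calls': search_calls,
        'raw_results': raw_results,
        'search_cost': round(search_cost, 2),
        'extract_cost': round(extract_cost, 2),
        'total_cost': round(search_cost + extract_cost, 2),
    }

estimate = corpus_cost()
print(estimate['search_calls'], estimate['raw_results'], estimate['total_cost'])
# prints 400 4000 12.0
```

Check the current pricing before relying on the totals. The number of documents that survive deduplication will also be lower than `raw_results`.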