The Problem
Building ML training sets for sentiment analysis requires large volumes of labeled news text. Manually collecting and labeling news articles is expensive and slow. Existing news APIs charge per article and lack the structured metadata (source, date, topic) needed for clean training sets. Teams end up with small, biased corpora that do not generalize well across topics and sources.
The Scavio Solution
Build an automated corpus builder using Scavio's Google News search. Query news-related keywords, extract titles and snippets from organic results, and use the structured metadata (source domain, publication date, position) as features. The pipeline produces clean training sets with thousands of labeled examples at search API prices instead of news API prices.
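The core of the pipeline is turning one organic result into one labeled training example. A minimal sketch of that mapping, using a hand-written sample result (field names follow the response shape used in the full examples below; the sample values are purely illustrative):

```python
from datetime import date
from urllib.parse import urlparse

# Illustrative sample of one organic search result. The field names
# (title, snippet, link, position) match the examples below; the values
# here are made up for demonstration.
sample_result = {
    "title": "EU passes new AI rules",
    "snippet": "Lawmakers approved sweeping regulation on Friday.",
    "link": "https://www.example-news.com/eu-ai-rules",
    "position": 1,
}

def to_example(item: dict, topic: str) -> dict:
    """Turn one search result into a labeled corpus entry."""
    return {
        "text": f"{item['title']}. {item.get('snippet', '')}",
        "source": urlparse(item.get("link", "")).netloc,  # source domain feature
        "topic": topic,                                   # query label
        "date_collected": str(date.today()),
        "position": item.get("position"),                 # rank feature
    }

entry = to_example(sample_result, "artificial intelligence regulation")
print(entry["source"])  # www.example-news.com
```

Title and snippet become the text to label, while source domain, topic, and position travel along as structured features.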
Before
A data science team used a dedicated news API at $0.05/article to build a sentiment corpus. Building a 10K article training set cost $500 and took 2 weeks of curation. The corpus was biased toward English-language sources from the news API's limited index.
After
The same team uses Scavio Google News queries to build corpora at $0.005/query. Each query returns 10+ results, so 1K queries ($5) produce a 10K+ snippet corpus. Monthly corpus refresh costs $5 instead of $500. Build time dropped from 2 weeks to 2 hours.
Who It Is For
Data scientists and ML engineers building sentiment analysis models who need affordable, structured news corpora. NLP researchers collecting training data at scale.
Key Benefits
- Build 10K+ snippet corpora for $5 instead of $500 via news APIs
- Structured metadata (source, date, position) included with every result
- Refresh corpora monthly at negligible cost for model retraining
- Google News index covers broader sources than dedicated news APIs
- Reduce corpus build time from weeks to hours with automated pipelines
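The cost comparison above follows from simple arithmetic; a quick sketch (per-article and per-query prices taken from the Before/After sections, results per query assumed to be 10):

```python
NEWS_API_COST_PER_ARTICLE = 0.05  # dedicated news API price (Before section)
SEARCH_COST_PER_QUERY = 0.005     # Scavio price per query (After section)
SNIPPETS_PER_QUERY = 10           # assumed organic results per query

target_corpus_size = 10_000

# Dedicated news API: pay per article.
news_api_cost = target_corpus_size * NEWS_API_COST_PER_ARTICLE

# Search-based pipeline: pay per query, each query yields multiple snippets.
queries_needed = target_corpus_size // SNIPPETS_PER_QUERY
search_cost = queries_needed * SEARCH_COST_PER_QUERY

print(f"News API: ${news_api_cost:.0f}")  # $500
print(f"Scavio:   ${search_cost:.0f}")    # $5
```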
Python Example
import requests
import json
from datetime import date
API_KEY = "your_scavio_api_key"
from urllib.parse import urlparse

def build_corpus(topics: list, results_per_topic: int = 10) -> list:
    corpus = []
    for topic in topics:
        r = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers={"x-api-key": API_KEY},
            json={"platform": "google", "query": f"{topic} news 2026"},
            timeout=10,
        )
        r.raise_for_status()
        data = r.json()
        for item in data.get("organic", [])[:results_per_topic]:
            corpus.append({
                "text": f"{item['title']}. {item.get('snippet', '')}",
                "source": urlparse(item.get("link", "")).netloc,  # source domain
                "topic": topic,
                "date_collected": str(date.today()),
                "position": item.get("position"),
            })
    return corpus
topics = ["artificial intelligence regulation", "climate tech funding", "semiconductor shortage"]
corpus = build_corpus(topics)
print(f"Corpus size: {len(corpus)} entries")
with open("news_corpus.json", "w") as f:
    json.dump(corpus, f, indent=2)
JavaScript Example
const API_KEY = "your_scavio_api_key";
async function buildCorpus(topics, resultsPerTopic = 10) {
  const corpus = [];
  for (const topic of topics) {
    const res = await fetch("https://api.scavio.dev/api/v1/search", {
      method: "POST",
      headers: { "x-api-key": API_KEY, "content-type": "application/json" },
      body: JSON.stringify({ platform: "google", query: `${topic} news 2026` }),
    });
    if (!res.ok) throw new Error(`Search request failed: ${res.status}`);
    const data = await res.json();
    for (const item of (data.organic || []).slice(0, resultsPerTopic)) {
      corpus.push({
        text: `${item.title}. ${item.snippet || ""}`,
        source: item.link ? new URL(item.link).hostname : "",
        topic,
        dateCollected: new Date().toISOString().split("T")[0],
        position: item.position,
      });
    }
  }
  return corpus;
}
const topics = ["artificial intelligence regulation", "climate tech funding", "semiconductor shortage"];
const corpus = await buildCorpus(topics);
console.log(`Corpus size: ${corpus.length} entries`);
Platforms Used
Web search with knowledge graph, PAA, and AI overviews