News Sentiment Corpus Builder

The Problem

Building ML training sets for sentiment analysis requires large volumes of labeled news text. Manually collecting and labeling news articles is expensive and slow. Existing news APIs charge per article and lack the structured metadata (source, date, topic) needed for clean training sets. Teams end up with small, biased corpora that do not generalize well across topics and sources.

The Scavio Solution

Build an automated corpus builder using Scavio's Google News search. Query news-related keywords, extract titles and snippets from organic results, and use the structured metadata (source domain, publication date, position) as features. The pipeline produces clean training sets with thousands of labeled examples at search API prices instead of news API prices.

Before

A data science team used a dedicated news API at $0.05/article to build a sentiment corpus. Building a 10K article training set cost $500 and took 2 weeks of curation. The corpus was biased toward English-language sources from the news API's limited index.

After

The same team uses Scavio Google News queries to build corpora at $0.005/query. Each query returns 10+ results, so 1K queries ($5) produce a 10K+ snippet corpus. Monthly corpus refresh costs $5 instead of $500. Build time dropped from 2 weeks to 2 hours.
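The cost arithmetic above can be checked with a quick calculation. The per-article and per-query prices are the figures quoted in this example, not universal rates:

```python
# Figures from the example above; actual pricing may differ.
NEWS_API_COST_PER_ARTICLE = 0.05   # $ per article via a dedicated news API
SCAVIO_COST_PER_QUERY = 0.005      # $ per Scavio search query
RESULTS_PER_QUERY = 10             # organic results returned per query

target_corpus_size = 10_000  # snippets

queries_needed = target_corpus_size // RESULTS_PER_QUERY
scavio_cost = queries_needed * SCAVIO_COST_PER_QUERY
news_api_cost = target_corpus_size * NEWS_API_COST_PER_ARTICLE

print(f"Queries needed: {queries_needed}")      # 1000
print(f"Scavio cost:    ${scavio_cost:.2f}")    # $5.00
print(f"News API cost:  ${news_api_cost:.2f}")  # $500.00
```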

Who It Is For

Data scientists and ML engineers building sentiment analysis models who need affordable, structured news corpora. NLP researchers collecting training data at scale.

Key Benefits

  • Build 10K+ snippet corpora for $5 instead of $500 via news APIs
  • Structured metadata (source, date, position) included with every result
  • Refresh corpora monthly at negligible cost for model retraining
  • Google News index covers broader sources than dedicated news APIs
  • Reduce corpus build time from weeks to hours with automated pipelines
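For the monthly-refresh workflow mentioned above, new snippets can be merged into an existing corpus while skipping texts already collected. A minimal sketch, assuming the record shape used in the Python example on this page (a dict with a `"text"` field); adapt the dedup key to your own schema:

```python
def merge_corpus(existing: list, new_entries: list) -> list:
    """Merge new snippets into an existing corpus, dropping duplicate texts."""
    seen = {entry["text"] for entry in existing}
    merged = list(existing)
    for entry in new_entries:
        if entry["text"] not in seen:
            seen.add(entry["text"])
            merged.append(entry)
    return merged

# Tiny illustrative records (not real API output):
old = [{"text": "AI rules tighten.", "topic": "ai"}]
new = [
    {"text": "AI rules tighten.", "topic": "ai"},         # duplicate, dropped
    {"text": "Chip supply recovers.", "topic": "chips"},  # new, kept
]
merged = merge_corpus(old, new)
print(len(merged))  # 2
```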

Python Example

Python
import requests
import json
from datetime import date

API_KEY = "your_scavio_api_key"

def build_corpus(topics: list, results_per_topic: int = 10) -> list:
    corpus = []
    for topic in topics:
        r = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers={"x-api-key": API_KEY},
            json={"platform": "google", "query": f"{topic} news 2026"},
            timeout=10,
        )
        r.raise_for_status()  # fail fast on HTTP errors
        data = r.json()
        for item in data.get("organic", [])[:results_per_topic]:
            link = item.get("link", "")
            corpus.append({
                # Single quotes inside the f-string keep this valid on Python < 3.12
                "text": f"{item.get('title', '')}. {item.get('snippet', '')}",
                # Extract the domain from an absolute URL like "https://host/path"
                "source": link.split("/")[2] if link.count("/") >= 2 else "",
                "topic": topic,
                "date_collected": str(date.today()),
                "position": item.get("position"),
            })
    return corpus

topics = ["artificial intelligence regulation", "climate tech funding", "semiconductor shortage"]
corpus = build_corpus(topics)
print(f"Corpus size: {len(corpus)} entries")
with open("news_corpus.json", "w") as f:
    json.dump(corpus, f, indent=2)
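The pipeline above collects text and metadata but not sentiment labels. A common bootstrap before human review is weak labeling with a small keyword lexicon. The lexicon below is a hypothetical placeholder, not part of Scavio's API; replace it with a published sentiment lexicon before training:

```python
# Hypothetical seed lexicon for weak labeling; swap in a real one for production.
POSITIVE = {"surge", "growth", "record", "breakthrough", "rally"}
NEGATIVE = {"shortage", "decline", "crisis", "lawsuit", "layoffs"}

def weak_label(text: str) -> str:
    """Assign a provisional sentiment label from keyword overlap."""
    words = set(text.lower().split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

sample = {"text": "Semiconductor shortage deepens as demand grows.", "topic": "chips"}
sample["label"] = weak_label(sample["text"])
print(sample["label"])  # negative
```

Labels produced this way are noisy; treat them as a starting point for annotation, not ground truth.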

JavaScript Example

JavaScript
const API_KEY = "your_scavio_api_key";

async function buildCorpus(topics, resultsPerTopic = 10) {
  const corpus = [];
  for (const topic of topics) {
    const res = await fetch("https://api.scavio.dev/api/v1/search", {
      method: "POST",
      headers: { "x-api-key": API_KEY, "content-type": "application/json" },
      body: JSON.stringify({ platform: "google", query: `${topic} news 2026` }),
    });
    if (!res.ok) throw new Error(`Search failed: ${res.status}`);
    const data = await res.json();
    for (const item of (data.organic || []).slice(0, resultsPerTopic)) {
      corpus.push({
        text: `${item.title}. ${item.snippet || ""}`,
        source: item.link ? new URL(item.link).hostname : "",
        topic,
        dateCollected: new Date().toISOString().split("T")[0],
        position: item.position,
      });
    }
  }
  return corpus;
}

const topics = ["artificial intelligence regulation", "climate tech funding", "semiconductor shortage"];
const corpus = await buildCorpus(topics);
console.log(`Corpus size: ${corpus.length} entries`);

Platforms Used

Google

Web search with knowledge graph, PAA, and AI overviews

Frequently Asked Questions

What problem does this solve?

Building ML training sets for sentiment analysis requires large volumes of labeled news text. Manually collecting and labeling news articles is expensive and slow. Existing news APIs charge per article and lack the structured metadata (source, date, topic) needed for clean training sets. Teams end up with small, biased corpora that do not generalize well across topics and sources.

How does the Scavio solution work?

Build an automated corpus builder using Scavio's Google News search. Query news-related keywords, extract titles and snippets from organic results, and use the structured metadata (source domain, publication date, position) as features. The pipeline produces clean training sets with thousands of labeled examples at search API prices instead of news API prices.

Who is it for?

Data scientists and ML engineers building sentiment analysis models who need affordable, structured news corpora. NLP researchers collecting training data at scale.

Can I try it for free?

Yes. Scavio's free tier includes 250 credits per month with no credit card required. That is enough to validate this solution in your workflow.