Building ML Sentiment Datasets from News via Search API
GDELT data quality is inconsistent for ML training. A search API returns deduplicated news snippets you can label directly. Python pipeline included.
Building sentiment training datasets from news is a well-known approach, but data collection is where most projects stall. GDELT gives you massive event data, but quality is uneven: duplicate articles, broken URLs, and inconsistent formatting make cleaning a full-time job. Google News scraping breaks every few weeks when the DOM changes. The simpler path is using a search API to collect news snippets and summaries, then feeding those directly into your labeling pipeline.
Why GDELT Falls Short for Sentiment Training
GDELT indexes 300,000+ articles daily across 65 languages. That sounds ideal until you start building a training corpus. The problems show up fast: about 15-20% of article URLs are dead within 48 hours. The tone scores GDELT provides are computed using a dictionary-based method from 2013 that misclassifies sarcasm, hedged language, and domain-specific jargon. If you are training a modern transformer-based sentiment model, you need raw text with human labels, not pre-computed scores from an outdated lexicon.
The other issue is deduplication. GDELT treats syndicated articles as separate events, so a single AP wire story about a Fed rate decision might appear 40-80 times. Your model trains on the same sentence patterns repeatedly, overfitting to wire service style rather than learning generalizable sentiment features.
The Search API Approach
A search API returns deduplicated results with clean snippets. You query a topic, get back 10-50 results with titles, descriptions, and source URLs. The descriptions are already human-readable summaries, which means you skip the HTML extraction and cleaning step entirely.
The workflow is straightforward: define your topics, run searches at regular intervals, store the results, then label them for sentiment. Each search costs $0.005 with Scavio, so 1,000 topic queries cost $5 and yield 10,000-50,000 labeled training samples after annotation.
Building the Collection Pipeline
Here is a Python pipeline that collects news snippets for a list of topics and stores them in a structured format ready for labeling:
import requests
import json
import time
from datetime import datetime

API_KEY = "your_scavio_api_key"
BASE_URL = "https://api.scavio.dev/api/v1/search"

TOPICS = [
    "federal reserve interest rate decision",
    "tesla earnings report analysis",
    "oil price forecast OPEC",
    "tech layoffs 2026 impact",
    "inflation consumer sentiment survey",
]

def collect_news_snippets(topic, num_results=20):
    """Search for a topic and return structured snippets."""
    response = requests.post(
        BASE_URL,
        headers={"x-api-key": API_KEY, "Content-Type": "application/json"},
        json={
            "query": f"{topic} news",
            "num_results": num_results,
        },
    )
    response.raise_for_status()
    data = response.json()

    snippets = []
    for result in data.get("results", []):
        snippets.append({
            "topic": topic,
            "title": result.get("title", ""),
            "snippet": result.get("description", ""),
            "source": result.get("url", ""),
            "collected_at": datetime.utcnow().isoformat(),
            "sentiment_label": None,  # To be filled during annotation
        })
    return snippets

def build_corpus(topics, output_file="sentiment_corpus.jsonl"):
    """Collect snippets for all topics and save as JSONL."""
    all_snippets = []
    for topic in topics:
        print(f"Collecting: {topic}")
        snippets = collect_news_snippets(topic)
        all_snippets.extend(snippets)
        time.sleep(0.5)  # Rate limiting

    with open(output_file, "w") as f:
        for snippet in all_snippets:
            f.write(json.dumps(snippet) + "\n")

    print(f"Collected {len(all_snippets)} snippets to {output_file}")
    return all_snippets

corpus = build_corpus(TOPICS)

From Snippets to Training Data
The collected JSONL file needs annotation before it becomes training data. You have three practical options: manual labeling with Label Studio, semi-automated labeling using a frontier LLM as a first pass with human review, or active learning where you label a small seed set and let the model request the most informative samples.
For the semi-automated approach, send each snippet to an LLM with a structured prompt asking for positive, negative, or neutral classification plus a confidence score. Then have a human reviewer check only the low-confidence predictions. This typically cuts annotation time by 60-70% compared to fully manual labeling.
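If you go the LLM-first route, the labeling pass is a short script. Below is a minimal sketch assuming the official OpenAI Python client; the prompt wording, the gpt-4o-mini model choice, and the 0.8 review threshold are illustrative assumptions, not part of the pipeline above.

import json
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

def build_prompt(text):
    """Ask for a sentiment label plus a confidence score, as JSON."""
    return (
        "Classify the sentiment of this news snippet as positive, negative, or neutral. "
        'Respond with JSON only: {"label": "positive|negative|neutral", "confidence": 0.0-1.0}\n\n'
        f"Snippet: {text}"
    )

def llm_label(text, model="gpt-4o-mini"):
    """First-pass label from an LLM."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(text)}],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    return result["label"], float(result["confidence"])

def label_corpus(corpus_file, output_file, review_threshold=0.8):
    """Label every snippet; flag low-confidence predictions for human review."""
    with open(corpus_file) as fin, open(output_file, "w") as fout:
        for line in fin:
            sample = json.loads(line)
            label, confidence = llm_label(f"{sample['title']}. {sample['snippet']}")
            sample["sentiment_label"] = label
            sample["needs_review"] = confidence < review_threshold
            fout.write(json.dumps(sample) + "\n")

Whichever labeling route you take, the annotated JSONL then gets split into train and validation sets: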
import json
import random

def prepare_training_split(corpus_file, train_ratio=0.8):
    """Split annotated corpus into train/val sets."""
    samples = []
    with open(corpus_file) as f:
        for line in f:
            sample = json.loads(line)
            if sample["sentiment_label"] is not None:
                samples.append({
                    "text": f"{sample['title']}. {sample['snippet']}",
                    "label": sample["sentiment_label"],
                })

    # Shuffle before splitting so the topics collected last do not
    # end up entirely in the validation set
    random.seed(42)
    random.shuffle(samples)

    split_idx = int(len(samples) * train_ratio)
    train = samples[:split_idx]
    val = samples[split_idx:]
    print(f"Train: {len(train)}, Val: {len(val)}")
    return train, val

Temporal Sampling Matters
One mistake teams make is collecting all data in a single session. News sentiment shifts with market cycles, political events, and seasonal patterns. A corpus collected entirely during a market crash will skew heavily negative. Run collections across multiple weeks, ideally covering different market conditions, to build a representative dataset.
Schedule daily or weekly collection runs using cron or a workflow tool like n8n. At $0.005 per query, running 50 topic searches daily for 30 days costs $7.50 and produces roughly 15,000-75,000 raw snippets before deduplication.
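A thin wrapper around the build_corpus function above makes the scheduled runs easy to automate. The sketch below is one way to do it under two assumptions of mine: each run writes a date-stamped raw file, and duplicates are detected by URL before appending to the master corpus.

import json
from datetime import date

def daily_collection_run(topics, corpus_file="sentiment_corpus.jsonl"):
    """One scheduled run: collect today's snippets, skip URLs already in the corpus."""
    seen_urls = set()
    try:
        with open(corpus_file) as f:
            seen_urls = {json.loads(line)["source"] for line in f}
    except FileNotFoundError:
        pass  # First run, nothing collected yet

    # Keep the raw per-day output for auditing, then append only new URLs
    daily_file = f"raw_{date.today().isoformat()}.jsonl"
    snippets = build_corpus(topics, output_file=daily_file)
    new_snippets = [s for s in snippets if s["source"] not in seen_urls]

    with open(corpus_file, "a") as f:
        for snippet in new_snippets:
            f.write(json.dumps(snippet) + "\n")
    print(f"Appended {len(new_snippets)} new, skipped {len(snippets) - len(new_snippets)} duplicates")

if __name__ == "__main__":
    # Example crontab entry for a daily 07:00 run:
    # 0 7 * * * /usr/bin/python3 /path/to/daily_collect.py
    daily_collection_run(TOPICS)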
Comparison with Full Scraping
Building a full news scraper means maintaining parsers for every source, handling paywalls, managing proxy rotation, and dealing with anti-bot measures. A team at a fintech startup reported spending 3 months building and maintaining a news scraper that covered 200 sources. They switched to a search API approach and reduced their data pipeline code from 4,000 lines to 120 lines while improving snippet quality.
The tradeoff: search API snippets are summaries, not full articles. If your model needs full article text, you will still need to fetch and parse individual URLs. But for headline-level and snippet-level sentiment, which covers most financial NLP use cases, the search approach is sufficient and dramatically simpler.
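If full text is required, the source URLs already stored in the corpus can be fetched and parsed with a generic extractor. A rough sketch using requests and BeautifulSoup follows; it is deliberately naive, and paywalls, anti-bot measures, and site-specific markup are exactly the complications it does not handle.

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url, timeout=10):
    """Best-effort full-text fetch; returns None on any failure."""
    try:
        resp = requests.get(
            url, timeout=timeout, headers={"User-Agent": "Mozilla/5.0"}
        )
        resp.raise_for_status()
    except requests.RequestException:
        return None

    soup = BeautifulSoup(resp.text, "html.parser")
    # Naive extraction: keep the longer paragraph tags so navigation text
    # and cookie banners fall below the length cutoff
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    text = "\n".join(p for p in paragraphs if len(p) > 40)
    return text or None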
Cost Breakdown
A typical ML sentiment project needs 10,000-50,000 labeled samples for fine-tuning. Using search API collection at 20 results per query:
- 500-2,500 queries needed: $2.50-$12.50 with Scavio at $0.005/query
- Semi-automated labeling with LLM: ~$5-15 for initial pass
- Human review of low-confidence labels: 10-20 hours depending on corpus size
- Total data cost: under $30 for a production-quality training corpus
Compare this to GDELT processing where you spend $0 on data but 40-80 hours cleaning, deduplicating, and reformatting before you can even start labeling.
When This Does Not Work
This approach has clear limitations. If you need full article text for document-level sentiment, search snippets are not enough. If you need multilingual coverage beyond what Google indexes well, GDELT still has better breadth. And if you need historical data going back years, search APIs return current results, not archives.
For most teams building real-time sentiment classifiers on English news, the search API corpus approach gets you from zero to training data in a day instead of a month.