Overview
Data scientists lose hours searching for datasets across fragmented platforms. This workflow uses Scavio's MCP server to search across Kaggle, HuggingFace, data.gov, and Google Dataset Search from a single agent connection. Results are deduplicated, scored by relevance and freshness, and formatted for notebook import.
Trigger
On-demand when a data scientist needs datasets for a new project.
Schedule
On-demand
Workflow Steps
Define Dataset Requirements
Specify the topic, required columns, minimum size, preferred format, and freshness requirements.
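The requirements can be captured as a small spec that later steps filter against. A minimal sketch; the field names here are illustrative, not a fixed schema:

```python
# Illustrative requirements spec; field names are assumptions, not a Scavio schema.
requirements = {
    "topic": "air quality monitoring sensor data",
    "required_columns": ["timestamp", "pm2_5", "latitude", "longitude"],
    "min_rows": 10_000,
    "preferred_formats": ["csv", "parquet"],
    "max_age_days": 365,
}

def format_ok(filename: str, spec: dict) -> bool:
    """True if the file extension matches one of the preferred formats."""
    return filename.rsplit(".", 1)[-1].lower() in spec["preferred_formats"]
```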
Search Across Platforms via MCP
Issue tools/call requests to the Scavio MCP server, one site-restricted query per platform: Kaggle, HuggingFace, data.gov, and Google Dataset Search.
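Over MCP, each platform query becomes one tools/call request. A hedged sketch of the JSON-RPC payload follows; the tool name ("search") and its argument names are assumptions here — check the server's tools/list response for the actual schema:

```python
import json

def build_tools_call(topic: str, site_filter: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 tools/call payload for one platform query.
    Tool name and argument names are assumed; verify via tools/list."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "search",
            "arguments": {
                "query": f"{topic} dataset {site_filter}",
                "country_code": "us",
            },
        },
    })
```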
Parse and Normalize Results
Extract dataset titles, URLs, descriptions, and source platforms from search results.
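Each raw organic result can be mapped into one canonical record so that dedup and scoring see a uniform shape. A sketch mirroring the response keys used by the Python implementation in this document (search results use "link"; the canonical record stores "url"):

```python
def normalize_result(raw: dict, source: str) -> dict:
    """Map one raw search result into the workflow's canonical dataset record."""
    return {
        "title": raw.get("title", ""),
        "url": raw.get("link", ""),  # results use "link"; we store "url"
        "source": source,
        "snippet": raw.get("snippet", ""),
        "date": raw.get("date", ""),
    }
```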
Deduplicate and Score
Remove duplicate datasets that appear on multiple platforms, then score each remaining candidate by relevance to the requirements and by freshness.
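Dedup keys on URL; scoring can be as simple as keyword overlap plus a freshness decay. One possible heuristic sketch (the 0.7/0.3 weights are arbitrary assumptions, and dates are assumed ISO-formatted):

```python
from datetime import datetime, timezone

def score_dataset(record: dict, keywords: list, now: datetime = None) -> float:
    """Heuristic score: fraction of keywords found in title+snippet,
    blended with a linear freshness decay over one year."""
    text = f"{record.get('title', '')} {record.get('snippet', '')}".lower()
    relevance = sum(kw.lower() in text for kw in keywords) / max(len(keywords), 1)
    freshness = 0.0
    if record.get("date"):
        try:
            published = datetime.fromisoformat(record["date"]).replace(tzinfo=timezone.utc)
            age_days = ((now or datetime.now(timezone.utc)) - published).days
            freshness = max(0.0, 1.0 - age_days / 365)
        except ValueError:
            pass  # unparseable date: no freshness bonus
    return 0.7 * relevance + 0.3 * freshness
```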
Output Discovery Report
Format results as a markdown report or JSON file with download links and metadata.
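The markdown variant of the report can be a simple table with clickable download links. A sketch over the canonical record shape:

```python
def to_markdown_report(datasets: list) -> str:
    """Render discovery results as a markdown table with download links."""
    lines = [
        "| Source | Dataset | Date |",
        "| --- | --- | --- |",
    ]
    for d in datasets:
        lines.append(f"| {d['source']} | [{d['title']}]({d['url']}) | {d.get('date') or 'n/a'} |")
    return "\n".join(lines)
```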
Python Implementation
import os

import requests

API_KEY = os.environ["SCAVIO_API_KEY"]
H = {"x-api-key": API_KEY, "Content-Type": "application/json"}
PLATFORMS = {
"kaggle": "site:kaggle.com/datasets",
"huggingface": "site:huggingface.co/datasets",
"data_gov": "site:data.gov",
"google_datasets": "site:datasetsearch.research.google.com",
}
def discover_datasets(topic: str, platforms: list | None = None) -> list:
    """Search each target platform via Scavio and return deduplicated records."""
    targets = {k: v for k, v in PLATFORMS.items() if not platforms or k in platforms}
    all_datasets = []
    for source, site_filter in targets.items():
        resp = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers=H,
            json={"query": f"{topic} dataset {site_filter}", "country_code": "us"},
            timeout=10,
        )
        resp.raise_for_status()
        for r in resp.json().get("organic_results", []):
            all_datasets.append({
                "title": r.get("title", ""),
                "url": r.get("link", ""),
                "source": source,
                "snippet": r.get("snippet", ""),
                "date": r.get("date", ""),
            })
    # Deduplicate by URL, keeping the first occurrence
    seen = set()
    unique = []
    for d in all_datasets:
        if d["url"] not in seen:
            seen.add(d["url"])
            unique.append(d)
    return unique
results = discover_datasets("air quality monitoring sensor data")
print(f"Found {len(results)} datasets across {len(set(d['source'] for d in results))} platforms")
for d in results[:8]:
    print(f" [{d['source']}] {d['title']}: {d['url']}")

JavaScript Implementation
const H = { 'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json' };
const PLATFORMS = {
  kaggle: 'site:kaggle.com/datasets',
  huggingface: 'site:huggingface.co/datasets',
  data_gov: 'site:data.gov',
  google_datasets: 'site:datasetsearch.research.google.com',
};

async function discoverDatasets(topic, platforms) {
  const targets = platforms ? Object.fromEntries(Object.entries(PLATFORMS).filter(([k]) => platforms.includes(k))) : PLATFORMS;
  const all = [];
  for (const [source, siteFilter] of Object.entries(targets)) {
    const r = await fetch('https://api.scavio.dev/api/v1/search', {
      method: 'POST',
      headers: H,
      body: JSON.stringify({ query: `${topic} dataset ${siteFilter}`, country_code: 'us' }),
    });
    for (const o of (await r.json()).organic_results || []) {
      all.push({ title: o.title || '', url: o.link || '', source, snippet: o.snippet || '', date: o.date || '' });
    }
  }
  // Deduplicate by URL, keeping the first occurrence
  const seen = new Set();
  return all.filter((d) => { if (seen.has(d.url)) return false; seen.add(d.url); return true; });
}

const results = await discoverDatasets('air quality monitoring sensor data');
console.log(`Found ${results.length} datasets across ${new Set(results.map((d) => d.source)).size} platforms`);
for (const d of results.slice(0, 8)) console.log(` [${d.source}] ${d.title}: ${d.url}`);

Platforms Used
Web search with knowledge graph, People Also Ask (PAA), and AI overviews