Solution

Federated Dataset Discovery via MCP Search

Researchers and data scientists spend hours searching across Google Dataset Search, Kaggle, HuggingFace, and government portals to find relevant datasets. Each platform has its own

The Problem

Researchers and data scientists spend hours searching across Google Dataset Search, Kaggle, HuggingFace, and government portals to find relevant datasets. Each platform has its own search interface and no unified API. Results are fragmented and hard to compare.

The Scavio Solution

Use Scavio's MCP server or search API to query across platforms for datasets. Search Google for dataset portals, cross-reference with specific platform results, and aggregate findings into a unified discovery pipeline. One search covers what previously required 4 separate platform searches.

Before

Data scientist searches Kaggle, then Google Dataset Search, then HuggingFace, then data.gov. 3 hours to compile a list of 15 relevant datasets.

After

MCP-connected agent searches all sources through Scavio in 5 minutes. Returns 30 relevant datasets with metadata, links, and freshness dates.

Who It Is For

Data scientists and researchers who need to discover relevant datasets across multiple platforms without manually searching each one.

Key Benefits

  • One search covers multiple dataset platforms
  • MCP integration for agent-driven discovery
  • Structured results with dataset metadata
  • Cross-platform deduplication
  • $0.005/search replaces hours of manual browsing

Python Example

Python
import requests, os, json

API_KEY = os.environ["SCAVIO_API_KEY"]
H = {"x-api-key": API_KEY, "Content-Type": "application/json"}

DATASET_SOURCES = [
    "site:kaggle.com/datasets",
    "site:huggingface.co/datasets",
    "site:data.gov",
    "site:datasetsearch.research.google.com",
]

def discover_datasets(topic: str) -> list:
    """Federated dataset discovery across platforms."""
    all_datasets = []
    for source in DATASET_SOURCES:
        resp = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers=H,
            json={"query": f"{topic} {source}", "country_code": "us"},
            timeout=10,
        )
        data = resp.json()
        for r in data.get("organic_results", []):
            all_datasets.append({
                "title": r.get("title", ""),
                "url": r.get("link", ""),
                "source": source.split("site:")[1].split("/")[0],
                "snippet": r.get("snippet", ""),
            })
    # Deduplicate by URL
    seen = set()
    unique = []
    for d in all_datasets:
        if d["url"] not in seen:
            seen.add(d["url"])
            unique.append(d)
    return unique

datasets = discover_datasets("climate change temperature")
for d in datasets[:10]:
    print(f"[{d['source']}] {d['title']}: {d['url']}")

JavaScript Example

JavaScript
const H = {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'};
const SOURCES = ['site:kaggle.com/datasets','site:huggingface.co/datasets','site:data.gov','site:datasetsearch.research.google.com'];

async function discoverDatasets(topic) {
  const all = [];
  for (const source of SOURCES) {
    const r = await fetch('https://api.scavio.dev/api/v1/search', {method:'POST', headers:H, body:JSON.stringify({query:topic+' '+source, country_code:'us'})});
    const d = await r.json();
    for (const o of d.organic_results||[]) {
      all.push({title:o.title, url:o.link, source:source.split('site:')[1].split('/')[0], snippet:o.snippet});
    }
  }
  const seen = new Set();
  return all.filter(d=>{ if (seen.has(d.url)) return false; seen.add(d.url); return true; });
}

const datasets = await discoverDatasets('climate change temperature');
for (const d of datasets.slice(0,10)) console.log('['+d.source+'] '+d.title+': '+d.url);

Platforms Used

Google

Web search with knowledge graph, PAA, and AI overviews

Frequently Asked Questions

Researchers and data scientists spend hours searching across Google Dataset Search, Kaggle, HuggingFace, and government portals to find relevant datasets. Each platform has its own search interface and no unified API. Results are fragmented and hard to compare.

Use Scavio's MCP server or search API to query across platforms for datasets. Search Google for dataset portals, cross-reference with specific platform results, and aggregate findings into a unified discovery pipeline. One search covers what previously required 4 separate platform searches.

Data scientists and researchers who need to discover relevant datasets across multiple platforms without manually searching each one.

Yes. Scavio's free tier includes 250 credits per month with no credit card required. That is enough to validate this solution in your workflow.

Federated Dataset Discovery via MCP Search

Use Scavio's MCP server or search API to query across platforms for datasets. Search Google for dataset portals, cross-reference with specific platform results, and aggregate findi