Workflow

Cross-Platform Dataset Discovery via MCP Workflow

Workflow that discovers datasets across Kaggle, HuggingFace, data.gov, and Google Dataset Search through a single MCP connection to Scavio.

Overview

Data scientists lose hours searching for datasets across fragmented platforms. This workflow uses Scavio's MCP server to search across Kaggle, HuggingFace, data.gov, and Google Dataset Search from a single agent connection. Results are deduplicated, scored by relevance and freshness, and formatted for notebook import.

Trigger

On-demand when a data scientist needs datasets for a new project.

Schedule

On-demand

Workflow Steps

1

Define Dataset Requirements

Specify the topic, required columns, minimum size, preferred format, and freshness requirements.

2

Search Across Platforms via MCP

Call Scavio MCP tools/call with site-specific queries for Kaggle, HuggingFace, data.gov, and Google Dataset Search.

3

Parse and Normalize Results

Extract dataset titles, URLs, descriptions, and source platforms from search results.

4

Deduplicate and Score

Remove duplicate datasets that appear on multiple platforms. Score by relevance to requirements.

5

Output Discovery Report

Format results as a markdown report or JSON file with download links and metadata.

Python Implementation

Python
import requests, os, json

API_KEY = os.environ["SCAVIO_API_KEY"]
H = {"x-api-key": API_KEY, "Content-Type": "application/json"}

PLATFORMS = {
    "kaggle": "site:kaggle.com/datasets",
    "huggingface": "site:huggingface.co/datasets",
    "data_gov": "site:data.gov",
    "google_datasets": "site:datasetsearch.research.google.com",
}

def discover_datasets(topic: str, platforms: list = None) -> list:
    targets = {k: v for k, v in PLATFORMS.items() if not platforms or k in platforms}
    all_datasets = []

    for source, site_filter in targets.items():
        resp = requests.post(
            "https://api.scavio.dev/api/v1/search",
            headers=H,
            json={"query": f"{topic} dataset {site_filter}", "country_code": "us"},
            timeout=10,
        )
        for r in resp.json().get("organic_results", []):
            all_datasets.append({
                "title": r.get("title", ""),
                "url": r.get("link", ""),
                "source": source,
                "snippet": r.get("snippet", ""),
                "date": r.get("date", ""),
            })

    # Deduplicate
    seen = set()
    unique = []
    for d in all_datasets:
        if d["url"] not in seen:
            seen.add(d["url"])
            unique.append(d)
    return unique

results = discover_datasets("air quality monitoring sensor data")
print(f"Found {len(results)} datasets across {len(set(d['source'] for d in results))} platforms")
for d in results[:8]:
    print(f"  [{d['source']}] {d['title']}: {d['url']}")

JavaScript Implementation

JavaScript
const H = {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'};
const PLATFORMS = {kaggle:'site:kaggle.com/datasets', huggingface:'site:huggingface.co/datasets', data_gov:'site:data.gov', google_datasets:'site:datasetsearch.research.google.com'};

async function discoverDatasets(topic, platforms) {
  const targets = platforms ? Object.fromEntries(Object.entries(PLATFORMS).filter(([k])=>platforms.includes(k))) : PLATFORMS;
  const all = [];
  for (const [source, siteFilter] of Object.entries(targets)) {
    const r = await fetch('https://api.scavio.dev/api/v1/search', {method:'POST', headers:H, body:JSON.stringify({query:topic+' dataset '+siteFilter, country_code:'us'})});
    for (const o of (await r.json()).organic_results||[]) {
      all.push({title:o.title||'', url:o.link||'', source, snippet:o.snippet||'', date:o.date||''});
    }
  }
  const seen = new Set();
  return all.filter(d=>{ if (seen.has(d.url)) return false; seen.add(d.url); return true; });
}

const results = await discoverDatasets('air quality monitoring sensor data');
console.log('Found '+results.length+' datasets across '+new Set(results.map(d=>d.source)).size+' platforms');
for (const d of results.slice(0,8)) console.log('  ['+d.source+'] '+d.title+': '+d.url);

Platforms Used

Google

Web search with knowledge graph, PAA, and AI overviews

Frequently Asked Questions

Data scientists lose hours searching for datasets across fragmented platforms. This workflow uses Scavio's MCP server to search across Kaggle, HuggingFace, data.gov, and Google Dataset Search from a single agent connection. Results are deduplicated, scored by relevance and freshness, and formatted for notebook import.

This workflow uses a on-demand when a data scientist needs datasets for a new project.. On-demand.

This workflow uses the following Scavio platforms: google. Each platform is called via the same unified API endpoint.

Yes. Scavio's free tier includes 250 credits per month with no credit card required. That is enough to test and validate this workflow before scaling it.

Cross-Platform Dataset Discovery via MCP Workflow

Workflow that discovers datasets across Kaggle, HuggingFace, data.gov, and Google Dataset Search through a single MCP connection to Scavio.