Finding the right dataset for a research or ML project requires searching across data portals, academic repositories, and government databases. This tutorial builds a dataset discovery agent using Mobus MCP for structured data catalog access and Scavio search for discovering datasets across the open web. The agent searches, evaluates metadata, and catalogs relevant datasets. Cost: $0.005 per search query.
Prerequisites
- Python 3.9+ installed
- Claude Code installed
- requests library installed
- A Scavio API key from scavio.dev
Walkthrough
Step 1: Set up search for dataset discovery
Build a search function optimized for finding datasets. Target specific data portals and repositories with site-specific queries.
import requests, os
# Scavio API key must be exported in the environment before running
# (raises KeyError immediately if it is missing).
SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
# Well-known dataset portals targeted by site-restricted searches below.
DATA_PORTALS = [
'data.gov', 'kaggle.com', 'huggingface.co/datasets',
'datasetsearch.research.google.com', 'zenodo.org',
'archive.ics.uci.edu', 'registry.opendata.aws'
]
def search_datasets(topic: str, portal: str = None) -> list:
    """Search Scavio for datasets about *topic*.

    Args:
        topic: Subject to search for (e.g. 'climate temperature').
        portal: Optional domain; when given, the query is restricted to that
            site via a `site:` operator.

    Returns:
        List of dicts with 'title', 'url', 'description', and 'portal'
        (the hostname extracted from the result URL).

    Raises:
        requests.HTTPError: If the API responds with an error status
            (e.g. bad key or exhausted quota).
    """
    query = f'site:{portal} {topic} dataset' if portal else f'{topic} dataset'
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
        json={'query': query, 'country_code': 'us', 'num_results': 10},
        timeout=30)  # don't hang indefinitely on a stalled connection
    resp.raise_for_status()  # surface auth/quota failures instead of silently returning []
    datasets = []
    for r in resp.json().get('organic_results', []):
        link = r['link']
        # For 'https://host/path' the hostname sits at index 2 of the split;
        # the old `'/' in link` check still crashed on short paths like 'a/b'.
        parts = link.split('/')
        datasets.append({
            'title': r['title'],
            'url': link,
            'description': r.get('snippet', ''),
            'portal': parts[2] if len(parts) > 2 else 'unknown',
        })
    return datasets
# Search across all portals
results = search_datasets('climate temperature')
print(f'Found {len(results)} datasets for climate temperature')
for r in results[:5]:
print(f' [{r["portal"]}] {r["title"][:50]}')Step 2: Build the dataset evaluator
Evaluate discovered datasets on metadata quality, format, size indicators, and license information extracted from search snippets.
import re
def evaluate_dataset(dataset: dict) -> dict:
    """Score a discovered dataset from its search-result metadata.

    Applies substring/regex heuristics to the combined title + snippet text:
    detected file formats, a size figure, license keywords, and the most
    recent year mentioned.

    Args:
        dataset: Dict with at least 'title', 'url', and 'description'.

    Returns:
        Dict with 'title' (truncated to 50 chars), 'url', 'formats', 'size',
        'license', and an integer 'score' in [0, 100].
    """
    text = (dataset['title'] + ' ' + dataset['description']).lower()
    # Format detection: simple substring checks against common data formats.
    formats = [fmt for fmt in ('csv', 'json', 'parquet', 'xlsx', 'geojson', 'netcdf', 'hdf5')
               if fmt in text]
    # Size indicator: first number followed by a storage or row-count unit.
    size_match = re.search(r'(\d+(?:\.\d+)?\s*(?:gb|mb|tb|rows|records|entries))', text)
    size = size_match.group(1) if size_match else 'unknown'
    # License keywords, checked most-specific first.
    licenses = [lic for lic in ('cc0', 'cc-by', 'mit', 'apache', 'public domain', 'open', 'creative commons')
                if lic in text]
    # Freshness: any 4-digit year 20xx. The old pattern 20(2[3-9]) silently
    # missed 2030 and later (and every pre-2023 year), zeroing freshness.
    years = [int(y) for y in re.findall(r'\b20\d{2}\b', text)]
    latest_year = max(years) if years else 0
    score = 0
    score += min(len(formats) * 15, 30)      # format variety, capped at 30
    score += 20 if size != 'unknown' else 0  # has size info
    score += 20 if licenses else 0           # has license info
    score += 30 if latest_year >= 2025 else 15 if latest_year >= 2023 else 0  # freshness
    return {
        'title': dataset['title'][:50],
        'url': dataset['url'],
        'formats': formats or ['unknown'],
        'size': size,
        'license': licenses[0] if licenses else 'check source',
        'score': score,
    }
evaluated = [evaluate_dataset(d) for d in results]
evaluated.sort(key=lambda x: -x['score'])
for d in evaluated[:5]:
print(f' [{d["score"]:3d}] {d["title"]} ({d["formats"][0]}, {d["size"]})')Step 3: Run multi-portal discovery pipeline
Search across multiple data portals for a given topic and compile a ranked catalog of datasets.
import time
def discover_datasets(topic: str, portals: list = None) -> list:
    """Search the open web plus selected data portals for *topic* and print
    a ranked catalog of the unique datasets found.

    Args:
        topic: Subject to search for.
        portals: Domains to search individually; defaults to the first four
            entries of DATA_PORTALS to limit query spend.

    Returns:
        Evaluated dataset dicts, sorted by descending score.
    """
    portals = portals or DATA_PORTALS[:4]  # limit portal fan-out to save credits
    collected = []
    seen_urls = set()

    def _absorb(batch):
        # Deduplicate by URL while preserving discovery order.
        for entry in batch:
            if entry['url'] not in seen_urls:
                seen_urls.add(entry['url'])
                collected.append(entry)

    _absorb(search_datasets(topic))        # general web pass first
    for portal in portals:                 # then one scoped pass per portal
        _absorb(search_datasets(topic, portal))
        time.sleep(0.3)                    # gentle pacing between API calls

    # Score every unique hit and rank best-first.
    ranked = sorted((evaluate_dataset(d) for d in collected),
                    key=lambda e: e['score'], reverse=True)
    total_queries = 1 + len(portals)
    print(f'Dataset Discovery Report: {topic}')
    print(f'Searched: {total_queries} queries (${total_queries * 0.005:.3f})')
    print(f'Found: {len(ranked)} unique datasets\n')
    for i, d in enumerate(ranked[:10], 1):
        print(f'{i:2}. [{d["score"]:3d}] {d["title"]}')
        print(f' Format: {d["formats"][0]} | Size: {d["size"]} | License: {d["license"]}')
        print(f' URL: {d["url"]}')
    return ranked
catalog = discover_datasets('global temperature anomaly')Python Example
import requests, os, time
# Reads the Scavio key from the environment (raises KeyError if unset).
SCAVIO_KEY = os.environ['SCAVIO_API_KEY']
def find_datasets(topic):
datasets = []
for query in [f'{topic} dataset', f'site:kaggle.com {topic}', f'site:huggingface.co {topic} dataset']:
resp = requests.post('https://api.scavio.dev/api/v1/search',
headers={'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json'},
json={'query': query, 'country_code': 'us', 'num_results': 5})
for r in resp.json().get('organic_results', []):
datasets.append({'title': r['title'][:50], 'url': r['link']})
time.sleep(0.3)
seen = set()
unique = [d for d in datasets if d['url'] not in seen and not seen.add(d['url'])]
for d in unique[:5]:
print(f'{d["title"]}: {d["url"]}')
return unique
find_datasets('sentiment analysis')JavaScript Example
const SCAVIO_KEY = process.env.SCAVIO_API_KEY;
async function findDatasets(topic) {
const queries = [`${topic} dataset`, `site:kaggle.com ${topic}`, `site:huggingface.co ${topic} dataset`];
const datasets = [];
for (const q of queries) {
const resp = await fetch('https://api.scavio.dev/api/v1/search', {
method: 'POST',
headers: { 'x-api-key': SCAVIO_KEY, 'Content-Type': 'application/json' },
body: JSON.stringify({ query: q, country_code: 'us', num_results: 5 })
});
for (const r of (await resp.json()).organic_results || []) {
datasets.push({ title: r.title.slice(0, 50), url: r.link });
}
}
const seen = new Set();
return datasets.filter(d => !seen.has(d.url) && seen.add(d.url)).slice(0, 10);
}
findDatasets('sentiment analysis').then(d => d.forEach(x => console.log(x.title)));Expected Output
Found 10 datasets for climate temperature
[kaggle.com] Climate Change: Earth Surface Temperature Data
[data.gov] Global Historical Climatology Network Daily
[huggingface.co] Global Temperature Anomaly Dataset 2026
Dataset Discovery Report: global temperature anomaly
Searched: 5 queries ($0.025)
Found: 18 unique datasets
1. [ 80] Global Temperature Anomaly Dataset 2026
Format: csv | Size: 2.3 gb | License: cc-by
2. [ 65] NOAA Global Temperature Time Series
Format: csv | Size: 450 mb | License: public domain