
Mobus MCP: Multi-Platform Dataset Discovery

Open-source MCP server for discovering datasets across Kaggle, Hugging Face, and data.gov. Review of strengths and limitations.


Mobus is an open-source MCP server that lets LLMs discover and query datasets across multiple platforms: Kaggle, Hugging Face, government open data portals, and academic repositories. Instead of searching for datasets manually, your agent can find and evaluate datasets conversationally.

What Mobus does

Mobus exposes three MCP tools to your LLM:

  • search_datasets: find datasets by keyword across Kaggle, Hugging Face, and data.gov
  • get_dataset_info: retrieve metadata, schema, size, and license for a specific dataset
  • preview_dataset: return the first N rows of a dataset for evaluation
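
MCP clients invoke tools like these with a JSON-RPC 2.0 `tools/call` request. As a rough sketch of what a call to search_datasets might look like on the wire, the snippet below builds such a request. The argument names (`query`, `limit`) are assumptions for illustration; the real input schema is whatever Mobus advertises through `tools/list`.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical arguments: check the tool's published schema for the real names.
request = tool_call("search_datasets", {"query": "air quality California", "limit": 5})
print(json.dumps(request, indent=2))
```

In practice your MCP client builds and sends this message for you; the sketch only shows what the LLM's tool call turns into.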

Setting up Mobus

Bash
# Clone and install
git clone https://github.com/mobus-ai/mobus-mcp.git
cd mobus-mcp
pip install -r requirements.txt

# Set API keys for data platforms (optional, enhances results)
export KAGGLE_USERNAME=your_username
export KAGGLE_KEY=your_key
export HF_TOKEN=your_huggingface_token

# Run the MCP server
python -m mobus_mcp

Configuring with Claude or other MCP clients

Add Mobus to your client's MCP configuration. The example also registers a second, remote server named "search" (a web search MCP). That entry is optional and only matters if you want to pair Mobus with web search; omit it to run Mobus alone.

JSON
{
  "mcpServers": {
    "mobus": {
      "command": "python",
      "args": ["-m", "mobus_mcp"],
      "env": {
        "KAGGLE_USERNAME": "your_username",
        "KAGGLE_KEY": "your_key"
      }
    },
    "search": {
      "url": "https://mcp.scavio.dev/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_API_KEY"
      }
    }
  }
}
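
The two entries in this configuration are of different kinds: "mobus" is launched as a local process from a command, while "search" points at a remote URL. A small sketch that loads the same configuration and labels each entry accordingly can serve as a sanity check before you hand the file to a client. The classification rule (a `command` key means local, a `url` key means remote) is an assumption based on this example; the keys a given client accepts may differ.

```python
import json

# Same shape as the configuration above, with placeholder credentials.
CONFIG = """
{
  "mcpServers": {
    "mobus": {
      "command": "python",
      "args": ["-m", "mobus_mcp"],
      "env": {"KAGGLE_USERNAME": "your_username", "KAGGLE_KEY": "your_key"}
    },
    "search": {
      "url": "https://mcp.scavio.dev/mcp",
      "headers": {"Authorization": "Bearer YOUR_API_KEY"}
    }
  }
}
"""


def classify_servers(config: dict) -> dict:
    """Label each server entry by how the client would reach it."""
    kinds = {}
    for name, entry in config.get("mcpServers", {}).items():
        if "command" in entry:
            kinds[name] = "local process"
        elif "url" in entry:
            kinds[name] = "remote URL"
        else:
            kinds[name] = "unknown"
    return kinds


kinds = classify_servers(json.loads(CONFIG))
print(kinds)
```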

Real use cases

Mobus is most useful for data scientists and ML engineers who frequently search for training data or benchmark datasets:

  • "Find me a sentiment analysis dataset with at least 100K samples"
  • "What open datasets exist for real estate price prediction?"
  • "Show me the schema of the NYC taxi trips dataset on Kaggle"
  • "Find government open data on air quality in California"

Strengths

  • Multi-platform search: Kaggle, Hugging Face, and open data portals in one interface
  • Schema inspection: see column names and types before downloading
  • License filtering: restrict results to datasets licensed for commercial use
  • Open source: self-hosted, no recurring costs beyond compute
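
To make the license and size checks concrete, here is a minimal sketch of the kind of filtering an agent could apply to search results. The record fields (`name`, `license`, `num_rows`), the sample records, and the license allowlist are all hypothetical; Mobus's actual response format may look different.

```python
# Illustrative allowlist of licenses that permit commercial use.
COMMERCIAL_OK = {"cc0-1.0", "cc-by-4.0", "mit", "apache-2.0"}


def filter_datasets(results, min_rows=0, allowed=COMMERCIAL_OK):
    """Keep results with an allowed license and at least `min_rows` rows."""
    return [
        r for r in results
        if r.get("license", "").lower() in allowed
        and r.get("num_rows", 0) >= min_rows
    ]


# Placeholder records standing in for search results (not real datasets).
results = [
    {"name": "reviews-large", "license": "CC0-1.0", "num_rows": 250_000},
    {"name": "reviews-small", "license": "CC0-1.0", "num_rows": 20_000},
    {"name": "reviews-nc", "license": "CC-BY-NC-4.0", "num_rows": 500_000},
]

kept = filter_datasets(results, min_rows=100_000)
print([r["name"] for r in kept])
```

Records with a missing license are dropped here rather than kept, which is the safer default when commercial use matters.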

Limitations

  • Coverage gaps: not every data platform is indexed (AWS Open Data and BigQuery public datasets are missing, for example)
  • Preview limits: previews of large datasets are capped at the first 100 to 1,000 rows
  • No data quality scoring: you still need to evaluate dataset quality manually
  • Slow for large searches: the Kaggle and Hugging Face APIs are queried sequentially rather than in parallel
  • No local caching: repeating a query hits the external APIs every time

Combining Mobus with web search

Mobus finds formal datasets on data platforms. For informal data and context around datasets, pair it with a web search MCP. Your agent can find a dataset on Kaggle via Mobus, then search for usage examples and known issues via web search.

Python
# Example agent workflow:
# 1. Agent uses Mobus to find a dataset
# 2. Agent uses web search to find usage examples and known issues
# 3. Agent evaluates and recommends the best option

# The web search complements Mobus with:
# - Blog posts about dataset quality issues
# - Academic papers citing the dataset
# - GitHub repos using the dataset
# - Reddit discussions about limitations
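
The three steps outlined in the comments can be sketched as runnable code. Everything here is a placeholder: `find_datasets` stands in for a Mobus search, `find_known_issues` for a web search, and the canned data exists only so the example runs. The ranking rule (fewest known issues wins) is likewise just one simple assumption about how an agent might choose.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    license: str
    known_issues: list = field(default_factory=list)


def find_datasets(query):
    """Step 1: placeholder for a Mobus search_datasets call."""
    return [Candidate("dataset-a", "CC0"), Candidate("dataset-b", "unknown")]


def find_known_issues(name):
    """Step 2: placeholder for a web search about a dataset."""
    canned = {"dataset-b": ["duplicate rows reported"]}
    return canned.get(name, [])


def recommend(query):
    """Step 3: pick the candidate with the fewest known issues."""
    candidates = find_datasets(query)
    for candidate in candidates:
        candidate.known_issues = find_known_issues(candidate.name)
    return min(candidates, key=lambda c: len(c.known_issues))


best = recommend("sentiment analysis")
print(best.name, best.known_issues)
```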

Bottom line

Mobus fills a genuine gap: dataset discovery via MCP. It is most valuable for teams that frequently search for training data across multiple platforms. Setup is straightforward, and the tool is free to self-host. Pair it with a web search MCP to cover both dataset discovery and the background research around each dataset.