Tutorial

How to Evaluate MCP Servers for Data Quality

Build an automated evaluation harness to compare MCP search servers on freshness, coverage, and accuracy. Step-by-step with Python code.

Evaluate MCP servers for data quality by running a standardized set of test queries and scoring the results on freshness, coverage, and factual accuracy. Most MCP server comparisons focus on latency and uptime but ignore the quality of the data returned, which is what actually shapes LLM output. This tutorial builds a scoring harness that runs a curated set of queries with known-good answers against a search server and produces a quality report. We use Scavio as the server under test, calling its REST search API directly; the same harness works for any MCP search server whose results you can fetch programmatically.

Prerequisites

  • Python 3.8+ installed
  • requests library installed
  • A Scavio API key from scavio.dev
  • A set of test queries with known expected results

Walkthrough

Step 1: Define the evaluation dataset

Create a list of test queries paired with expected attributes like minimum result count and required domains.

Python
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']

EVAL_SET = [
    {'query': 'python 3.13 release date', 'expected_domain': 'python.org', 'min_results': 3},
    {'query': 'react 19 new features', 'expected_domain': 'react.dev', 'min_results': 3},
    {'query': 'nvidia h200 price', 'expected_domain': 'nvidia.com', 'min_results': 2},
    {'query': 'fastapi latest version', 'expected_domain': 'fastapi.tiangolo.com', 'min_results': 3},
]
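
If your eval set grows beyond a handful of queries, it can be easier to keep it in a separate JSON file so test cases can be added without touching code. A minimal sketch, assuming a file named eval_set.json (the filename and helper are illustrative, not part of the tutorial's required flow):

Python
import json

def load_eval_set(path: str = 'eval_set.json') -> list:
    # Each entry should carry the same keys used above: 'query', 'expected_domain', 'min_results'
    with open(path) as f:
        return json.load(f)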

Step 2: Run queries and collect results

Send each evaluation query through the Scavio API and record the raw results for scoring.

Python
def run_eval_query(test_case: dict) -> dict:
    resp = requests.post('https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY},
        json={'platform': 'google', 'query': test_case['query']}, timeout=15)
    resp.raise_for_status()
    results = resp.json().get('organic_results', [])
    return {
        'query': test_case['query'],
        'results': results,
        'expected_domain': test_case['expected_domain'],
        'min_results': test_case['min_results'],
    }
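
Search requests occasionally fail for transient reasons (timeouts, rate limits), and a single failed request would unfairly zero out that query's scores. If you see flaky runs, a small retry wrapper like the sketch below can help; the retry count and backoff are arbitrary choices, not part of the Scavio API:

Python
import time

def run_eval_query_with_retry(test_case: dict, attempts: int = 3) -> dict:
    # Retry transient network failures so one flaky request does not distort the scores
    for attempt in range(attempts):
        try:
            return run_eval_query(test_case)
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(2 * (attempt + 1))  # simple linear backoff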

Step 3: Score each response

Score on three dimensions: coverage (the result count meets the minimum), authority (the expected domain appears in the top results), and freshness (snippets in the top results mention recent years).

Python
def score_response(eval_result: dict) -> dict:
    results = eval_result['results']
    coverage = 1.0 if len(results) >= eval_result['min_results'] else len(results) / eval_result['min_results']
    domain_found = any(eval_result['expected_domain'] in r.get('link', '') for r in results[:5])
    authority = 1.0 if domain_found else 0.0
    # Recent years are hardcoded here; update them to the current year when you run the eval
    year_mentions = sum(1 for r in results[:5] if '2026' in r.get('snippet', '') or '2025' in r.get('snippet', ''))
    freshness = min(year_mentions / 3, 1.0)
    return {
        'query': eval_result['query'],
        'coverage': round(coverage, 2),
        'authority': authority,
        'freshness': round(freshness, 2),
        'composite': round((coverage + authority + freshness) / 3, 2),
    }
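
The composite above weights all three dimensions equally. Depending on your use case (freshness matters more for news-style queries, authority more for reference lookups), you may prefer a weighted composite. A short sketch with illustrative weights:

Python
# Illustrative weights, not part of the tutorial's scoring; tune them to your own priorities.
WEIGHTS = {'coverage': 0.3, 'authority': 0.4, 'freshness': 0.3}

def weighted_composite(score: dict) -> float:
    # Expects the dict returned by score_response above
    return round(sum(score[dim] * w for dim, w in WEIGHTS.items()), 2)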

Step 4: Generate the quality report

Run the full evaluation and print a summary report with per-query scores and an aggregate quality score.

Python
def run_evaluation():
    scores = []
    for test in EVAL_SET:
        result = run_eval_query(test)
        score = score_response(result)
        scores.append(score)
        print(f'{score["query"][:40]:<42} C={score["coverage"]} A={score["authority"]} F={score["freshness"]} => {score["composite"]}')
    avg = round(sum(s['composite'] for s in scores) / len(scores), 2)
    print(f'\nAggregate quality score: {avg}/1.00')
    return scores

run_evaluation()
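
To actually compare MCP servers rather than score a single one, run the same eval set through each server and compare the aggregates. The sketch below assumes you wrap each server behind a callable with the same signature and return shape as run_eval_query; the second server in the commented example is a placeholder, not a real integration:

Python
def compare_servers(servers: dict) -> dict:
    # 'servers' maps a label to a callable with the same signature as run_eval_query
    aggregates = {}
    for name, query_fn in servers.items():
        scores = [score_response(query_fn(test)) for test in EVAL_SET]
        aggregates[name] = round(sum(s['composite'] for s in scores) / len(scores), 2)
    for name, avg in sorted(aggregates.items(), key=lambda kv: kv[1], reverse=True):
        print(f'{name:<20} {avg}/1.00')
    return aggregates

# Hypothetical second server: wrap its search call so it returns the same dict shape.
# compare_servers({'scavio': run_eval_query, 'other_server': run_other_server_query})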

Python Example

Python
import requests, os
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def eval_query(query, expected_domain):
    data = requests.post('https://api.scavio.dev/api/v1/search', headers=H,
        json={'platform': 'google', 'query': query}, timeout=15).json()
    results = data.get('organic_results', [])
    found = any(expected_domain in r.get('link', '') for r in results[:5])
    return {'query': query, 'count': len(results), 'authority': found}

print(eval_query('python 3.13 release date', 'python.org'))

JavaScript Example

JavaScript
const H = {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'};
async function evalQuery(query, expectedDomain) {
  const r = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: H, body: JSON.stringify({platform: 'google', query})
  });
  const results = (await r.json()).organic_results || [];
  const found = results.slice(0, 5).some(item => item.link?.includes(expectedDomain));
  return {query, count: results.length, authority: found};
}
evalQuery('python 3.13 release date', 'python.org').then(console.log);

Expected Output

A quality report scoring each test query on coverage, authority, and freshness, with an aggregate composite score out of 1.00.

Frequently Asked Questions

How long does this tutorial take?

Most developers complete this tutorial in 15 to 30 minutes. You will need a Scavio API key (free tier works) and a working Python or JavaScript environment.

What do I need before starting?

Python 3.8+, the requests library, a Scavio API key from scavio.dev, and a set of test queries with known expected results. A Scavio API key gives you 500 free credits per month.

Can I complete this on the free tier?

Yes. The free tier includes 500 credits per month, which is more than enough to complete this tutorial and prototype a working solution.

Does Scavio work with my framework?

Scavio has a native LangChain package (langchain-scavio), an MCP server, and a plain REST API that works with any HTTP client. This tutorial uses the raw REST API, but you can adapt it to your framework of choice.
