Evaluate MCP servers for data quality by running a standardized set of test queries and scoring the results on freshness, coverage, and factual accuracy. Most MCP server comparisons focus on latency and uptime but ignore the quality of the data returned, which directly impacts LLM output. This tutorial builds a scoring harness that tests a search MCP server against a curated set of queries with known-good answers, then produces a quality report. We use Scavio's MCP endpoint as the server under test.
Prerequisites
- Python 3.8+ installed
- requests library installed
- A Scavio API key from scavio.dev
- A set of test queries with known expected results
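Before touching the API, a short preflight sketch (standard library only; the helper name check_prereqs is ours, not part of Scavio) can confirm the two environment-level prerequisites:

import importlib.util
import os

def check_prereqs() -> list:
    """Return a list of human-readable problems; an empty list means ready to go."""
    problems = []
    if importlib.util.find_spec('requests') is None:
        problems.append('requests is not installed (pip install requests)')
    if not os.environ.get('SCAVIO_API_KEY'):
        problems.append('SCAVIO_API_KEY is not set in the environment')
    return problems

for problem in check_prereqs():
    print('MISSING:', problem)

Run this once before the walkthrough; if it prints nothing, you are set.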
Walkthrough
Step 1: Define the evaluation dataset
Create a list of test queries paired with expected attributes like minimum result count and required domains.
import os, requests

API_KEY = os.environ['SCAVIO_API_KEY']

# Each test case pairs a query with the attributes a good result set should have.
EVAL_SET = [
    {'query': 'python 3.13 release date', 'expected_domain': 'python.org', 'min_results': 3},
    {'query': 'react 19 new features', 'expected_domain': 'react.dev', 'min_results': 3},
    {'query': 'nvidia h200 price', 'expected_domain': 'nvidia.com', 'min_results': 2},
    {'query': 'fastapi latest version', 'expected_domain': 'fastapi.tiangolo.com', 'min_results': 3},
]

Step 2: Run queries and collect results
Send each evaluation query through the Scavio API and record the raw results for scoring.
def run_eval_query(test_case: dict) -> dict:
    """Send one evaluation query and bundle the raw results with the expectations needed for scoring."""
    resp = requests.post(
        'https://api.scavio.dev/api/v1/search',
        headers={'x-api-key': API_KEY},
        json={'platform': 'google', 'query': test_case['query']},
        timeout=15,
    )
    resp.raise_for_status()
    results = resp.json().get('organic_results', [])
    return {
        'query': test_case['query'],
        'results': results,
        'expected_domain': test_case['expected_domain'],
        'min_results': test_case['min_results'],
    }

Step 3: Score each response
Score each response on three dimensions: coverage (the result count meets the minimum), authority (the expected domain appears in the top results), and freshness (snippets mention current-year dates).
def score_response(eval_result: dict) -> dict:
    results = eval_result['results']
    # Coverage: full credit when the count meets the minimum, partial credit otherwise.
    coverage = 1.0 if len(results) >= eval_result['min_results'] else len(results) / eval_result['min_results']
    # Authority: does the expected domain appear among the top five links?
    domain_found = any(eval_result['expected_domain'] in r.get('link', '') for r in results[:5])
    authority = 1.0 if domain_found else 0.0
    # Freshness: count top-five snippets that mention a current year
    # (update these year strings as the calendar moves on).
    year_mentions = sum(1 for r in results[:5] if '2026' in r.get('snippet', '') or '2025' in r.get('snippet', ''))
    freshness = min(year_mentions / 3, 1.0)
    return {
        'query': eval_result['query'],
        'coverage': round(coverage, 2),
        'authority': authority,
        'freshness': round(freshness, 2),
        'composite': round((coverage + authority + freshness) / 3, 2),
    }

Step 4: Generate the quality report
Run the full evaluation and print a summary report with per-query scores and an aggregate quality score.
def run_evaluation():
    scores = []
    for test in EVAL_SET:
        result = run_eval_query(test)
        score = score_response(result)
        scores.append(score)
        print(f'{score["query"][:40]:<42} C={score["coverage"]} A={score["authority"]} F={score["freshness"]} => {score["composite"]}')
    avg = round(sum(s['composite'] for s in scores) / len(scores), 2)
    print(f'\nAggregate quality score: {avg}/1.00')
    return scores

run_evaluation()

Python Example
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

def eval_query(query, expected_domain):
    data = requests.post(
        'https://api.scavio.dev/api/v1/search',
        headers=H,
        json={'platform': 'google', 'query': query},
        timeout=15,
    ).json()
    results = data.get('organic_results', [])
    found = any(expected_domain in r.get('link', '') for r in results[:5])
    return {'query': query, 'count': len(results), 'authority': found}

print(eval_query('python 3.13 release date', 'python.org'))

JavaScript Example
const H = {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'};

async function evalQuery(query, expectedDomain) {
  const r = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST', headers: H, body: JSON.stringify({platform: 'google', query})
  });
  const results = (await r.json()).organic_results || [];
  const found = results.slice(0, 5).some(res => res.link?.includes(expectedDomain));
  return {query, count: results.length, authority: found};
}

evalQuery('python 3.13 release date', 'python.org').then(console.log);

Expected Output
A quality report scoring each MCP server test query on coverage, authority, and freshness, with an aggregate composite score out of 1.00.
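The Step 3 scoring logic can also be sanity-checked offline against hand-built mock results, with no API key or network call; the mock links and snippets below are invented purely for illustration:

# Offline sanity check for score_response using invented mock results.
def score_response(eval_result: dict) -> dict:
    results = eval_result['results']
    coverage = 1.0 if len(results) >= eval_result['min_results'] else len(results) / eval_result['min_results']
    domain_found = any(eval_result['expected_domain'] in r.get('link', '') for r in results[:5])
    authority = 1.0 if domain_found else 0.0
    year_mentions = sum(1 for r in results[:5] if '2026' in r.get('snippet', '') or '2025' in r.get('snippet', ''))
    freshness = min(year_mentions / 3, 1.0)
    return {
        'query': eval_result['query'],
        'coverage': round(coverage, 2),
        'authority': authority,
        'freshness': round(freshness, 2),
        'composite': round((coverage + authority + freshness) / 3, 2),
    }

# Three mock results: the expected domain appears once, and two snippets
# mention a current year, so freshness lands at 2/3.
mock = {
    'query': 'python 3.13 release date',
    'expected_domain': 'python.org',
    'min_results': 3,
    'results': [
        {'link': 'https://www.python.org/downloads/', 'snippet': 'Released in 2025'},
        {'link': 'https://example.com/blog', 'snippet': 'Python 3.13 recap, 2025'},
        {'link': 'https://example.org/news', 'snippet': 'No date mentioned here'},
    ],
}

print(score_response(mock))

Because the inputs are fixed, the scores are deterministic, which makes this a convenient regression test when you tune the scoring weights.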