MCP Server Health Monitoring for Production Agents
Production MCP servers need observability. Build a health check that monitors every tool, alerts on degradation, and informs routing decisions.
Production AI agents using MCP servers have no built-in observability. When search quality degrades, there is no way to tell whether the MCP server is down, the upstream provider is rate-limited, or the agent is making poor tool selections. Teams discover problems only when users complain about bad answers. That is too late.
What to monitor
Four metrics per MCP tool: success rate, latency, result count, and error type. Set baselines for each. Google search typically returns in 1-2 seconds with 10+ results. Reddit search may be slower with fewer results. YouTube search has different characteristics still. Each tool needs its own baseline, not a single threshold applied globally.
The health check pattern
import requests, os, time, json
H = {'x-api-key': os.environ['SCAVIO_API_KEY']}
PLATFORMS = ['google', 'reddit', 'youtube', 'amazon', 'walmart']
def health_check():
report = {'ts': time.strftime('%Y-%m-%dT%H:%M:%SZ')}
for p in PLATFORMS:
start = time.time()
try:
resp = requests.post('https://api.scavio.dev/api/v1/search',
headers=H, json={'platform': p, 'query': 'test'}, timeout=15)
report[p] = {
'ok': True,
'latency': round(time.time() - start, 2),
'results': len(resp.json().get('organic', [])),
'status': resp.status_code
}
except Exception as e:
report[p] = {'ok': False, 'error': str(e)}
return reportRouting informed by health data
Health data is not just for alerting. Feed it back into routing decisions. If YouTube search latency spikes above 5 seconds, the agent can temporarily prefer Google search for video queries. If Amazon search returns zero results for a period, route product queries to Walmart. Health-aware routing turns monitoring from passive observability into active resilience.
Run every 5 minutes
A 5-minute health check cadence catches most issues before users notice. Log to JSONL for historical analysis. Alert to Slack for immediate visibility. The 5 API calls per check cost $0.025. That is $7.20 per month for comprehensive MCP health monitoring. Cheaper than a single customer escalation caused by undetected degradation.