AI Agent Reliability: When Tools Hit Walls
Hermes and Pi users hitting walls frequently. Common causes: rate limits, stale auth, timeouts. Solutions: health checks, fallbacks, retry patterns.
AI agents fail silently when their tools hit walls. Rate limits return empty responses, stale auth tokens produce cryptic errors, endpoint changes break integrations without warning, and timeouts leave agents stuck in retry loops. The fix is systematic: health checks before execution, graceful degradation when tools fail, fallback providers for critical capabilities, and exponential backoff with circuit breakers.
Common wall types
Hermes users report agents "hitting walls" when tools fail. Pi Coding Agent users see tools fail silently. The causes cluster into four categories. Rate limits: API returns 429, agent retries immediately, burns through budget. Stale auth: token expired, agent gets 401 but interprets it as "no results found." Endpoint changes: API updated its schema, agent sends old format, gets 400. Timeouts: slow API, agent waits forever or gives up too early.
Health checks before execution
import requests, time, os
from functools import wraps
class ToolHealth:
def __init__(self):
self.status = {}
self.last_check = {}
def check(self, tool_name: str, check_fn, interval: int = 300) -> bool:
"""Run health check if stale. Cache result for interval seconds."""
now = time.time()
if now - self.last_check.get(tool_name, 0) < interval:
return self.status.get(tool_name, False)
try:
result = check_fn()
self.status[tool_name] = result
except Exception:
self.status[tool_name] = False
self.last_check[tool_name] = now
return self.status[tool_name]
health = ToolHealth()
def search_health_check() -> bool:
resp = requests.post("https://api.scavio.dev/api/v1/search",
headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
json={"platform": "google", "query": "test"},
timeout=5)
return resp.status_code == 200Graceful degradation pattern
class ToolWithFallback:
def __init__(self, primary_fn, fallback_fn=None, max_retries=3):
self.primary = primary_fn
self.fallback = fallback_fn
self.max_retries = max_retries
self.failures = 0
self.circuit_open = False
self.circuit_open_until = 0
def call(self, *args, **kwargs):
# Circuit breaker: skip primary if recently failed too much
if self.circuit_open and time.time() < self.circuit_open_until:
if self.fallback:
return self.fallback(*args, **kwargs)
return {"error": "circuit_open", "message": "Tool temporarily unavailable"}
# Try primary with exponential backoff
for attempt in range(self.max_retries):
try:
result = self.primary(*args, **kwargs)
self.failures = 0
self.circuit_open = False
return result
except requests.exceptions.HTTPError as e:
if e.response.status_code == 429:
wait = 2 ** attempt
time.sleep(wait)
elif e.response.status_code == 401:
return {"error": "auth_failed", "message": "Credential refresh needed"}
else:
break
except requests.exceptions.Timeout:
continue
# Primary failed, open circuit
self.failures += 1
if self.failures >= 3:
self.circuit_open = True
self.circuit_open_until = time.time() + 60
# Try fallback
if self.fallback:
return self.fallback(*args, **kwargs)
return {"error": "tool_failed", "attempts": self.max_retries}Fallback provider pattern
For critical capabilities like web search, configure multiple providers. If the primary search API is rate-limited, fall back to a secondary. The agent should not know or care which provider served the result. This is transparent at the tool layer.
def search_primary(query: str) -> dict:
resp = requests.post("https://api.scavio.dev/api/v1/search",
headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
json={"platform": "google", "query": query}, timeout=10)
resp.raise_for_status()
return resp.json()
def search_fallback(query: str) -> dict:
"""Fallback: use cached results or return empty with explanation."""
cached = load_cache(query)
if cached and (time.time() - cached["ts"]) < 86400:
return {**cached["data"], "_source": "cache", "_age_hours": (time.time() - cached["ts"]) / 3600}
return {"organic": [], "_source": "fallback", "_note": "No results available, try later"}
search_tool = ToolWithFallback(search_primary, search_fallback, max_retries=3)Observability for agent tools
Log every tool call with: timestamp, tool name, latency, status code, success/failure. Review weekly. Common patterns that indicate problems: increasing latency (endpoint degradation), burst failures at specific times (rate limit windows), 401 errors after token expiry schedules. Fix these proactively before your agents start hallucinating because their tools silently return empty results.
Architecture summary
- Layer 1: Health checks run every 5 minutes per tool
- Layer 2: Circuit breakers prevent cascading failures
- Layer 3: Exponential backoff handles transient issues
- Layer 4: Fallback providers maintain capability
- Layer 5: Observability logs catch degradation trends