Agent Error Handling Nobody Talks About
Agent tutorials skip error handling. Production agents need: search fallbacks, rate limit backoff, malformed response recovery, and budget guards.
The hardest part of building reliable AI agents is not the LLM integration or the prompt engineering. It is the error handling. Most agent demos show only the happy path. Production agents spend 60% of their code on retry logic, graceful degradation, budget exhaustion, and partial results. Here are the patterns that actually work.
Retry with Exponential Backoff and Jitter
Simple retry loops cause thundering herd problems when multiple agent instances retry simultaneously. Add jitter to spread retries across time.
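import os
import random
import time
from functools import wraps

import requests

def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Retry fn on transient errors with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise  # Out of retries: surface the last error
                    # Double the delay on each attempt, capped at max_delay
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    # Up to 50% extra random wait so instances do not retry in lockstep
                    jitter = random.uniform(0, delay * 0.5)
                    time.sleep(delay + jitter)
                # Any other exception is non-transient and propagates without a retry
        return wrapper
    return decorator

# requests raises its own ConnectionError and Timeout classes, which do not
# inherit from the built-in ones, so they are passed in explicitly.
@retry_with_backoff(
    max_retries=3,
    retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout),
)
def call_search_api(query: str) -> dict:
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": 5},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

Graceful Degradation When Tools Fail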
When a tool fails, the agent should not crash or hallucinate. It should return what it has and tell the user what it could not do.
class ToolResult:
    """Outcome of a tool call: data, an optional error, and a partial flag."""
    def __init__(self, data=None, error=None, partial=False):
        self.data = data
        self.error = error
        self.partial = partial
        self.success = error is None

def search_with_fallback(query: str) -> ToolResult:
    """Try primary search, fall back to cached results."""
    try:
        data = call_search_api(query)
        return ToolResult(data=data)
    except Exception as e:
        # Check local cache before giving up
        # (get_cached_results is your own cache lookup, not shown here)
        cached = get_cached_results(query)
        if cached:
            # success is False because error is set; partial=True signals
            # that the data is still usable
            return ToolResult(
                data=cached,
                error=f"Live search failed ({e}), using cached results",
                partial=True,
            )
        return ToolResult(
            error=f"Search unavailable: {e}. Cannot answer this question.",
        )

Budget Exhaustion Handling
An agent that does not track its own spending will drain your API budget on a single runaway query. Build budget awareness into the agent loop.
class AgentBudget:
    def __init__(self, max_credits: int):
        self.max_credits = max_credits
        self.used = 0
        self.log = []

    def can_spend(self, cost: int = 1) -> bool:
        return self.used + cost <= self.max_credits

    def spend(self, cost: int, description: str):
        self.used += cost
        self.log.append({"cost": cost, "description": description, "total": self.used})

    def remaining(self) -> int:
        return self.max_credits - self.used

def agent_loop(query: str, budget: AgentBudget):
    """Main agent loop with budget enforcement."""
    results = []
    # plan_steps and execute_step are your own planner and tool runner, not shown here
    steps = plan_steps(query)  # LLM plans what tools to call
    for step in steps:
        if not budget.can_spend(step.estimated_cost):
            # Record the skip instead of aborting so later, cheaper steps can still run
            results.append({
                "step": step.name,
                "status": "skipped",
                "reason": f"Budget exhausted ({budget.remaining()} credits left)",
            })
            continue
        try:
            output = execute_step(step)
            # Credits are charged only when the step succeeds
            budget.spend(step.estimated_cost, step.name)
            results.append({"step": step.name, "status": "ok", "data": output})
        except Exception as e:
            results.append({"step": step.name, "status": "failed", "error": str(e)})
    return synthesize_partial_results(query, results)

Partial Result Returns
The worst agent behavior is all-or-nothing: either it completes perfectly or it returns nothing. Good agents return whatever they have and explain what is missing.
def synthesize_partial_results(query: str, results: list) -> dict:
    # Split step results by status
    completed = [r for r in results if r["status"] == "ok"]
    failed = [r for r in results if r["status"] == "failed"]
    skipped = [r for r in results if r["status"] == "skipped"]
    # generate_answer is your own LLM call, not shown here; it sees only completed steps
    answer = generate_answer(query, [r["data"] for r in completed])
    caveats = []
    if failed:
        caveats.append(
            f"Could not complete: {', '.join(r['step'] for r in failed)}"
        )
    if skipped:
        caveats.append(
            f"Skipped due to budget: {', '.join(r['step'] for r in skipped)}"
        )
    return {
        "answer": answer,
        # Fraction of planned steps that completed; max() avoids dividing by zero
        "confidence": len(completed) / max(len(results), 1),
        "caveats": caveats,
        "steps_completed": len(completed),
        "steps_total": len(results),
    }

Error Messages the LLM Can Understand
When returning errors to the LLM in an agent loop, the error text matters. Vague errors like "request failed" cause the LLM to retry blindly. Specific errors with instructions produce better behavior.
- Bad: "Error: 429". The LLM interprets this as "try again."
- Good: "Rate limit hit. Next request allowed in 30s. Do not retry."
- Bad: "Search returned no results." The LLM rephrases and retries 5 times.
- Good: "No results found for this query. Simplify the query or answer from your training data."
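One way to produce messages like the good examples above is to map raw failures to instructions in a single place. The sketch below is only an illustration: the function names, the retry_after parameter, and the wording for the non-429 cases are assumptions, not part of any particular API.

```python
from typing import Optional

def describe_http_error(status_code: int, retry_after: Optional[int] = None) -> str:
    """Turn an HTTP status into an instruction the model can act on (hypothetical helper)."""
    if status_code == 429:
        # Include the wait time only when the caller knows it
        wait = f" Next request allowed in {retry_after}s." if retry_after is not None else ""
        return f"Rate limit hit.{wait} Do not retry."
    if status_code in (401, 403):
        return "Authentication failed for this tool. Do not retry. Tell the user the tool is unavailable."
    if 500 <= status_code < 600:
        return "The service had a server error. Retry at most once, then answer with what you have."
    return f"Request failed with status {status_code}. Do not retry."

def describe_empty_results() -> str:
    """Message for a search that succeeded but found nothing (hypothetical helper)."""
    return ("No results found for this query. "
            "Simplify the query or answer from your training data.")
```

Keeping the wording in one module makes it easy to adjust when the agent's transcripts show the model still retrying when it should not.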
The Pattern Most Teams Skip
Log every tool call, every error, and every retry with a correlation ID tied to the user request. When something goes wrong in production (and it will), you need to reconstruct exactly what the agent tried, in what order, and where it failed. Without this, debugging agent failures is guesswork. The teams that ship reliable agents are the ones that built observability before they built features.
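A minimal sketch of that idea, using only the standard library: each user request gets one correlation ID, and every tool call, success, and error is logged as a JSON line carrying it. The helper names (new_correlation_id, log_tool_event, traced_call) and the record fields are assumptions chosen for illustration.

```python
import json
import logging
import uuid

logger = logging.getLogger("agent")

def new_correlation_id() -> str:
    """Create one ID per user request (hypothetical helper)."""
    return uuid.uuid4().hex

def log_tool_event(correlation_id: str, tool: str, event: str, **fields) -> dict:
    """Log one structured event and return the record that was logged."""
    record = {"correlation_id": correlation_id, "tool": tool, "event": event, **fields}
    # default=str keeps logging from failing on values JSON cannot encode
    logger.info(json.dumps(record, default=str))
    return record

def traced_call(correlation_id: str, tool: str, fn, *args, **kwargs):
    """Run fn and log the call, then its success or error, under the same ID."""
    log_tool_event(correlation_id, tool, "call", args=args)
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        log_tool_event(correlation_id, tool, "error", error=repr(exc))
        raise  # Logging must not swallow the failure
    log_tool_event(correlation_id, tool, "ok")
    return result
```

Filtering the logs on a single correlation_id then gives the ordered list of calls and errors for that request.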