Agent Error Handling Nobody Talks About

Agent tutorials skip error handling. Production agents need: search fallbacks, rate limit backoff, malformed response recovery, and budget guards.

The hardest part of building reliable AI agents is not the LLM integration or the prompt engineering; it is the error handling. Most agent demos show only the happy path. In production, as much as 60% of an agent's code can go to retry logic, graceful degradation, budget exhaustion handling, and partial result returns. Here are the patterns that actually work.

Retry with Exponential Backoff and Jitter

Simple retry loops cause thundering herd problems when multiple agent instances retry simultaneously. Add jitter to spread retries across time.

Python
import os
import random
import time
from functools import wraps

import requests

def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Retry fn on the exception types in retry_on, with exponential backoff plus jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise  # Out of retries: surface the last error
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, delay * 0.5)
                    time.sleep(delay + jitter)
                # Any other exception is treated as non-transient and propagates
        return wrapper
    return decorator

# requests raises its own exception classes, which do not inherit from the
# built-in ConnectionError/TimeoutError, so they are named explicitly here.
@retry_with_backoff(
    max_retries=3,
    retry_on=(requests.ConnectionError, requests.Timeout),
)
def call_search_api(query: str) -> dict:
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "num_results": 5},
        timeout=10,
    )
    resp.raise_for_status()  # HTTPError (e.g. 429, 500) is not retried here
    return resp.json()
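
To make the delay formula concrete, the sketch below recomputes the sleep range for the first six attempts using the decorator's defaults (base_delay=1.0, max_delay=30.0). `backoff_bounds` is a hypothetical helper written for illustration; it is not part of the decorator.

```python
from typing import Tuple

def backoff_bounds(attempt: int, base_delay: float = 1.0,
                   max_delay: float = 30.0) -> Tuple[float, float]:
    """Return the (min, max) sleep in seconds after a failed `attempt`."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    # Jitter is drawn from [0, delay * 0.5], so the sleep lands in [delay, 1.5 * delay].
    return delay, delay * 1.5

for attempt in range(6):
    low, high = backoff_bounds(attempt)
    print(f"attempt {attempt}: sleep between {low:.1f}s and {high:.1f}s")
```

Note that max_delay caps only the base delay. Because jitter is added afterward, the actual sleep can reach 1.5 times max_delay (45s with the defaults). If a hard ceiling matters, apply min() after adding the jitter instead.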

Graceful Degradation When Tools Fail

When a tool fails, the agent should not crash or hallucinate. It should return what it has and tell the user what it could not do.

Python
class ToolResult:
    def __init__(self, data=None, error=None, partial=False):
        self.data = data
        self.error = error
        self.partial = partial
        self.success = error is None

def search_with_fallback(query: str) -> ToolResult:
    """Try primary search, fall back to cached results."""
    try:
        data = call_search_api(query)
        return ToolResult(data=data)
    except Exception as e:
        # Live search failed: check the local cache before giving up.
        # get_cached_results is assumed to return None (or empty) on a miss.
        cached = get_cached_results(query)
        if cached:
            return ToolResult(
                data=cached,
                error=f"Live search failed ({e}), using cached results",
                partial=True,
            )
        return ToolResult(
            error=f"Search unavailable: {e}. Cannot answer this question.",
        )

Budget Exhaustion Handling

An agent that does not track its own spending will drain your API budget on a single runaway query. Build budget awareness into the agent loop.

Python
class AgentBudget:
    def __init__(self, max_credits: int):
        self.max_credits = max_credits
        self.used = 0
        self.log = []

    def can_spend(self, cost: int = 1) -> bool:
        return self.used + cost <= self.max_credits

    def spend(self, cost: int, description: str):
        self.used += cost
        self.log.append({"cost": cost, "description": description, "total": self.used})

    def remaining(self) -> int:
        return self.max_credits - self.used

def agent_loop(query: str, budget: AgentBudget):
    """Main agent loop with budget enforcement."""
    results = []
    steps = plan_steps(query)  # LLM plans what tools to call
    for step in steps:
        if not budget.can_spend(step.estimated_cost):
            results.append({
                "step": step.name,
                "status": "skipped",
                "reason": f"Budget exhausted ({budget.remaining()} credits left)",
            })
            continue
        try:
            # Charge before the call: a failed request may still consume credits.
            budget.spend(step.estimated_cost, step.name)
            output = execute_step(step)
            results.append({"step": step.name, "status": "ok", "data": output})
        except Exception as e:
            results.append({"step": step.name, "status": "failed", "error": str(e)})
    return synthesize_partial_results(query, results)

Partial Result Returns

The worst agent behavior is all-or-nothing: either it completes perfectly or it returns nothing. Good agents return whatever they have and explain what is missing.

Python
def synthesize_partial_results(query: str, results: list) -> dict:
    completed = [r for r in results if r["status"] == "ok"]
    failed = [r for r in results if r["status"] == "failed"]
    skipped = [r for r in results if r["status"] == "skipped"]
    answer = generate_answer(query, [r["data"] for r in completed])
    caveats = []
    if failed:
        caveats.append(
            f"Could not complete: {', '.join(r['step'] for r in failed)}"
        )
    if skipped:
        caveats.append(
            f"Skipped due to budget: {', '.join(r['step'] for r in skipped)}"
        )
    return {
        "answer": answer,
        "confidence": len(completed) / max(len(results), 1),
        "caveats": caveats,
        "steps_completed": len(completed),
        "steps_total": len(results),
    }

Error Messages the LLM Can Understand

When returning errors to the LLM in an agent loop, the error text matters. Vague errors like "request failed" cause the LLM to retry blindly. Specific errors with instructions produce better behavior.

  • Bad: "Error: 429." The LLM interprets this as "try again."
  • Good: "Rate limit hit. Next request allowed in 30s. Do not retry."
  • Bad: "Search returned no results." The LLM rephrases and retries 5 times.
  • Good: "No results found for this query. Simplify the query or answer from your training data."
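
One way to get messages like the "Good" examples is to translate raw failures into instructions before they reach the model. The sketch below is a minimal illustration; `llm_error_message` and the wording for the non-429 cases are assumptions, not any provider's API.

```python
from typing import Optional

def llm_error_message(status_code: int, retry_after: Optional[int] = None) -> str:
    """Turn an HTTP status into text that tells the model what to do next."""
    if status_code == 429:
        if retry_after is not None:
            return f"Rate limit hit. Next request allowed in {retry_after}s. Do not retry."
        return "Rate limit hit. Retry window unknown. Do not retry."
    if status_code in (401, 403):
        return "Search tool is not authorized. Do not retry. Tell the user search is unavailable."
    if 500 <= status_code < 600:
        return "Search service error. Do not retry. Answer from available context and say so."
    # Fallback: still state the code and give an explicit instruction
    return f"Search request failed (HTTP {status_code}). Do not retry with the same query."

print(llm_error_message(429, retry_after=30))
```

The point of the design is that every branch ends in an explicit instruction, so the model never has to guess whether retrying is appropriate.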

The Pattern Most Teams Skip

Log every tool call, every error, and every retry with a correlation ID tied to the user request. When something goes wrong in production (and it will), you need to reconstruct exactly what the agent tried, in what order, and where it failed. Without this, debugging agent failures is guesswork. The teams that ship reliable agents are the ones that built observability before they built features.
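
A minimal sketch of this with only the standard library: a `contextvars.ContextVar` holds the ID for the current request, and a `logging.Filter` stamps it onto every record. The names `correlation_id`, `CorrelationFilter`, and `handle_request` are hypothetical, and the three log calls are placeholders for real tool activity.

```python
import contextvars
import logging
import uuid

# Holds the ID of the user request currently being processed ("-" outside any request)
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(correlation_id)s %(levelname)s %(message)s")
)
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(query: str) -> str:
    """Assign one ID per user request; every log line emitted inside carries it."""
    request_id = uuid.uuid4().hex
    token = correlation_id.set(request_id)
    try:
        # Placeholder events standing in for real tool calls, retries, and failures
        logger.info("tool_call name=search query=%r", query)
        logger.warning("tool_retry name=search attempt=1 reason=timeout")
        logger.error("tool_failed name=search error=%s", "rate limited")
        return request_id
    finally:
        correlation_id.reset(token)  # Do not leak the ID into the next request
```

Filtering logs by the returned ID then yields the ordered sequence of calls, retries, and failures for a single user request.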