Agent Retry Storms Are Coming for Your API Rate Limits
Why agent retry storms happen and how to design rate-limit-friendly architectures that prevent cascading failures.
You ship an agent that calls an external API. It works in testing. Then in production, one slow response triggers a retry, the retry triggers another retry, and suddenly your agent is hammering the API with hundreds of requests per second. You have created a retry storm.
Retry storms are one of the most common failure modes in agent systems, and they are almost always preventable with the right architecture.
How Retry Storms Start
Most agent frameworks retry failed tool calls automatically. When an API returns a 429 (rate limit) or 503 (service unavailable), the agent tries again. The problem is that naive retries make rate limiting worse, not better.
Here is the typical sequence:
- Agent sends request, gets 429 back
- Agent retries immediately
- API is still rate-limited, returns another 429
- Agent retries again, possibly with a new tool call variation
- LLM decides to "try a different approach" -- sends even more requests
- The API bans your key or the agent exhausts your credit balance
The LLM layer makes this worse because the model interprets failures as signals to try harder, not to back off. Unlike traditional retry logic, an LLM might generate entirely new requests in response to a rate limit error, multiplying the problem.
Exponential Backoff Is Not Enough
The standard advice is "use exponential backoff." That helps, but it does not solve the problem for agents. Exponential backoff works when you have a single retry loop. Agents have multiple layers that can each independently decide to retry.
```python
# This looks correct but fails in practice:
# the agent framework may add its own retry on top.
import time

class RateLimitError(Exception):
    """Placeholder for whatever exception your API client raises on 429."""

def call_with_backoff(fn, max_retries=3):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise Exception("Max retries exceeded")
```

If your tool wrapper retries 3 times, the agent framework retries failed tools 3 times, and the LLM generates 2 alternative tool calls, you get 3 x 3 x 2 = 18 attempts from what the user sees as a single action.
Design Patterns That Prevent Storms
The fix is to centralize rate-limit awareness instead of scattering retry logic across layers.
- Single retry boundary: Pick one layer to handle retries and disable retries everywhere else. The tool wrapper is usually the right place.
- Rate limit budget: Track remaining requests globally. Before calling any tool, check the budget. If it is exhausted, return an error to the LLM that says "rate limit reached, do not retry" rather than a generic failure.
- Circuit breaker: After N consecutive failures, stop all outbound requests for a cooldown period. This prevents cascade failures across multiple tools hitting the same API.
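The circuit breaker from the list above can be sketched as a small stateful class. This is a minimal illustration, not a production implementation; the threshold and cooldown values are arbitrary defaults:

```python
import time

class CircuitBreaker:
    """Blocks all outbound requests after N consecutive failures, for a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and allow traffic again
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()
```

Every tool that talks to the same API should share one breaker instance; per-tool breakers defeat the purpose, because the cascade comes from multiple tools hitting the same upstream limit.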
```python
import time

class RateLimitBudget:
    """Sliding-window request budget shared by all tools hitting one API."""

    def __init__(self, max_per_minute: int):
        self.max_per_minute = max_per_minute
        self.calls: list[float] = []

    def can_call(self) -> bool:
        now = time.time()
        # Drop timestamps older than the 60-second window
        self.calls = [t for t in self.calls if now - t < 60]
        return len(self.calls) < self.max_per_minute

    def record_call(self):
        self.calls.append(time.time())
```

Tell the LLM About Rate Limits
A critical step most developers skip: make the error message LLM-friendly. When you return a rate limit error to the agent, include explicit instructions in the error text.
```json
{
  "error": "Rate limit reached. Wait 30 seconds before retrying. Do not attempt alternative queries.",
  "retry_after_seconds": 30,
  "remaining_budget": 0
}
```

LLMs follow instructions in error messages. A message that says "rate limit exceeded" gets interpreted as "try something else." A message that says "stop and wait 30 seconds" actually works.
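In the tool wrapper, a payload like this can be built from the current budget state. A sketch, assuming the `rate_limit_error` helper name (illustrative, not from any framework):

```python
import json

def rate_limit_error(retry_after_seconds: int, remaining_budget: int) -> str:
    """Build an LLM-friendly rate limit error with explicit stop instructions."""
    return json.dumps({
        "error": (
            f"Rate limit reached. Wait {retry_after_seconds} seconds before "
            "retrying. Do not attempt alternative queries."
        ),
        "retry_after_seconds": retry_after_seconds,
        "remaining_budget": remaining_budget,
    })
```

Returning this string as the tool result, instead of raising an exception, keeps the instruction visible in the model's context.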
Monitor Before It Becomes a Problem
Retry storms are easiest to catch with simple metrics:
- Track requests-per-minute per API key
- Alert when 429 responses exceed 10% of total requests
- Log the full retry chain so you can see which layer is retrying
- Set a hard ceiling on total API calls per agent session
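The 429-ratio alert from the list above needs only a counter. A minimal sketch, with the class name and 10% default chosen for illustration:

```python
class RetryStormMonitor:
    """Tracks the share of 429 responses and flags when it crosses a threshold."""

    def __init__(self, alert_ratio: float = 0.10):
        self.alert_ratio = alert_ratio
        self.total = 0
        self.rate_limited = 0

    def record(self, status_code: int):
        self.total += 1
        if status_code == 429:
            self.rate_limited += 1

    def should_alert(self) -> bool:
        if self.total == 0:
            return False
        return self.rate_limited / self.total > self.alert_ratio
```

Feed every response status through `record` at the single retry boundary, so the ratio reflects all layers rather than just one.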
Services like Scavio include usage tracking via the get_usage tool, which lets your agent check its remaining credit balance before making expensive calls. Building this self-awareness into the agent loop is the most reliable way to prevent runaway consumption.
Summary
Retry storms happen when multiple layers of an agent system independently retry failed requests. The fix is not more sophisticated retry logic -- it is centralizing rate-limit awareness, using circuit breakers, and giving the LLM clear instructions about when to stop. Design your agent to fail gracefully instead of failing loudly and expensively.