Definition
A retry storm is a failure mode in which many agents retry a soft-failing request simultaneously, cascading into rate-limit ceilings or dependency overload and sometimes escalating into incidents.
In Depth
Retry storms emerge when an agent harness aggressively retries ambiguous errors (timeouts, parser failures, 5xx transient) without backoff or circuit-breaker logic. Under load, retries from one agent collide with retries from another and the dependency either rate-limits or falls over. The fix is a combination of structured error codes that let the agent distinguish real failures from transient ones, exponential backoff with jitter, and per-call timeouts. Scavio publishes typed error codes precisely to help agent harnesses avoid retry storms.
Example Usage
An on-call engineer traced a Friday-afternoon pager alert to a retry storm triggered by a parser-flap in the team's custom search tool.
Platforms
Retry Storm is relevant across the following platforms, all accessible through Scavio's unified API:
- Amazon
- YouTube
- Walmart
Related Terms
Agent Harness
An agent harness is the runtime and orchestration layer around an LLM that decides when to call tools, how to manage mem...
Sub-Agent
A sub-agent is a specialized agent spawned by a parent agent to handle a scoped task, allowing a larger workflow to be d...
API Rate Limiting
API rate limiting is a mechanism that restricts the number of requests a client can make to an API within a given time w...