The Agent Harness Is Harder Than the LLM Integration
Why tool orchestration, error handling, and retries are harder than the LLM integration itself in agent systems.
Most teams building AI agents spend their first week on the LLM integration -- choosing a model, writing prompts, tuning temperature. Then they spend the next three months on everything else: tool orchestration, error recovery, state management, and deployment. The agent harness, not the LLM call, is where the real complexity lives.
This post explains why the harness is harder than the model, and what to focus on when building production agents.
The LLM Call Is the Easy Part
Calling an LLM is a solved problem. Every major provider has an SDK. The API is a single HTTP endpoint that accepts messages and returns completions. You can get a working LLM call in under 10 lines of code.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this article"}]
)
print(response.content[0].text)

That is the entire LLM integration. It works, it scales, and it rarely breaks. The hard part is everything that wraps around it.
Tool Orchestration Is Where Complexity Explodes
An agent is not just an LLM -- it is an LLM connected to tools. Each tool adds a new failure mode, a new latency source, and a new state transition. Orchestrating tool calls means handling:
- Sequential tool calls where each depends on the previous result
- Parallel tool calls that need to be fanned out and merged
- Tool calls that return errors the LLM must recover from
- Tool calls that succeed but return unexpected data formats
- Tool calls that time out after 30 seconds of silence
A simple agent that searches Google and then looks up product details on Amazon requires handling at least 6 different failure scenarios per tool, and the interaction between failures compounds quickly.
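One way to keep those failure modes tractable is to wrap every tool call so that timeouts and exceptions come back as structured results instead of uncaught errors. The sketch below is illustrative, not a production orchestrator: the `run_tool` helper and the stub `search` and `lookup` tools are hypothetical stand-ins for the Google-then-Amazon flow described above.

```python
import concurrent.futures
from typing import Any, Callable

def run_tool(fn: Callable[..., Any], *args, timeout: float = 30.0) -> dict:
    """Run one tool call, converting timeouts and exceptions into
    structured results the agent loop (or the LLM) can reason about.
    Note: a thread cannot be force-killed, so a truly runaway tool
    needs process-based isolation in a real deployment."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return {"ok": True, "data": future.result(timeout=timeout)}
        except concurrent.futures.TimeoutError:
            return {"ok": False, "error": f"{fn.__name__} timed out after {timeout}s"}
        except Exception as exc:
            return {"ok": False, "error": f"{fn.__name__} failed: {exc}"}

# Hypothetical tools standing in for the search step and the lookup step.
def search(query: str) -> list[str]:
    return ["B0TESTASIN"]

def lookup(product_id: str) -> dict:
    return {"id": product_id, "title": "Example product"}

# Sequential orchestration: the second call depends on the first,
# so a search failure must short-circuit the lookup.
step1 = run_tool(search, "wireless headphones")
if step1["ok"] and step1["data"]:
    step2 = run_tool(lookup, step1["data"][0])
else:
    step2 = {"ok": False, "error": "search returned nothing to look up"}
```

Returning a result dict rather than raising keeps the failure visible to whatever decides the next step, whether that is deterministic harness code or the LLM itself.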
Error Handling Requires Human-Level Judgment
Traditional error handling follows deterministic rules: if status 429, retry with backoff. Agent error handling is different because the LLM makes non-deterministic decisions about how to recover.
# Traditional error handling: predictable
try:
    result = api.search(query)
except RateLimitError:
    time.sleep(60)
    result = api.search(query)

# Agent error handling: unpredictable
# The LLM might:
# 1. Retry the same query
# 2. Rephrase the query
# 3. Try a different tool entirely
# 4. Give up and hallucinate an answer
# 5. Ask the user for clarification
# You need guardrails for all five paths

You cannot write a simple try/except for an agent. You need to constrain what the LLM is allowed to do when a tool fails, which means designing error-handling policies that are both flexible enough for the LLM and rigid enough to prevent bad behavior.
State Management Across Turns
Agents are stateful. They need to track what tools have been called, what results were returned, what the user's original intent was, and how much budget remains. This state must persist across multiple LLM calls within a single session.
- Conversation history grows with each turn, hitting context limits
- Tool results must be stored and referenced across turns
- Budget tracking (API credits, token usage) must be accurate
- Session state must survive process restarts for long-running agents
Most agent frameworks punt on state management, storing everything in an in-memory list that is lost on restart. Production agents need durable state -- a database or message queue that survives failures.
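A minimal durable store can be as simple as a single SQLite table keyed by session ID. The sketch below assumes SQLite is acceptable for your deployment; the schema and helper names are illustrative, not from any agent framework.

```python
import json
import sqlite3

def open_store(path: str = "agent_state.db") -> sqlite3.Connection:
    """Open (or create) a session store that survives process restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn

def save_state(conn: sqlite3.Connection, session_id: str, state: dict) -> None:
    """Persist the full session state (history, tool results, budget) as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
        (session_id, json.dumps(state)),
    )
    conn.commit()

def load_state(conn: sqlite3.Connection, session_id: str) -> dict:
    """Restore a session after a crash or restart; empty dict if none exists."""
    row = conn.execute(
        "SELECT state FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

Writing state after every tool call costs one small transaction per step, and in exchange a crashed agent resumes from its last completed step instead of from scratch.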
What to Focus On First
If you are building a production agent, spend your time on the harness, not the prompts. Specifically:
- Use structured APIs over raw scraping: Services like Scavio return clean JSON from Google, Amazon, YouTube, Walmart, and Reddit. This eliminates an entire category of parsing failures from your harness.
- Centralize error handling: Define a single error policy that all tools follow. Do not let each tool implement its own retry logic.
- Set hard limits: Cap the number of tool calls per session, the total tokens consumed, and the wall-clock time. An agent without limits will eventually surprise you.
- Log everything: Every tool call, every LLM response, every state transition. You will need this data to debug the failures that your error handling missed.
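The hard-limits advice above can be sketched as a small guard the agent loop consults before every step. The cap values and the `SessionLimits` name are placeholders to tune for your workload, not recommendations.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionLimits:
    """Hard caps per session; check() is called before each tool call."""
    max_tool_calls: int = 20
    max_tokens: int = 100_000
    max_seconds: float = 300.0
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Raise before the next step if any hard cap is exhausted."""
        if self.tool_calls >= self.max_tool_calls:
            raise RuntimeError("tool-call cap reached")
        if self.tokens >= self.max_tokens:
            raise RuntimeError("token budget exhausted")
        if time.monotonic() - self.started_at >= self.max_seconds:
            raise RuntimeError("wall-clock limit reached")
```

Raising, rather than silently truncating, forces the harness to handle the limit explicitly: stop, summarize progress, or hand control back to the user.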
The Harness Is the Product
The LLM provides intelligence. The harness provides reliability. Users do not care which model you use -- they care whether the agent gives correct answers and does not break. Investing in the harness pays off faster than investing in prompt engineering, because harness improvements are deterministic and testable while prompt changes are probabilistic and fragile.