The Agent Harness Is Harder Than the LLM Integration
Why tool orchestration, error handling, and retries are harder than the LLM integration itself in agent systems.
Most teams building AI agents spend their first week on the LLM integration -- choosing a model, writing prompts, tuning temperature. Then they spend the next three months on everything else: tool orchestration, error recovery, state management, and deployment. The agent harness, not the LLM call, is where the real complexity lives.
This post explains why the harness is harder than the model, and what to focus on when building production agents.
The LLM Call Is the Easy Part
Calling an LLM is a solved problem. Every major provider has an SDK. The API is a single HTTP endpoint that accepts messages and returns completions. You can get a working LLM call in under 10 lines of code.
from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this article"}]
)
print(response.content[0].text)

That is the entire LLM integration. It works, it scales, and it rarely breaks. The hard part is everything that wraps around it.
Tool Orchestration Is Where Complexity Explodes
An agent is not just an LLM -- it is an LLM connected to tools. Each tool adds a new failure mode, a new latency source, and a new state transition. Orchestrating tool calls means handling:
- Sequential tool calls where each depends on the previous result
- Parallel tool calls that need to be fanned out and merged
- Tool calls that return errors the LLM must recover from
- Tool calls that succeed but return unexpected data formats
- Tool calls that time out after 30 seconds of silence
A simple agent that searches Google and then looks up product details on Amazon requires handling at least 6 different failure scenarios per tool, and the interaction between failures compounds quickly.
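One way to keep those failure modes tractable is to wrap every tool call so that timeouts and exceptions come back as structured results instead of uncaught errors. The sketch below is illustrative, not a production orchestrator: the `run_tool` helper and the stub `search` and `lookup` tools are hypothetical stand-ins for the Google-then-Amazon flow described above.

```python
import concurrent.futures
from typing import Any, Callable

def run_tool(fn: Callable[..., Any], *args, timeout: float = 30.0) -> dict:
    """Run one tool call, converting timeouts and exceptions into
    structured results the agent loop (or the LLM) can reason about.
    Note: a thread cannot be force-killed, so a truly runaway tool
    needs process-based isolation in a real deployment."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return {"ok": True, "data": future.result(timeout=timeout)}
        except concurrent.futures.TimeoutError:
            return {"ok": False, "error": f"{fn.__name__} timed out after {timeout}s"}
        except Exception as exc:
            return {"ok": False, "error": f"{fn.__name__} failed: {exc}"}

# Hypothetical tools standing in for the search step and the lookup step.
def search(query: str) -> list[str]:
    return ["B0TESTASIN"]

def lookup(product_id: str) -> dict:
    return {"id": product_id, "title": "Example product"}

# Sequential orchestration: the second call depends on the first,
# so a search failure must short-circuit the lookup.
step1 = run_tool(search, "wireless headphones")
if step1["ok"] and step1["data"]:
    step2 = run_tool(lookup, step1["data"][0])
else:
    step2 = {"ok": False, "error": "search returned nothing to look up"}
```

Returning a result dict rather than raising keeps the failure visible to whatever decides the next step, whether that is deterministic harness code or the LLM itself.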
Error Handling Requires Human-Level Judgment
Traditional error handling follows deterministic rules: if status 429, retry with backoff. Agent error handling is different because the LLM makes non-deterministic decisions about how to recover.
# Traditional error handling: predictable
try:
    result = api.search(query)
except RateLimitError:
    time.sleep(60)
    result = api.search(query)

# Agent error handling: unpredictable
# The LLM might:
# 1. Retry the same query
# 2. Rephrase the query
# 3. Try a different tool entirely
# 4. Give up and hallucinate an answer
# 5. Ask the user for clarification
# You need guardrails for all five paths

You cannot write a simple try/except for an agent. You need to constrain what the LLM is allowed to do when a tool fails, which means designing error-handling policies that are both flexible enough for the LLM and rigid enough to prevent bad behavior.
State Management Across Turns
Agents are stateful. They need to track what tools have been called, what results were returned, what the user's original intent was, and how much budget remains. This state must persist across multiple LLM calls within a single session.
- Conversation history grows with each turn, hitting context limits
- Tool results must be stored and referenced across turns
- Budget tracking (API credits, token usage) must be accurate
- Session state must survive process restarts for long-running agents
Most agent frameworks punt on state management, storing everything in an in-memory list that is lost on restart. Production agents need durable state -- a database or message queue that survives failures.
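A minimal durable store can be as simple as a single SQLite table keyed by session ID. The sketch below assumes SQLite is acceptable for your deployment; the schema and helper names are illustrative, not from any agent framework.

```python
import json
import sqlite3

def open_store(path: str = "agent_state.db") -> sqlite3.Connection:
    """Open (or create) a session store that survives process restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn

def save_state(conn: sqlite3.Connection, session_id: str, state: dict) -> None:
    """Persist the full session state (history, tool results, budget) as JSON."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?)",
        (session_id, json.dumps(state)),
    )
    conn.commit()

def load_state(conn: sqlite3.Connection, session_id: str) -> dict:
    """Restore a session after a crash or restart; empty dict if none exists."""
    row = conn.execute(
        "SELECT state FROM sessions WHERE id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else {}
```

Writing state after every tool call costs one small transaction per step, and in exchange a crashed agent resumes from its last completed step instead of from scratch.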
What to Focus On First
If you are building a production agent, spend your time on the harness, not the prompts. Specifically:
- Use structured APIs over raw scraping: Services like Scavio return clean JSON from Google, Amazon, YouTube, Walmart, and Reddit. This eliminates an entire category of parsing failures from your harness.
- Centralize error handling: Define a single error policy that all tools follow. Do not let each tool implement its own retry logic.
- Set hard limits: Cap the number of tool calls per session, the total tokens consumed, and the wall-clock time. An agent without limits will eventually surprise you.
- Log everything: Every tool call, every LLM response, every state transition. You will need this data to debug the failures that your error handling missed.
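The hard-limits advice above can be sketched as a small guard the agent loop consults before every step. The cap values and the `SessionLimits` name are placeholders to tune for your workload, not recommendations.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionLimits:
    """Hard caps per session; check() is called before each tool call."""
    max_tool_calls: int = 20
    max_tokens: int = 100_000
    max_seconds: float = 300.0
    tool_calls: int = 0
    tokens: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Raise before the next step if any hard cap is exhausted."""
        if self.tool_calls >= self.max_tool_calls:
            raise RuntimeError("tool-call cap reached")
        if self.tokens >= self.max_tokens:
            raise RuntimeError("token budget exhausted")
        if time.monotonic() - self.started_at >= self.max_seconds:
            raise RuntimeError("wall-clock limit reached")
```

Raising, rather than silently truncating, forces the harness to handle the limit explicitly: stop, summarize progress, or hand control back to the user.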
The Harness Is the Product
The LLM provides intelligence. The harness provides reliability. Users do not care which model you use -- they care whether the agent gives correct answers and does not break. Investing in the harness pays off faster than investing in prompt engineering, because harness improvements are deterministic and testable while prompt changes are probabilistic and fragile.