LangGraph Critic Loop: Production Patterns

The critic loop pattern works well for multi-agent research systems until it reaches production, where the state space explodes and the critic itself becomes a bottleneck. Production-grade critic loops need four things: narrow scoring dimensions, a maximum iteration cap, a deterministic scoring layer alongside the LLM, and human-in-the-loop review as a final gate.

The critic loop pattern

A researcher agent generates output. A critic agent evaluates it against defined criteria. If the output fails, the researcher revises and resubmits. This iterates until the critic approves or a max iteration limit is hit.

Why it breaks in production

The critic evaluates dimensions it was not designed for, which expands the state space.
Without iteration limits, loops run indefinitely, burning tokens and credits.
LLM-only evaluation is inconsistent: the same output can get different scores on re-evaluation.
There is no audit trail showing why the critic approved or rejected each iteration.

Production patterns

Pattern 1: narrow scoring dimensions

Limit the critic to exactly 3 scoring dimensions. For a clinical research agent: source count, recency of sources, and relevance to the query. Each dimension gets a binary or 1-5 score. This prevents scope creep in evaluation.
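One way to enforce the three-dimension limit in code is to give the critic a fixed result type that cannot carry anything else. The sketch below is an illustration only: the class name `CriticScores`, the field names, and the passing threshold of 3 are assumptions, not part of the original design.

```python
from dataclasses import dataclass

# Hypothetical names: the article only specifies three dimensions and a 1-5 scale.
DIMENSIONS = ("source_count", "recency", "relevance")
PASS_THRESHOLD = 3  # assumed minimum passing score per dimension


@dataclass(frozen=True)
class CriticScores:
    """Scores for exactly three dimensions, each an integer from 1 to 5."""
    source_count: int
    recency: int
    relevance: int

    def __post_init__(self):
        # Reject out-of-range values so a drifting critic fails loudly
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def failing(self) -> list:
        """Names of the dimensions that scored below the threshold."""
        return [name for name in DIMENSIONS if getattr(self, name) < PASS_THRESHOLD]

    @property
    def passed(self) -> bool:
        return not self.failing()
```

Because the dataclass has no field for anything beyond the three dimensions, a critic prompt that starts scoring tone or formatting has nowhere to put the result, which keeps the scope creep visible at the type level.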

Pattern 2: hybrid scoring (LLM + deterministic)

Python

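def critic_score(output: dict, sources: list) -> dict:
    """Hybrid scoring: deterministic checks + LLM nuance.

    Assumes two helpers defined elsewhere: days_old(date) returns a source's
    age in days, and llm_judge_relevance(text, sources) returns a relevance
    score that is compared against the 0.7 threshold below.
    """
    score = {"pass": True, "reasons": []}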

    # Deterministic checks (fast, consistent)
    if len(sources) < 3:
        score["pass"] = False
        score["reasons"].append(f"Only {len(sources)} sources, need 3+")

    stale = [s for s in sources if days_old(s["date"]) > 30]
    if len(stale) > len(sources) // 2:
        score["pass"] = False
        score["reasons"].append("Majority of sources older than 30 days")

    # LLM check (only if deterministic checks pass)
    if score["pass"]:
        relevance = llm_judge_relevance(output["text"], sources)
        if relevance < 0.7:
            score["pass"] = False
            score["reasons"].append(f"Relevance score {relevance:.2f} < 0.7")

    return score
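The scoring function relies on a `days_old` helper that is not shown. A minimal sketch follows, assuming each source's `date` field is an ISO 8601 string that starts with `YYYY-MM-DD`; the `today` parameter is an addition for testability and is not part of the original call signature.

```python
from datetime import date, datetime, timezone
from typing import Optional


def days_old(published: str, today: Optional[date] = None) -> int:
    """Return the age in whole days of an ISO 8601 date string.

    Assumption: the first ten characters are YYYY-MM-DD. Any time or
    timezone suffix is ignored, so the result is accurate to the day only.
    """
    published_date = date.fromisoformat(published[:10])
    if today is None:
        # Use the current UTC date so results do not depend on server locale
        today = datetime.now(timezone.utc).date()
    return (today - published_date).days
```

With the `> 30` comparison used in the staleness check, a source that is exactly 30 days old still counts as fresh.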

Pattern 3: max iterations with graceful exit

Python

MAX_CRITIC_ITERATIONS = 3

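def critic_loop(state: dict) -> dict:
    # Create the audit trail on first use so the append below cannot raise KeyError
    state.setdefault("audit_trail", [])
    for i in range(MAX_CRITIC_ITERATIONS):
        score = critic_score(state["output"], state["sources"])
        state["audit_trail"].append({
            "iteration": i + 1,
            "score": score,
            "timestamp": now_iso(),
        })
        if score["pass"]:
            state["status"] = "approved"
            return state
        # Feed the failure reasons back to the researcher for the next attempt
        state["output"] = researcher_revise(state, score["reasons"])

    # Cap reached: the last revision was never re-scored, so hand off to a human
    state["status"] = "max_iterations_reached"
    state["needs_human_review"] = True
    return state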
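Once the loop returns, something has to act on `status` and `needs_human_review`. The routing function below is a sketch of that gate. The step names `"publish"` and `"human_review"` are hypothetical, and it assumes approved output may proceed without review, which a clinical deployment may not permit.

```python
def route_after_critic(state: dict) -> str:
    """Return the name of the next step based on the loop's final state."""
    if state.get("needs_human_review"):
        return "human_review"
    if state.get("status") == "approved":
        return "publish"
    # Unknown or missing status: fail closed rather than publish unreviewed output
    return "human_review"
```

A function of this shape, which maps state to a step name, is the kind of callable LangGraph's `add_conditional_edges` accepts as its routing argument, so the same logic could sit on the edge leaving the critic node.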

Pattern 4: search verification in the critic

For research agents, the critic should verify citations against live sources. This catches hallucinated references: the critic searches for each cited source and confirms it exists and contains the claimed information.
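The verification step can be sketched as a function that takes the search call as a parameter, so the critic logic stays independent of any particular search provider. Everything named here is an assumption for illustration: the citation shape (`url` and `claim` keys), the `fetch_text` callable, and the plain substring match, which a real critic would likely replace with a semantic or LLM-based comparison.

```python
from typing import Callable, Iterable, Optional


def verify_citations(
    citations: Iterable[dict],
    fetch_text: Callable[[str], Optional[str]],
) -> list:
    """Return one problem record per citation that could not be confirmed.

    fetch_text is a hypothetical callable: given a URL it returns the
    source text, or None when the source cannot be found.
    """
    problems = []
    for citation in citations:
        text = fetch_text(citation["url"])
        if text is None:
            # The cited source does not exist: likely a hallucinated reference
            problems.append({"url": citation["url"], "reason": "source not found"})
        elif citation["claim"].lower() not in text.lower():
            # The source exists but does not contain the claimed information
            problems.append({"url": citation["url"], "reason": "claim not found in source"})
    return problems
```

An empty return value means every citation was found and matched. The problem records can be appended to the critic's `reasons` list so the researcher knows which references to replace.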

Audit trail for clinical contexts

LangGraph's checkpointer saves a snapshot of the graph state at each step, but those checkpoints are not human-readable by default. For clinical research, surface the audit trail explicitly: how each dimension scored, why the critic approved or rejected, and what changed between iterations.
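A readable audit trail can be rendered directly from the entries that the critic loop appends. The sketch below assumes the entry shape from Pattern 3 (`iteration`, `score`, `timestamp`). The `diff_revisions` helper is an addition that assumes the caller keeps the text of each revision, which the loop as written does not store.

```python
import difflib


def format_audit_trail(audit_trail: list) -> str:
    """Render audit entries as plain text, one verdict line per iteration."""
    lines = []
    for entry in audit_trail:
        score = entry["score"]
        verdict = "APPROVED" if score["pass"] else "REJECTED"
        lines.append(f"Iteration {entry['iteration']} [{entry['timestamp']}]: {verdict}")
        # List each rejection reason under its iteration
        for reason in score["reasons"]:
            lines.append(f"  - {reason}")
    return "\n".join(lines)


def diff_revisions(before: str, after: str) -> str:
    """Return a unified diff of two revisions, or an empty string if identical."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile="previous",
        tofile="revised",
        lineterm="",
    )
    return "\n".join(diff)
```

Storing the formatted text alongside the raw checkpoint gives reviewers something they can read without tooling, while the checkpoint remains the source of truth.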
