
LangGraph Critic Loop: Production Patterns

The researcher-critic-HITL (human-in-the-loop) pattern in LangGraph catches hallucinated citations before they ship. This article covers production patterns for medical, financial, and legal agents.


The critic loop pattern works well for multi-agent research systems in prototypes, but in production the state space explodes and the critic itself becomes a bottleneck. Production-grade critic loops need narrow scoring dimensions, a maximum iteration cap, a deterministic scoring layer alongside the LLM, and human-in-the-loop review as the final gate.

The critic loop pattern

A researcher agent generates output. A critic agent evaluates it against defined criteria. If the output fails, the researcher revises and resubmits. This iterates until the critic approves or a max iteration limit is hit.

Why it breaks in production

  • The critic evaluates dimensions it was not designed for, which expands the state space
  • Without iteration limits, loops run indefinitely, burning tokens and credits
  • LLM-only evaluation is inconsistent: the same output gets different scores on re-evaluation
  • There is no audit trail showing why the critic approved or rejected each iteration

Production patterns

Pattern 1: narrow scoring dimensions

Limit the critic to exactly 3 scoring dimensions. For a clinical research agent: source count, recency of sources, and relevance to the query. Each dimension gets a binary or 1-5 score. This prevents scope creep in evaluation.
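One way to keep the critic from drifting beyond its three dimensions is to validate its output against a fixed schema. The following is a minimal sketch, not a prescribed implementation: the names `ALLOWED_DIMENSIONS`, `DimensionScore`, and `validate_scores` are hypothetical, and the 1-5 range is taken from the description above.

```python
from dataclasses import dataclass

# Hypothetical schema: the article fixes the three dimensions, not these names.
ALLOWED_DIMENSIONS = ("source_count", "recency", "relevance")


@dataclass(frozen=True)
class DimensionScore:
    name: str
    score: int  # 1-5 scale; a binary dimension can use only 1 and 5
    reason: str = ""

    def __post_init__(self) -> None:
        # Reject any dimension the critic was not designed to evaluate.
        if self.name not in ALLOWED_DIMENSIONS:
            raise ValueError(f"Unknown dimension: {self.name!r}")
        if not 1 <= self.score <= 5:
            raise ValueError(f"Score for {self.name} must be 1-5, got {self.score}")


def validate_scores(scores: list[DimensionScore]) -> dict[str, int]:
    """Require exactly one score per allowed dimension and nothing else."""
    names = [s.name for s in scores]
    if sorted(names) != sorted(ALLOWED_DIMENSIONS):
        raise ValueError(f"Expected one score each for {ALLOWED_DIMENSIONS}, got {names}")
    return {s.name: s.score for s in scores}
```

If the critic is an LLM, parsing its response into this shape and rejecting anything that fails validation keeps extra dimensions such as tone or style out of the loop.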

Pattern 2: hybrid scoring (LLM + deterministic)

Python
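from datetime import date

def days_old(iso_date: str) -> int:
    """Age of a source in days. Assumes ISO-8601 date strings such as "2025-01-31"."""
    return (date.today() - date.fromisoformat(iso_date)).days

def llm_judge_relevance(text: str, sources: list) -> float:
    """Placeholder for the LLM judge; expected to return a relevance score in [0, 1]."""
    raise NotImplementedError("Connect this to your model client")

def critic_score(output: dict, sources: list) -> dict:
    """Hybrid scoring: deterministic checks + LLM nuance."""
    score = {"pass": True, "reasons": []}

    # Deterministic checks (fast, consistent)
    if len(sources) < 3:
        score["pass"] = False
        score["reasons"].append(f"Only {len(sources)} sources, need 3+")

    # Fail when more than half of the sources are older than 30 days
    stale = [s for s in sources if days_old(s["date"]) > 30]
    if len(stale) > len(sources) // 2:
        score["pass"] = False
        score["reasons"].append("Majority of sources older than 30 days")

    # LLM check (only if deterministic checks pass, which saves a model call)
    if score["pass"]:
        relevance = llm_judge_relevance(output["text"], sources)
        if relevance < 0.7:
            score["pass"] = False
            score["reasons"].append(f"Relevance score {relevance:.2f} < 0.7")

    return score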

Pattern 3: max iterations with graceful exit

Python
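from datetime import datetime, timezone

MAX_CRITIC_ITERATIONS = 3

def now_iso() -> str:
    """Current UTC time as an ISO-8601 string."""
    return datetime.now(timezone.utc).isoformat()

def critic_loop(state: dict) -> dict:
    """Score and revise until the critic approves or the iteration cap is hit."""
    state.setdefault("audit_trail", [])
    for i in range(MAX_CRITIC_ITERATIONS):
        score = critic_score(state["output"], state["sources"])
        state["audit_trail"].append({
            "iteration": i + 1,
            "score": score,
            "timestamp": now_iso(),
        })
        if score["pass"]:
            state["status"] = "approved"
            return state
        # Skip the revision on the final pass: it would never be scored,
        # and the human reviewer should see the output the audit trail describes.
        if i < MAX_CRITIC_ITERATIONS - 1:
            # researcher_revise is the researcher agent's revision call, defined elsewhere
            state["output"] = researcher_revise(state, score["reasons"])

    # Graceful exit: escalate to a human instead of looping further
    state["status"] = "max_iterations_reached"
    state["needs_human_review"] = True
    return state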
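Inside a LangGraph graph, the same cap is usually expressed as a routing decision after the critic node rather than as a Python `for` loop. The sketch below shows only the routing function, in plain Python. The state keys `score` and `iteration` and the returned labels are assumptions for illustration; in a real graph a function like this would typically be passed to `add_conditional_edges` along with a mapping from labels to node names.

```python
MAX_CRITIC_ITERATIONS = 3


def route_after_critic(state: dict) -> str:
    """Pick the next step from the latest critic result.

    Assumed (hypothetical) state keys:
      "score"     - the latest critic result, with a boolean "pass" field
      "iteration" - number of critic passes completed so far
    """
    if state["score"]["pass"]:
        return "approved"
    if state["iteration"] >= MAX_CRITIC_ITERATIONS:
        # Cap reached: escalate to a human instead of revising again.
        return "needs_human_review"
    return "revise"
```

Keeping the routing logic in a pure function makes the cap easy to unit test without building the graph.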

Pattern 4: search verification in the critic

For research agents, the critic should verify citations against live sources. This catches hallucinated references: the critic searches for each cited source and confirms it exists and contains the claimed information.
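A minimal sketch of that verification step follows. It assumes a hypothetical citation shape (`{"url": ..., "claim": ...}`) and an injected `fetch_text` callable that returns the source text, or `None` when the source cannot be found. The keyword-overlap test is a crude stand-in for whatever claim check you actually use, such as an LLM entailment judgment.

```python
from typing import Callable, Optional

# Hypothetical types: fetch_text wraps your search or HTTP client and
# returns the page text, or None when the cited source does not exist.
FetchFn = Callable[[str], Optional[str]]


def verify_citations(citations: list[dict], fetch_text: FetchFn) -> list[dict]:
    """Check that each cited source exists and mentions the claim's key terms."""
    results = []
    for c in citations:
        text = fetch_text(c["url"])
        if text is None:
            results.append({"url": c["url"], "ok": False, "reason": "source not found"})
            continue
        # Crude stand-in for a real claim check: every longer word of the
        # claim must appear somewhere in the source text.
        words = [w.strip(".,;:()").lower() for w in c["claim"].split()]
        terms = [w for w in words if len(w) > 3]
        haystack = text.lower()
        missing = [t for t in terms if t not in haystack]
        reason = "" if not missing else f"terms not found in source: {missing}"
        results.append({"url": c["url"], "ok": not missing, "reason": reason})
    return results
```

Injecting `fetch_text` keeps the check deterministic in tests and lets the failed results feed straight into the critic's `reasons` list.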

Audit trail for clinical contexts

LangGraph's checkpointer logs full state at each node, but it is not human-readable by default. For clinical research, surface the audit trail explicitly: which dimensions scored how, why the critic approved or rejected, and what changed between iterations.
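As a sketch of what surfacing the trail could look like, the function below renders audit entries shaped like those recorded in Pattern 3 (`iteration`, `score`, `timestamp`) as plain text. The optional `output_text` key is an assumption added here so the report can show what changed between iterations; it is not part of the original state.

```python
import difflib


def render_audit_trail(audit_trail: list[dict]) -> str:
    """Format critic iterations as plain text for a human reviewer."""
    lines = []
    previous_text = None
    for entry in audit_trail:
        score = entry["score"]
        verdict = "APPROVED" if score["pass"] else "REJECTED"
        lines.append(f"Iteration {entry['iteration']} ({entry['timestamp']}): {verdict}")
        lines.extend(f"  - {reason}" for reason in score["reasons"])

        # Hypothetical optional key: the researcher output scored in this pass.
        text = entry.get("output_text")
        if text is not None and previous_text is not None:
            diff = difflib.unified_diff(
                previous_text.splitlines(),
                text.splitlines(),
                fromfile="previous",
                tofile="current",
                lineterm="",
            )
            lines.extend(f"    {d}" for d in diff)
        if text is not None:
            previous_text = text
    return "\n".join(lines)
```

The rendered report can be stored alongside the checkpoint or shown to the reviewer at the human-in-the-loop gate.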