
Building a Research Agent with LangGraph + Scavio

Build a multi-step research agent using LangGraph and Scavio. Full architecture walkthrough with state management, conditional routing, and real output examples.

14 min read

A single search query rarely gives you the full picture. When you ask an agent to "research the state of AI regulation in 2026," it needs to plan multiple queries, run them, evaluate what's missing, and synthesize findings into a coherent report. This is a research agent -- and LangGraph is the right tool to build one.

In this case study, we build a research agent that takes a topic, generates targeted search queries, runs each through Scavio to get structured SERP data, and produces a cited research summary. We will cover the architecture, the full implementation, and lessons learned from running it on real research tasks.

Why LangGraph Instead of AgentExecutor

LangChain's AgentExecutor works well for single-turn tool use: the LLM decides to call a tool, gets a result, and responds. But research requires multi-step orchestration:

  • Controlled flow -- you decide the sequence of plan, search, and synthesize steps rather than letting the LLM freewheel
  • State accumulation -- results from each search append to a growing context, not replace it
  • Conditional branching -- after each search, you decide whether to search again or move to synthesis
  • Debuggability -- each node in the graph is inspectable, making it easier to diagnose issues

LangGraph gives you a state machine where each node is a function that reads state, does work, and returns state updates.
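Stripped of the library, that contract is worth internalizing before the full build. A minimal sketch (plain Python, no LangGraph import) of the node pattern -- read state, return only the keys you changed:

```python
# A node is just a function: it reads the current state and returns
# a partial update. LangGraph merges the update into state for you;
# here we simulate the merge with dict.update.
def plan(state: dict) -> dict:
    return {"queries": [f"{state['topic']} overview", f"{state['topic']} 2026"]}

state = {"topic": "AI regulation", "queries": []}
state.update(plan(state))  # what the graph runtime does after each node
print(state["queries"])
# ['AI regulation overview', 'AI regulation 2026']
```

The names here are illustrative; the real nodes in this article return updates to the `ResearchState` schema defined next.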

Architecture

The agent follows a three-node graph with a conditional loop:

Text
[START] --> [plan] --> [search] --> should_continue?
                           ^              |      |
                           | more queries |      | done
                           +--------------+      v
                                            [synthesize] --> [END]
  1. plan -- the LLM breaks the research topic into 3-5 specific, non-overlapping search queries
  2. search -- the next query runs through Scavio and structured results are appended to state
  3. synthesize -- the LLM reads all accumulated results and writes a cited research summary

Prerequisites

Bash
pip install langchain-scavio langgraph langchain-openai
Bash
export SCAVIO_API_KEY="sk_live_..."
export OPENAI_API_KEY="sk-..."

Step 1: Define the State Schema

LangGraph state is a typed dictionary that flows through every node. The key design decision is using Annotated[list[str], add] for the results field -- this tells LangGraph to append new results to the existing list rather than replace it, which is critical for accumulating findings across multiple search iterations.

Python
from typing import TypedDict, Annotated
from operator import add

class ResearchState(TypedDict):
    topic: str                                  # the research question
    queries: list[str]                          # planned search queries
    results: Annotated[list[str], add]          # accumulated search results
    current_query_index: int                    # which query to run next
    summary: str                                # final research summary
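The `add` reducer is just `operator.add` applied to the existing value and the node's update. A quick standalone check of the merge semantics:

```python
from operator import add

# LangGraph calls the reducer as: new_value = add(existing, update).
# For lists, operator.add concatenates, so results accumulate across
# search iterations instead of being overwritten.
existing = ["## Search: query one\n\n..."]
update = ["## Search: query two\n\n..."]
merged = add(existing, update)
assert merged == existing + update  # both searches are retained
```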

Step 2: Plan Node -- Generate Search Queries

The plan node asks the LLM to decompose the research topic into specific, non-overlapping queries. The prompt matters: vague queries return vague results.

Python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

def plan_node(state: ResearchState) -> dict:
    response = llm.invoke(
        f"You are a research assistant. Generate 3-5 specific search queries "
        f"to thoroughly research this topic. Each query should cover a different "
        f"angle. Return ONLY the queries, one per line, no numbering.\n\n"
        f"Topic: {state['topic']}"
    )
    queries = [
        q.strip() for q in response.content.strip().split("\n")
        if q.strip()
    ]
    return {"queries": queries, "current_query_index": 0}

For the topic "How are AI agents being used in healthcare in 2026?", this might produce:

Text
AI agents healthcare diagnosis 2026
clinical trial automation AI agents
AI drug discovery agents latest results
healthcare AI regulations FDA 2026
patient triage AI agent hospital deployment
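In practice, models sometimes number their output despite the "no numbering" instruction. A defensive variant of the query parsing (the regex-based stripping is an addition, not part of the plan node above) tolerates `1.`, `2)`, or bullet prefixes:

```python
import re

def parse_queries(raw: str) -> list[str]:
    """Split LLM output into queries, stripping any leading numbering or bullets."""
    queries = []
    for line in raw.strip().split("\n"):
        # Remove prefixes like "1.", "2)", "-", or "*" if the model adds them.
        line = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if line:
            queries.append(line)
    return queries

raw = "1. AI agents healthcare diagnosis 2026\n2) clinical trial automation\n- FDA approvals"
print(parse_queries(raw))
# ['AI agents healthcare diagnosis 2026', 'clinical trial automation', 'FDA approvals']
```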

Step 3: Search Node -- Structured SERP Retrieval

Each search node invocation picks the next query from state and runs it through ScavioSearch. We enable knowledge graphs and People Also Ask to maximize the structured context available for synthesis.

Python
from langchain_scavio import ScavioSearch

search_tool = ScavioSearch(
    max_results=5,
    include_knowledge_graph=True,
    include_questions=True,
    include_related=True,
)

def search_node(state: ResearchState) -> dict:
    idx = state["current_query_index"]
    query = state["queries"][idx]

    result = search_tool.invoke({"query": query})

    return {
        "results": [f"## Search: {query}\n\n{result}"],
        "current_query_index": idx + 1,
    }

Notice that each result is prefixed with the query that produced it. This helps the synthesis step understand where each piece of information came from -- important for generating accurate citations.
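Because the node only reads and writes state, its bookkeeping is testable without network calls. A sketch with a stand-in tool (`FakeSearchTool` is hypothetical, mimicking only the `invoke` shape used above):

```python
class FakeSearchTool:
    """Stand-in for ScavioSearch exposing the same invoke() interface."""
    def invoke(self, args: dict) -> str:
        return f"(fake results for: {args['query']})"

search_tool = FakeSearchTool()

def search_node(state: dict) -> dict:
    # Same logic as the real node, restated so this snippet runs standalone.
    idx = state["current_query_index"]
    query = state["queries"][idx]
    result = search_tool.invoke({"query": query})
    return {
        "results": [f"## Search: {query}\n\n{result}"],
        "current_query_index": idx + 1,
    }

update = search_node({"queries": ["q1", "q2"], "current_query_index": 0})
assert update["current_query_index"] == 1          # index advanced
assert update["results"][0].startswith("## Search: q1")
```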

Step 4: Synthesize Node -- Research Summary

Once all queries have been executed, the synthesize node produces the final output. The prompt instructs the LLM to cite sources and flag any gaps in the research.

Python
def synthesize_node(state: ResearchState) -> dict:
    all_results = "\n\n---\n\n".join(state["results"])
    response = llm.invoke(
        f"You are a research analyst. Based on the following search results, "
        f"write a comprehensive research summary about: {state['topic']}\n\n"
        f"Requirements:\n"
        f"- Cite sources with URLs when making specific claims\n"
        f"- Note any conflicting information between sources\n"
        f"- Flag areas where more research is needed\n"
        f"- Use clear section headings\n\n"
        f"Search results:\n{all_results}"
    )
    return {"summary": response.content}

Step 5: Routing Logic

After each search, decide whether to search again or move to synthesis:

Python
def should_continue(state: ResearchState) -> str:
    if state["current_query_index"] < len(state["queries"]):
        return "search"
    return "synthesize"
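Since the router is a plain function, the boundary behavior is easy to check directly (the logic is restated here so the snippet runs standalone):

```python
def should_continue(state: dict) -> str:
    # Same routing logic as above: keep searching until all queries are done.
    if state["current_query_index"] < len(state["queries"]):
        return "search"
    return "synthesize"

# Two queries planned: index 1 still has work left, index 2 means we're done.
assert should_continue({"queries": ["a", "b"], "current_query_index": 1}) == "search"
assert should_continue({"queries": ["a", "b"], "current_query_index": 2}) == "synthesize"
```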

Step 6: Assemble and Run the Graph

Python
from langgraph.graph import StateGraph, END

graph = StateGraph(ResearchState)

# Add nodes
graph.add_node("plan", plan_node)
graph.add_node("search", search_node)
graph.add_node("synthesize", synthesize_node)

# Define edges
graph.set_entry_point("plan")
graph.add_edge("plan", "search")
graph.add_conditional_edges("search", should_continue)
graph.add_edge("synthesize", END)

# Compile
app = graph.compile()
Python
result = app.invoke({
    "topic": "How are AI agents being used in healthcare in 2026?",
    "queries": [],
    "results": [],
    "current_query_index": 0,
    "summary": "",
})

print(result["summary"])

Example Output

Here is an abbreviated version of what the agent produces for the healthcare AI topic:

Text
# AI Agents in Healthcare: 2026 Research Summary

## Diagnosis and Triage
AI agents are being deployed in emergency departments for patient
triage, with several hospital systems reporting 30-40% reduction in
wait times (Source: healthtech-news.example.com). The FDA has approved
12 AI-based diagnostic tools in Q1 2026 alone.

## Drug Discovery
Multi-agent systems are accelerating drug candidate identification.
AI agents now handle literature review, molecular simulation, and
clinical trial matching as coordinated workflows
(Source: pharma-ai.example.com).

## Regulatory Landscape
The FDA's 2026 AI/ML framework requires continuous monitoring for
AI diagnostic tools, moving away from one-time approval processes.

## Gaps Identified
- Limited data on AI agent deployment in rural healthcare settings
- No peer-reviewed studies on long-term patient outcomes with AI triage
- Regulatory frameworks outside the US remain fragmented

Lessons and Best Practices

Query quality is everything

The plan node is the most important part of the pipeline. Generic queries like "AI in healthcare" return generic results. Specific queries like "FDA AI diagnostic tool approvals 2026" return actionable data. Invest time in the planning prompt.

Control token usage with field filters

SERP data can be large. If you are running 5 queries and each returns 10 results with knowledge graphs, PAA, and related searches, the combined context can exceed 20,000 tokens before synthesis. Use max_results and the include_* flags to keep context manageable. For most research tasks, 5 results per query with knowledge graph and PAA enabled is a good balance.
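If the flags alone are not enough, you can also cap each accumulated result before synthesis. A minimal sketch using a characters-per-token heuristic (the 4-characters-per-token ratio is a rough assumption for English text, not an exact count):

```python
def cap_results(results: list[str], max_tokens_each: int = 2000) -> list[str]:
    """Truncate each accumulated result to roughly max_tokens_each tokens."""
    max_chars = max_tokens_each * 4  # ~4 characters per token for English
    return [
        r if len(r) <= max_chars else r[:max_chars] + "\n[truncated]"
        for r in results
    ]

capped = cap_results(["x" * 10_000, "short result"], max_tokens_each=1000)
assert len(capped[0]) == 4_000 + len("\n[truncated]")  # long result trimmed
assert capped[1] == "short result"                     # short result untouched
```

For accurate budgets, a real tokenizer (such as the one matching your model) is the better tool; this heuristic just keeps the sketch dependency-free.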

Use related searches for adaptive research

An advanced pattern is to add a fourth node that examines related searches from the SERP data and generates follow-up queries for gaps in the current results. This creates an adaptive research loop where the agent discovers new angles it had not initially considered.
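A sketch of that refine step's parsing, assuming related searches appear one per line under a "Related searches:" heading -- the exact ScavioSearch output format may differ, so treat this as a template to adapt:

```python
def extract_related(result_text: str) -> list[str]:
    """Pull follow-up query candidates from a 'Related searches:' section."""
    related, in_section = [], False
    for line in result_text.splitlines():
        if line.strip().lower().startswith("related searches"):
            in_section = True
            continue
        if in_section:
            if not line.strip():  # a blank line ends the section
                break
            related.append(line.strip("-* ").strip())
    return related

sample = "Top results...\n\nRelated searches:\n- AI triage rural hospitals\n- AI agent FDA audit"
print(extract_related(sample))
# ['AI triage rural hospitals', 'AI agent FDA audit']
```

Feeding these back into `state["queries"]` (deduplicated against queries already run) is what turns the fixed plan into an adaptive loop.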

The Simpler Alternative: create_react_agent

If you do not need the three-node architecture and just want an agent that can search when needed, LangGraph's create_react_agent is simpler:

Python
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_scavio import ScavioSearch

agent = create_react_agent(
    ChatOpenAI(model="gpt-4o"),
    tools=[ScavioSearch(max_results=5)],
)

response = agent.invoke({
    "messages": [{"role": "user", "content": "Research AI in healthcare 2026"}]
})

The tradeoff: create_react_agent lets the LLM decide when and how many times to search, which works for simple tasks but gives you less control over multi-step research workflows.

Next Steps