Setting Up Web Search for Local LLMs via MCP

How to configure MCP servers and search APIs to give local LLMs real-time web search capabilities.

8 min read

Running a local LLM gives you privacy, zero per-token costs, and full control over your inference stack. But local models are cut off from the live web by default. To give them real-time search capabilities, you need a bridge between the model and a search API. The Model Context Protocol (MCP) provides exactly that -- a standardized way to connect tools to LLMs without writing custom integration code.

What You Need

The setup has three layers: a local LLM with tool-calling support, an MCP client that manages tool routing, and an MCP-compatible search server. Here is the stack:

  • A local model: Llama 3.3, Qwen 2.5, Mistral, or similar via Ollama or llama.cpp
  • An MCP client: options include Open WebUI, LM Studio, or a custom Python script using the MCP SDK
  • A search MCP server: Scavio's hosted server at https://mcp.scavio.dev/mcp

The key requirement is that your local model supports function calling (also called tool use). Not all models do -- more on this below.

Setting Up Ollama with Tool Calling

Ollama is the simplest way to run local models. Install it, pull a model with tool-calling support, and verify it works:

Bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model with strong tool-calling support
ollama pull llama3.3

# Verify the model loads
ollama run llama3.3 "What is 2+2?"

Ollama exposes an OpenAI-compatible API on localhost:11434. Your MCP client will communicate with this endpoint.
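As a concrete sketch of what that communication looks like, the snippet below builds a chat request against Ollama's OpenAI-compatible endpoint using only the standard library. The model name assumes the `llama3.3` pull from the previous step; the request is constructed but not sent, since sending requires a running Ollama instance.

```python
import json
import urllib.request

# Build an OpenAI-compatible chat request for the local Ollama server.
# The payload shape mirrors the OpenAI chat completions API that Ollama
# exposes on localhost:11434; the model name is whatever you pulled.
payload = {
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
}

request = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Requires a running Ollama instance; uncomment to actually send it:
# with urllib.request.urlopen(request) as response:
#     reply = json.loads(response.read())
#     print(reply["choices"][0]["message"]["content"])
```

Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client library can be pointed at it by changing the base URL.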

Connecting via MCP

The MCP Python SDK lets you build a lightweight client that bridges your local LLM and the search server. Here is a minimal example:

Python
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def search_web(query: str):
    async with streamablehttp_client(
        url="https://mcp.scavio.dev/mcp",
        headers={"x-api-key": "YOUR_API_KEY"}
    ) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "search_google",
                arguments={"query": query}
            )
            return result.content

This connects to Scavio's MCP server over HTTP, calls the search_google tool, and returns structured results. Your local LLM decides when to call this function based on the user's question.

Tool Registration and Routing

For the LLM to use search tools, it needs to know they exist. The MCP client lists available tools from the server and formats them as function definitions the model understands:

Python
# List all available tools from the MCP server
tools = await session.list_tools()

# Convert to OpenAI-compatible function format
functions = []
for tool in tools.tools:
    functions.append({
        "name": tool.name,
        "description": tool.description,
        "parameters": tool.inputSchema
    })

# Pass to your local LLM via Ollama's API
# The model will generate tool calls when appropriate

When the model generates a tool call, your client intercepts it, routes it to the MCP server, and feeds the result back into the conversation context.
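That intercept-and-feed-back step can be sketched as a small async handler. The message shapes below follow the OpenAI tool-calling convention that Ollama emits; `call_mcp_tool` is a hypothetical stand-in for the `session.call_tool` call shown earlier.

```python
import json

async def handle_tool_calls(message, messages, call_mcp_tool):
    """Route the model's tool calls to the MCP server and append the
    results to the conversation context.

    `message` is an assistant turn in the OpenAI tool-calling shape;
    `call_mcp_tool(name, args)` is a stand-in for the MCP session call.
    """
    messages.append(message)  # keep the assistant's tool-call turn
    for call in message.get("tool_calls", []):
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"])
        result = await call_mcp_tool(name, args)  # route to the MCP server
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(result),  # result re-enters the context here
        })
    return messages
```

After this handler runs, the updated `messages` list goes back to the model for another inference pass, letting it compose a final answer from the search results.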

Choosing the Right Local Model

Not all local models handle tool calling well. As of 2026, these models work reliably for this setup:

  • Llama 3.3 70B -- best overall tool-calling accuracy at the open-weight tier
  • Qwen 2.5 72B -- strong function calling with good instruction following
  • Mistral Large -- solid tool use with lower resource requirements
  • Llama 3.1 8B -- usable for simple single-tool calls on consumer hardware

Smaller models (7B-8B) can make tool calls but often hallucinate parameter names or generate malformed JSON. For production use, 70B+ models are significantly more reliable.

End-to-End Flow

The complete flow works like this: the user asks a question, the local LLM recognizes it needs web data, generates a tool call, the MCP client routes the call to the search server, the server returns structured JSON, and the LLM incorporates the results into its response. All inference stays local. Only the search query goes to the API -- your prompts and conversations never leave your machine.
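The whole flow reduces to a single loop: run local inference, and either return the model's answer or execute its tool calls and go around again. The sketch below assumes hypothetical helpers, with `chat` standing in for the Ollama API call and `call_mcp_tool` for the MCP session call from earlier.

```python
import json

async def answer(question, chat, call_mcp_tool, tools):
    """One end-to-end turn: local inference, optional search, final answer.

    `chat(messages, tools)` and `call_mcp_tool(name, args)` are
    stand-ins for the Ollama request and MCP call shown above.
    """
    messages = [{"role": "user", "content": question}]
    while True:
        reply = await chat(messages, tools)   # inference stays local
        if not reply.get("tool_calls"):
            return reply["content"]           # model answered directly
        messages.append(reply)
        for call in reply["tool_calls"]:
            result = await call_mcp_tool(     # only the query leaves the machine
                call["function"]["name"],
                json.loads(call["function"]["arguments"]),
            )
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": str(result),
            })
```

The loop exits as soon as the model responds without tool calls, so questions that need no web data are answered in a single local pass.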