
Why Qwen Hallucinates on Web Search (and the Fix)

Local LLMs hallucinate on web-grounded answers worse than cloud models, for structural reasons. Typed JSON sources and a strict citation prompt are the fix.


An r/LocalLLaMA post titled "If anyone is running qwen 9b or 27b or 35b and getting wrong facts while web search, follow this" landed 47 upvotes and 28 comments. The fix is real and worth understanding — local LLMs hallucinate on web-grounded answers for structural reasons that cloud LLMs hide.

Why local LLMs hallucinate worse than cloud LLMs

Two structural reasons:

  • Tighter context windows. Qwen 27B at 32K context. Llama-3 8B at 8K-32K. Cloud Claude/GPT routinely run 200K+. When the agent feeds raw scraped HTML (25-40K tokens for a 60KB page), the local LLM has no room for instructions, citation rules, or multi-source comparison.
  • Smaller training, less robustness to noisy input. Frontier models smooth over noisy retrieval; smaller models propagate the noise into the answer.

The fix has two parts

First, the input shape. Raw HTML is the wrong input. Typed JSON sources (Scavio's organic_results) compress the input by an order of magnitude or more: ~1.5K tokens for 10 sources vs 25-40K for raw HTML.

Second, the prompt structure. Local LLMs respond to explicit, numbered citation requirements better than soft instructions.

The minimum-viable grounding stack

Python
import requests, os

H = {'x-api-key': os.environ['SCAVIO_API_KEY']}

PROMPT = '''Answer using ONLY the sources below.
Every claim must be followed by [N] where N is the source number.
If the sources do not answer the question, say "I don't know based on the provided sources."

Sources:
{sources}

Question: {question}'''

def search(q):
    # Typed JSON results instead of raw HTML: ~1.5K tokens for 10 sources
    r = requests.post('https://api.scavio.dev/api/v1/search',
        headers=H, json={'query': q}, timeout=30).json()
    return r.get('organic_results', [])[:10]

def fmt_sources(results):
    # Numbered list gives the model unambiguous [N] reference points
    return '\n'.join(
        f'[{i+1}] {r["title"]} ({r["link"]}): {r["snippet"]}'
        for i, r in enumerate(results)
    )

def ask_qwen(q):
    sources = search(q)
    prompt = PROMPT.format(sources=fmt_sources(sources), question=q)
    # Ollama's non-streaming generate endpoint; swap the model tag as needed
    r = requests.post('http://localhost:11434/api/generate',
        json={'model': 'qwen2.5:32b', 'prompt': prompt, 'stream': False},
        timeout=120).json()
    return r['response']
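The "[N] after every claim" rule is only useful if something enforces it. A minimal post-check (a sketch; verify_citations is a helper name introduced here, not part of the original post) splits the answer into sentences and flags any sentence that carries no citation, or cites a source number outside the provided range:

```python
import re

def verify_citations(answer, n_sources):
    """Return sentences with no [N] marker, or with N out of range."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', answer) if s.strip()]
    flagged = []
    for s in sentences:
        refs = [int(m) for m in re.findall(r'\[(\d+)\]', s)]
        # No citation at all, or a citation pointing past the source list
        if not refs or any(n < 1 or n > n_sources for n in refs):
            flagged.append(s)
    return flagged
```

Anything this returns is a candidate for rejection or a retry with a stronger reminder, rather than being shown to the user as grounded fact.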

The cross-check pattern

Add a second call to Scavio with include_ai_overview: true. Compare the local LLM's claims against Google's AI Overview citation set. Disagreement is a hallucination flag — surface the disagreement to the user instead of presenting the claim as fact.
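One cheap way to sketch that comparison (an assumption about shape, not the post's code: overview_refs here stands for whatever list of cited URLs you extract from the AI Overview response, whose exact field name isn't specified above) is to map the answer's [N] citations back to links and flag any cited source whose domain never appears in the overview's citation set:

```python
import re
from urllib.parse import urlparse

def domain(url):
    # Normalize 'www.example.com' and 'example.com' to the same key
    return urlparse(url).netloc.removeprefix('www.')

def cross_check(answer, sources, overview_refs):
    """Return cited links sharing no domain with the AI Overview's
    citation set -- candidates for a hallucination flag."""
    cited = {int(m) for m in re.findall(r'\[(\d+)\]', answer)}
    cited_links = [sources[n - 1]['link'] for n in sorted(cited)
                   if 1 <= n <= len(sources)]
    overview_domains = {domain(u) for u in overview_refs}
    return [l for l in cited_links if domain(l) not in overview_domains]
```

Domain overlap is a coarse signal, not a truth check: it catches the model leaning on a source Google's overview never touched, which is exactly the disagreement worth surfacing.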

Why this works

Three reinforcing effects:

  • Context budget is preserved for actual reasoning, not parsing
  • The numbered source list gives the LLM unambiguous reference points
  • The "say I don't know" clause gives an out for unsupported questions; without it, the LLM fabricates rather than admit gaps

Reported impact in practice

In the r/LocalLLaMA post, Qwen 27B + raw HTML grounding produced ~18% factual hallucination rate. The same model with typed JSON + the strict citation prompt dropped to under 3% on the same benchmark.

The fix isn't a bigger model. It's the input shape and prompt discipline.

The honest constraints

Local LLMs still trail cloud LLMs on truly open-ended synthesis. Grounded retrieval narrows the gap; it doesn't close it. For high-stakes factual questions on broad topics, frontier cloud models still win.

For tight, focused queries (single domain, narrow topic, well-formed question) with grounded sources, local Qwen/Llama/DeepSeek with this pattern is production-viable in 2026.

Stack cost

Scavio Project tier $30/mo + Ollama (free, local) + a GPU you already own = the entire grounding stack. Compared to cloud-LLM-only at typical Claude/GPT API rates, this is the privacy-friendly, cost-controlled option for teams that already have local inference running.