qwen · local-llm · simpleqa · scavio

Qwen3 27B + Agentic Search: 95.7% SimpleQA on a Single 3090 (2026)

Qwen3.6-27B hit 95.7% SimpleQA with agentic search on one RTX 3090. The search provider matters as much as the model. Setup and honest tradeoffs.

5 min read

A LocalLLaMA post reported Qwen3.6-27B hitting 95.7% on the SimpleQA benchmark with agentic search, running fully local on a single RTX 3090. The key: the model calls a search tool for every factual question instead of relying on parametric memory. The search tool matters as much as the model.

Why SimpleQA accuracy depends on the search provider

SimpleQA tests factual recall with questions like "Who won the 2024 Nobel Prize in Physics?" A local 27B model without search gets ~40-50% on these. With search, it gets 95%+. But the accuracy ceiling is set by the search provider: if the search returns outdated or irrelevant snippets, the model hallucinates on top of bad context instead of bad memory.

Reproducing the setup

Python
import requests, os

def scavio_search(query: str) -> str:
    """Query the Scavio search API and return the top results as title/snippet bullets."""
    resp = requests.post(
        "https://api.scavio.dev/api/v1/search",
        headers={"x-api-key": os.environ["SCAVIO_API_KEY"]},
        json={"query": query, "platform": "google", "limit": 5},
        timeout=10,
    )
    resp.raise_for_status()  # fail loudly on a bad key or rate limit instead of feeding the model empty context
    results = resp.json().get("results", [])
    return "\n".join(f"- {r['title']}: {r['snippet']}" for r in results)

# Wire into Ollama tool-calling:
# 1. Define scavio_search as an available tool
# 2. System prompt: "For factual questions, call the search tool first"
# 3. Model calls search → gets context → answers with citation
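
The wiring itself, as a minimal sketch: this assumes the ollama Python client (0.4+); the qwen3:27b model tag, the tool schema, and the answer() helper are illustrative placeholders, not the exact setup from the post.

Python
import ollama  # assumes a local Ollama server with a Qwen model pulled

MODEL = "qwen3:27b"  # placeholder tag; substitute whatever build you actually run

TOOLS = [{
    "type": "function",
    "function": {
        "name": "scavio_search",
        "description": "Search the web and return title/snippet pairs for a factual query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def answer(question: str) -> str:
    messages = [
        {"role": "system", "content": "For factual questions, call the search tool first, then answer with a citation."},
        {"role": "user", "content": question},
    ]
    resp = ollama.chat(model=MODEL, messages=messages, tools=TOOLS)

    # No tool call: the model answered from parametric memory (not what we want here).
    if not resp.message.tool_calls:
        return resp.message.content

    # Execute each search call and feed the snippets back as tool messages.
    messages.append(resp.message)
    for call in resp.message.tool_calls:
        if call.function.name == "scavio_search":
            messages.append({"role": "tool", "content": scavio_search(**call.function.arguments)})

    # Second pass: the model now answers over the retrieved snippets.
    return ollama.chat(model=MODEL, messages=messages).message.content

print(answer("Who won the 2024 Nobel Prize in Physics?"))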

3090 vs 4090 for this workload

Qwen 27B at 4-bit quantization needs ~16GB of VRAM, so a 3090 (24GB) handles it with room left for the KV cache. A 4090 is faster but not required: Scavio returns results in ~200ms while the model takes 2-4 seconds to generate, so generation, not search, is the slow part, and 2-4 seconds is already fine for interactive factual Q&A.
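
The ~16GB figure is easy to sanity-check with back-of-envelope arithmetic. The layer count, KV-head layout, and context length below are illustrative assumptions for a rough estimate, not the model's published config:

Python
# Rough VRAM budget for a 27B model at 4-bit quantization.
# Architecture numbers are illustrative assumptions, not the published config.
params = 27e9
weights_gb = params * 0.5 / 1e9                          # 4-bit ≈ 0.5 bytes/param ≈ 13.5 GB

layers, kv_heads, head_dim, ctx = 60, 8, 128, 8192       # assumed GQA layout and context length
kv_gb = 2 * layers * kv_heads * head_dim * 2 * ctx / 1e9 # K and V at fp16 ≈ 2.0 GB
runtime_gb = 1.5                                         # dequant buffers, activations, CUDA overhead (assumed)

print(f"weights ≈ {weights_gb:.1f} GB, KV ≈ {kv_gb:.1f} GB, "
      f"total ≈ {weights_gb + kv_gb + runtime_gb:.1f} GB")  # ≈ 17 GB, inside a 3090's 24 GB with headroom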

The honest tradeoff

95.7% SimpleQA is impressive for a local model, but SimpleQA is a narrow benchmark: short factual questions with unambiguous answers. For multi-hop reasoning, long-form analysis, or questions where the answer is not in the first search result, the accuracy drops. The win is real but scoped: local LLM + search grounding handles factual Q&A well. It does not replace cloud models for complex reasoning.