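Reduce search API latency in AI agents by applying four techniques: parallel requests for multi-query workflows, result caching for repeated queries, payload pruning to shrink the response passed downstream, and connection pooling to avoid repeated handshake overhead. In a typical agent workflow, search calls account for 60-80% of total response time, so even small latency reductions compound across a multi-step reasoning chain. This tutorial implements each optimization with the Scavio API and measures the before-and-after impact. Connection pooling comes from the shared `requests.Session` that every step reuses.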
Prerequisites
- Python 3.8+ installed
- The requests library installed
- A Scavio API key from scavio.dev
- An existing agent workflow that makes search calls
Walkthrough
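Step 1: Measure baseline latency
Establish a baseline by timing sequential search calls so that you can measure the impact of each optimization.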
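A single timing per query can be skewed by one-off costs such as DNS resolution or a cold connection. As a hedged sketch that is not part of the original tutorial, the helpers below repeat a call several times and summarize the samples. `measure` and `summarize` are illustrative names, and the `time.sleep` call merely stands in for a real search request.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure(fn: Callable[[], object], runs: int = 5) -> List[float]:
    """Call fn `runs` times and return each call's latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.monotonic()
        fn()
        samples.append((time.monotonic() - start) * 1000)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce raw latencies to min, median and max, rounded to 0.1 ms."""
    return {
        'min': round(min(samples), 1),
        'median': round(statistics.median(samples), 1),
        'max': round(max(samples), 1),
    }


# Stand-in for a search call: sleeps about 20 ms instead of using the network.
stats = summarize(measure(lambda: time.sleep(0.02), runs=5))
print(stats)
```

Comparing medians before and after each optimization is usually more stable than comparing single runs, because the first request on a fresh connection tends to be the slowest.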
Python
import requests, os, time
from concurrent.futures import ThreadPoolExecutor

API_KEY = os.environ['SCAVIO_API_KEY']

# One shared Session reuses connections across calls (connection pooling).
SESSION = requests.Session()
SESSION.headers.update({'x-api-key': API_KEY})

def timed_search(query: str) -> tuple:
    """Run one search; return (query, latency in ms, number of organic results)."""
    start = time.monotonic()
    resp = SESSION.post('https://api.scavio.dev/api/v1/search',
                        json={'platform': 'google', 'query': query}, timeout=10)
    latency = (time.monotonic() - start) * 1000
    resp.raise_for_status()  # fail loudly rather than timing an error response
    return query, round(latency, 1), len(resp.json().get('organic_results', []))

# Baseline: sequential
queries = ['best crm 2026', 'python async tutorial', 'react vs vue']
start = time.monotonic()
for q in queries:
    _, ms, _ = timed_search(q)
    print(f'{q}: {ms}ms')
print(f'Sequential total: {(time.monotonic() - start)*1000:.0f}ms')
Step 2: Parallelize multi-query requests
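Use a thread pool to send multiple search requests concurrently. The batch then takes roughly as long as its slowest request, which cuts total wall-clock time by about 2-3x.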
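Concurrent requests sent through a shared `requests.Session` draw on that session's connection pool, which is the fourth technique named in the introduction. The sketch below shows one way to size the pool explicitly with `requests.adapters.HTTPAdapter`. `build_session` and `MAX_WORKERS` are hypothetical names, the key is a placeholder, and nothing here contacts the network.

```python
import requests
from requests.adapters import HTTPAdapter

MAX_WORKERS = 3  # assumed to match the thread pool size used in this step


def build_session(api_key: str, pool_size: int = MAX_WORKERS) -> requests.Session:
    """Create a Session whose per-host pool can hold one connection per worker."""
    session = requests.Session()
    session.headers.update({'x-api-key': api_key})
    # pool_maxsize caps how many connections are kept open per host.
    adapter = HTTPAdapter(pool_maxsize=pool_size)
    # Mounting on 'https://' replaces the default adapter for all HTTPS URLs.
    session.mount('https://', adapter)
    return session


session = build_session('placeholder-key')
print(type(session.get_adapter('https://api.scavio.dev/api/v1/search')).__name__)
```

The default adapter in requests already keeps up to 10 connections per host, so explicit sizing mainly matters when the worker count is raised beyond that.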
Python
def parallel_search(queries: list, max_workers: int = 3) -> list:
    """Run timed_search for every query concurrently and print per-query timings."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input queries
        results = list(pool.map(timed_search, queries))
    total = (time.monotonic() - start) * 1000
    for q, ms, count in results:
        print(f'{q}: {ms}ms ({count} results)')
    print(f'Parallel total: {total:.0f}ms')
    return results

parallel_search(queries)
Step 3: Add a result cache with a TTL
Cache search results keyed by platform and query string, with a time-to-live (TTL), to avoid redundant API calls for repeated queries.
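Two details of TTL caching are easy to miss: near-duplicate queries that differ only in case or spacing, and stale entries that are never evicted. The class below is a minimal sketch of one way to handle both, not the tutorial's own implementation. `TTLCache` is a hypothetical name, the lowercasing assumes the search backend treats queries case-insensitively, and the injectable `clock` exists only so expiry can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    @staticmethod
    def make_key(platform: str, query: str) -> Tuple[str, str]:
        # Lowercase and collapse whitespace so trivially different
        # spellings of the same query share one entry (an assumption).
        return platform, ' '.join(query.lower().split())

    def get(self, platform: str, query: str) -> Optional[Any]:
        key = self.make_key(platform, query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, data = entry
        if self.clock() - stored_at >= self.ttl:
            del self._store[key]  # evict the stale entry
            return None
        return data

    def put(self, platform: str, query: str, data: Any) -> None:
        self._store[self.make_key(platform, query)] = (self.clock(), data)


# Demonstration with a fake clock instead of real waiting.
now = [0.0]
demo_cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
demo_cache.put('google', 'Best  CRM 2026', {'organic_results': []})
print(demo_cache.get('google', 'best crm 2026'))  # {'organic_results': []}
now[0] = 301.0
print(demo_cache.get('google', 'best crm 2026'))  # None
```

Including the platform in the key keeps results from different search platforms from overwriting each other, as the step's description requires.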
Python
import hashlib

cache = {}
CACHE_TTL = 300  # seconds

def cached_search(query: str, platform: str = 'google') -> dict:
    """Return a cached response younger than CACHE_TTL, else fetch and store it."""
    key = hashlib.md5(f'{platform}:{query}'.encode()).hexdigest()
    now = time.time()
    if key in cache and now - cache[key]['ts'] < CACHE_TTL:
        return cache[key]['data']
    resp = SESSION.post('https://api.scavio.dev/api/v1/search',
                        json={'platform': platform, 'query': query}, timeout=10)
    resp.raise_for_status()  # never cache an error response
    data = resp.json()
    cache[key] = {'data': data, 'ts': now}
    return data

# First call: network
start = time.monotonic()
cached_search('best crm 2026')
print(f'First call: {(time.monotonic() - start)*1000:.0f}ms')

# Second call: cache
start = time.monotonic()
cached_search('best crm 2026')
print(f'Cache hit: {(time.monotonic() - start)*1000:.0f}ms')
Step 4: Prune the response payload
Before passing search results into the LLM context, strip the fields the model does not need, which reduces token processing time.
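To see the effect of pruning without calling the API, the following sketch applies the same idea to a made-up response. Only `organic_results`, `title`, `snippet` and `link` come from the tutorial's code. `search_metadata`, `position` and `thumbnail` are invented filler fields, and `prune` is a hypothetical helper that also tolerates missing or null values.

```python
import json
from typing import Any, Dict, List


def prune(data: Dict[str, Any], max_results: int = 5,
          max_snippet: int = 200) -> List[Dict[str, str]]:
    """Keep only title, truncated snippet and URL for the top results."""
    pruned = []
    for r in data.get('organic_results', [])[:max_results]:
        pruned.append({
            'title': r.get('title') or '',
            # `or ''` guards against a snippet that is present but null
            'snippet': (r.get('snippet') or '')[:max_snippet],
            'url': r.get('link') or '',
        })
    return pruned


# Made-up response for illustration only; real responses will differ.
sample = {
    'search_metadata': {'id': 'example'},
    'organic_results': [
        {'title': f'Result {i}', 'snippet': 'x' * 500,
         'link': f'https://example.com/{i}', 'position': i, 'thumbnail': None}
        for i in range(10)
    ],
}

small = prune(sample)
print(f'Full: {len(json.dumps(sample))} chars')
print(f'Pruned: {len(json.dumps(small))} chars')
```

Character count is only a rough proxy for token count, but the ratio between the two sizes gives a quick sense of how much context the pruning saves.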
Python
import json

def pruned_search(query: str) -> list:
    """Return the top 5 results with only title, truncated snippet, and URL."""
    data = cached_search(query)
    results = data.get('organic_results', [])
    return [{
        'title': r.get('title', ''),
        'snippet': r.get('snippet', '')[:200],
        'url': r.get('link', ''),
    } for r in results[:5]]

# Compare payload sizes:
full = cached_search('best crm 2026')
pruned = pruned_search('best crm 2026')
full_len = len(json.dumps(full))
pruned_len = len(json.dumps(pruned))
print(f'Full response: {full_len} chars')
print(f'Pruned response: {pruned_len} chars')
print(f'Reduction: {100 - pruned_len * 100 // full_len}%')
Python example
Python
import requests, os, time, hashlib
from concurrent.futures import ThreadPoolExecutor

# Shared session: reuses connections across all calls
S = requests.Session()
S.headers.update({'x-api-key': os.environ['SCAVIO_API_KEY']})
cache = {}

def fast_search(query):
    """Cached search with a 300-second TTL."""
    key = hashlib.md5(query.encode()).hexdigest()
    if key in cache and time.time() - cache[key]['ts'] < 300:
        return cache[key]['data']
    resp = S.post('https://api.scavio.dev/api/v1/search',
                  json={'platform': 'google', 'query': query}, timeout=10)
    resp.raise_for_status()  # never cache an error response
    data = resp.json()
    cache[key] = {'data': data, 'ts': time.time()}
    return data

def parallel(queries):
    """Run fast_search for all queries on up to 3 threads."""
    with ThreadPoolExecutor(3) as pool:
        return list(pool.map(fast_search, queries))
JavaScript example
JavaScript
// Uses the global fetch available in Node.js 18+ and in browsers.
const cache = new Map();

async function fastSearch(query) {
  // Cached search with a 300-second TTL
  const key = query;
  const cached = cache.get(key);
  if (cached && Date.now() - cached.ts < 300000) return cached.data;
  const r = await fetch('https://api.scavio.dev/api/v1/search', {
    method: 'POST',
    headers: {'x-api-key': process.env.SCAVIO_API_KEY, 'Content-Type': 'application/json'},
    body: JSON.stringify({platform: 'google', query})
  });
  // Never cache an error response
  if (!r.ok) throw new Error(`Search failed with status ${r.status}`);
  const data = await r.json();
  cache.set(key, {data, ts: Date.now()});
  return data;
}

async function parallel(queries) {
  // Promise.all starts every request before awaiting any of them
  return Promise.all(queries.map(fastSearch));
}

parallel(['best crm 2026', 'react tutorial']).then(r => console.log(r.length + ' results'));
Expected output
Measurable latency reductions: parallel requests cut total time by 2-3x, caching eliminates repeated calls, and payload pruning reduces downstream token processing.
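The way these savings add up over a multi-step reasoning chain can be checked with a back-of-envelope model. The function below is a rough sketch under stated assumptions: every search call takes the same time, each step adds a fixed amount of non-search work, and concurrent requests complete in waves of `workers`. `chain_latency_ms` and all the numbers are illustrative, not measurements.

```python
import math


def chain_latency_ms(steps: int, queries_per_step: int, search_ms: float,
                     other_ms_per_step: float, workers: int = 1) -> float:
    """Estimate wall-clock time for a reasoning chain, in milliseconds."""
    # With `workers` concurrent requests, a step's queries finish in waves.
    waves = math.ceil(queries_per_step / workers)
    return steps * (waves * search_ms + other_ms_per_step)


# Illustrative chain: 4 steps, 3 searches per step, 400 ms per search,
# 300 ms of non-search work per step.
sequential = chain_latency_ms(4, 3, 400, 300)
parallel = chain_latency_ms(4, 3, 400, 300, workers=3)
print(sequential)  # 6000
print(parallel)    # 2800
print(f'{sequential / parallel:.2f}x faster')  # 2.14x faster
```

In this example search accounts for 4800 of the 6000 ms in the sequential case, which is 80% of the total, and parallelizing each step's three queries brings the chain down by a little over 2x.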