Neo4j Knowledge Graphs for Generative Engine Optimization
Build a Neo4j GEO pipeline with Scavio. Schema, ingestion, and the three Cypher queries that do most of the real work.
A Neo4j case study posted to r/Agent_SEO and cross-posted to r/eCommerceSEO made the rounds this week: the author used a knowledge graph to drive generative search visibility and saw measurable lift. The technique is underrated. This post is the generalized version, with a working ingestion pipeline using Scavio.
Why Entity Graphs Beat Keywords for GEO
Keyword SEO optimized for a single string. Generative engines retrieve by entity. When ChatGPT, Claude, or Perplexity answer a query, they decompose it into entities, find citations for each entity, and compose. A brand that has no entity-level citation density across the web simply does not show up.
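The decompose-then-cite pattern can be illustrated with a toy sketch. This is not any engine's real pipeline, just the retrieval logic the paragraph describes; the entity names and the `CITATION_INDEX` structure are illustrative assumptions:

```python
# Toy illustration of entity-first retrieval (not any engine's actual
# implementation). A query decomposes into entities; each entity contributes
# only the citations it already has across the web. An entity with zero
# citation density contributes nothing to the composed answer.
CITATION_INDEX = {          # entity -> known citing sources (sample data)
    "running shoes": ["wirecutter.com", "reddit.com/r/running"],
    "BrandA": ["reddit.com/r/running", "youtube.com/@runreview"],
    "BrandB": [],           # no entity-level citations anywhere
}

def answer_sources(query_entities):
    """Collect the citation pool a generative answer can draw from."""
    pool = []
    for entity in query_entities:
        pool.extend(CITATION_INDEX.get(entity, []))
    return pool

sources = answer_sources(["running shoes", "BrandA", "BrandB"])
# BrandB contributes no sources, so it cannot appear in the composed answer.
```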
A knowledge graph in Neo4j models the brand, its products, competing products, topical authorities, and citers as nodes. Edges represent mentions, citations, rankings, and relationships. The graph lets a team identify gaps: which competitor product is mentioned in ten more Reddit threads than ours? Which topic authority has never cited us?
The Schema
Four node labels cover most GEO use cases:
- Entity: brand, product, topic, concept.
- Citer: publication, Reddit author, YouTube channel, influencer.
- Surface: Google AI Overviews, Perplexity, ChatGPT citations, Reddit thread.
- Query: the user intent string that reaches a generative engine.
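Because the ingestion below leans on MERGE, uniqueness constraints on each label's identifying property keep the graph free of duplicate nodes. A minimal sketch that generates the constraint statements (Neo4j 5 syntax; the label-to-property mapping is an assumption consistent with the ingestion code in this post):

```python
# Identifying property per label -- an assumption matching the schema above:
# Entity nodes are keyed by url, Query nodes by text, and so on.
NODE_KEYS = {
    "Entity": "url",
    "Citer": "name",
    "Surface": "name",
    "Query": "text",
}

def constraint_statements(node_keys):
    """Build one CREATE CONSTRAINT statement per label (Neo4j 5 syntax)."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in node_keys.items()
    ]

stmts = constraint_statements(NODE_KEYS)
```

Each statement can be executed once at setup time with `sess.run(stmt)`; `IF NOT EXISTS` makes the setup idempotent.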
Edges carry weight and timestamp:
```
(Citer)-[:CITES {weight, first_seen, last_seen}]->(Entity)
(Query)-[:RETURNS]->(Surface)-[:LISTS]->(Entity)
(Entity)-[:COMPETES_WITH]->(Entity)
```
Ingestion with Scavio
Populate the graph from a query list. For each query, Scavio returns typed JSON across Google SERP (with AI Overviews when present), Reddit threads, and YouTube results. Every result becomes a Cypher MERGE.
```python
import os
import requests
from neo4j import GraphDatabase

API_KEY = os.environ['SCAVIO_API_KEY']
driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', os.environ['NEO4J_PASSWORD']))

def ingest(query):
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': query, 'include_ai_overview': True})
    data = r.json()
    with driver.session() as sess:
        # Create the Query node; MERGE keeps re-runs idempotent
        sess.run("MERGE (q:Query {text: $q})", q=query)
        # AI Overview citations ("or {}" guards against an explicit null)
        ao = data.get('ai_overview') or {}
        for citation in ao.get('citations', []):
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:CITED_IN_AO {surface: 'ai_overviews'}]->(e)
            """, q=query, url=citation['url'], title=citation.get('title', ''))
        # Top ten organic results
        for result in data.get('organic_results', [])[:10]:
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:SERP_RESULT {rank: $rank}]->(e)
            """, q=query, url=result['link'],
                 title=result['title'], rank=result.get('position', 0))
```

Reddit as a Leading Indicator
Reddit citations today correlate with LLM answer citations 60 to 90 days later. Ingest Reddit threads into the graph as a separate surface and the team can predict which entities will win generative visibility before they do.
```python
def ingest_reddit(query):
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': query, 'platform': 'reddit'})
    with driver.session() as sess:
        # Ensure the Query node exists even if ingest() has not run yet
        sess.run("MERGE (q:Query {text: $q})", q=query)
        for post in r.json().get('posts', []):
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:REDDIT_MENTION {score: $score}]->(e)
            """, q=query, url=post['url'], title=post['title'],
                 score=post.get('score', 0))
```

The Useful Cypher Queries
Once the graph is populated, three Cypher queries do most of the real GEO work:
```cypher
// Competitor gap analysis: entities cited for competitor queries but not ours
MATCH (brand:Entity {name: 'OurBrand'})
MATCH (comp:Entity {name: 'CompetitorBrand'})
MATCH (comp)<-[:CITED_IN_AO]-(q:Query)
WHERE NOT (brand)<-[:CITED_IN_AO]-(q)
RETURN q.text, count(*) AS severity
ORDER BY severity DESC LIMIT 20;
```

```cypher
// Reddit leading indicators: high Reddit signal, no AI citation yet
MATCH (e:Entity)<-[r:REDDIT_MENTION]-(q:Query)
WHERE NOT (e)<-[:CITED_IN_AO]-(q)
RETURN e.title, sum(r.score) AS reddit_score
ORDER BY reddit_score DESC LIMIT 20;
```

```cypher
// Topical authority map: who cites us across which surfaces?
MATCH (c:Citer)-[r]->(e:Entity {name: 'OurBrand'})
RETURN c.name, type(r) AS surface, count(*) AS citations
ORDER BY citations DESC;
```

Why This Works for eCommerce Specifically
eCommerce brands live in entity-rich categories. A product has a name, category, attributes, competitors, and a price. The graph captures all of it, and generative engines retrieve eCommerce queries heavily by entity. A site that surfaces its products as first-class graph entities wins the citation battle versus a site that only has keyword pages.
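The product-as-entity idea can be sketched as a small helper that turns a catalog row into parameterized Cypher. The field names (`url`, `name`, `category`, `price`, `competitors`) are illustrative assumptions, not a Scavio or Neo4j convention:

```python
# Turn one catalog row into parameterized Cypher statements that model the
# product and its competitors as first-class Entity nodes in the graph.
def product_to_cypher(product):
    stmts = [(
        "MERGE (p:Entity {url: $url}) "
        "SET p.name = $name, p.category = $category, p.price = $price",
        {"url": product["url"], "name": product["name"],
         "category": product["category"], "price": product["price"]},
    )]
    for comp_url in product.get("competitors", []):
        stmts.append((
            "MATCH (p:Entity {url: $url}) "
            "MERGE (c:Entity {url: $comp}) "
            "MERGE (p)-[:COMPETES_WITH]->(c)",
            {"url": product["url"], "comp": comp_url},
        ))
    return stmts

stmts = product_to_cypher({
    "url": "https://shop.example/trail-runner",
    "name": "Trail Runner X", "category": "running shoes",
    "price": 129.0, "competitors": ["https://rival.example/peak-shoe"],
})
```

Each `(query, params)` pair feeds straight into `sess.run(query, **params)` inside the same session pattern used by the ingestion functions above.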
Operational Cost
For a 500-product catalog with 20 related queries each, the daily enrichment volume is roughly 10,000 Scavio queries, which fits the $30/mo plan with room to spare. Neo4j AuraDB free tier hosts the graph. The full stack lands under $50/mo before the team adds an LLM for composition work.
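The arithmetic behind that estimate, spelled out (the figures are the post's own, not verified Scavio or Neo4j pricing):

```python
# Cost math from the paragraph above.
products = 500
queries_per_product = 20
daily_queries = products * queries_per_product   # 10,000 queries/day
monthly_queries = daily_queries * 30             # 300,000 queries/month
stack_cost = 30 + 0  # Scavio plan ($30/mo) + Neo4j AuraDB free tier ($0)
```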
When picking the ingestion layer, pair this post with the "best API for Neo4j GEO pipelines" comparison.