Neo4j Knowledge Graphs for Generative Engine Optimization
Build a Neo4j GEO pipeline with Scavio. Schema, ingestion, and the three Cypher queries that do most of the real work.
A Neo4j case study posted to r/Agent_SEO and cross-posted to r/eCommerceSEO made the rounds this week: the author used a knowledge graph to drive generative search visibility and saw measurable lift. The technique is underrated. This post is the generalized version, with a working ingestion pipeline using Scavio.
Why Entity Graphs Beat Keywords for GEO
Keyword SEO optimized for a single string. Generative engines retrieve by entity. When ChatGPT, Claude, or Perplexity answer a query, they decompose it into entities, find citations for each entity, and compose. A brand that has no entity-level citation density across the web simply does not show up.
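The decompose-then-cite pattern can be illustrated with a toy sketch. This is not any engine's real pipeline, just the retrieval logic the paragraph describes; the entity names and the `CITATION_INDEX` structure are illustrative assumptions:

```python
# Toy illustration of entity-first retrieval (not any engine's actual
# implementation). A query decomposes into entities; each entity contributes
# only the citations it already has across the web. An entity with zero
# citation density contributes nothing to the composed answer.
CITATION_INDEX = {          # entity -> known citing sources (sample data)
    "running shoes": ["wirecutter.com", "reddit.com/r/running"],
    "BrandA": ["reddit.com/r/running", "youtube.com/@runreview"],
    "BrandB": [],           # no entity-level citations anywhere
}

def answer_sources(query_entities):
    """Collect the citation pool a generative answer can draw from."""
    pool = []
    for entity in query_entities:
        pool.extend(CITATION_INDEX.get(entity, []))
    return pool

sources = answer_sources(["running shoes", "BrandA", "BrandB"])
# BrandB contributes no sources, so it cannot appear in the composed answer.
```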
A knowledge graph in Neo4j models the brand, its products, competing products, topical authorities, and citers as nodes. Edges represent mentions, citations, rankings, and relationships. The graph lets a team identify gaps: which competitor product is mentioned in ten more Reddit threads than ours? Which topic authority has never cited us?
The Schema
Four node labels cover most GEO use cases:
- Entity: brand, product, topic, concept.
- Citer: publication, Reddit author, YouTube channel, influencer.
- Surface: Google AI Overviews, Perplexity, ChatGPT citations, Reddit thread.
- Query: the user intent string that reaches a generative engine.
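Because the ingestion below leans on MERGE, uniqueness constraints on each label's identifying property keep the graph free of duplicate nodes. A minimal sketch that generates the constraint statements (Neo4j 5 syntax; the label-to-property mapping is an assumption consistent with the ingestion code in this post):

```python
# Identifying property per label -- an assumption matching the schema above:
# Entity nodes are keyed by url, Query nodes by text, and so on.
NODE_KEYS = {
    "Entity": "url",
    "Citer": "name",
    "Surface": "name",
    "Query": "text",
}

def constraint_statements(node_keys):
    """Build one CREATE CONSTRAINT statement per label (Neo4j 5 syntax)."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label, prop in node_keys.items()
    ]

stmts = constraint_statements(NODE_KEYS)
```

Each statement can be executed once at setup time with `sess.run(stmt)`; `IF NOT EXISTS` makes the setup idempotent.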
Edges carry weight and timestamp:
```
(Citer)-[:CITES {weight, first_seen, last_seen}]->(Entity)
(Query)-[:RETURNS]->(Surface)-[:LISTS]->(Entity)
(Entity)-[:COMPETES_WITH]->(Entity)
```
Ingestion with Scavio
Populate the graph from a query list. For each query, Scavio returns typed JSON across Google SERP (with AI Overviews when present), Reddit threads, and YouTube results. Every result becomes a Cypher MERGE.
```python
import os
import requests
from neo4j import GraphDatabase

API_KEY = os.environ['SCAVIO_API_KEY']
driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', os.environ['NEO4J_PASSWORD']))

def ingest(query):
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': query, 'include_ai_overview': True})
    data = r.json()
    with driver.session() as sess:
        # Create the Query node; MERGE keeps re-runs idempotent
        sess.run("MERGE (q:Query {text: $q})", q=query)
        # AI Overview citations ("or {}" guards against an explicit null)
        ao = data.get('ai_overview') or {}
        for citation in ao.get('citations', []):
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:CITED_IN_AO {surface: 'ai_overviews'}]->(e)
            """, q=query, url=citation['url'], title=citation.get('title', ''))
        # Top ten organic results
        for result in data.get('organic_results', [])[:10]:
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:SERP_RESULT {rank: $rank}]->(e)
            """, q=query, url=result['link'],
                 title=result['title'], rank=result.get('position', 0))
```

Reddit as a Leading Indicator
Reddit citations today correlate with LLM answer citations 60 to 90 days later. Ingest Reddit threads into the graph as a separate surface and the team can predict which entities will win generative visibility before they do.
```python
def ingest_reddit(query):
    r = requests.post('https://api.scavio.dev/api/v1/search',
                      headers={'x-api-key': API_KEY},
                      json={'query': query, 'platform': 'reddit'})
    with driver.session() as sess:
        # Ensure the Query node exists even if ingest() has not run yet
        sess.run("MERGE (q:Query {text: $q})", q=query)
        for post in r.json().get('posts', []):
            sess.run("""
                MATCH (q:Query {text: $q})
                MERGE (e:Entity {url: $url})
                SET e.title = $title
                MERGE (q)-[:REDDIT_MENTION {score: $score}]->(e)
            """, q=query, url=post['url'], title=post['title'],
                 score=post.get('score', 0))
```

The Useful Cypher Queries
Once the graph is populated, three Cypher queries do most of the real GEO work:
```cypher
// Competitor gap analysis: entities cited for competitor queries but not ours
MATCH (brand:Entity {name: 'OurBrand'})
MATCH (comp:Entity {name: 'CompetitorBrand'})
MATCH (comp)<-[:CITED_IN_AO]-(q:Query)
WHERE NOT (brand)<-[:CITED_IN_AO]-(q)
RETURN q.text, count(*) AS severity
ORDER BY severity DESC LIMIT 20;
```

```cypher
// Reddit leading indicators: high Reddit signal, no AI citation yet
MATCH (e:Entity)<-[r:REDDIT_MENTION]-(q:Query)
WHERE NOT (e)<-[:CITED_IN_AO]-(q)
RETURN e.title, sum(r.score) AS reddit_score
ORDER BY reddit_score DESC LIMIT 20;
```

```cypher
// Topical authority map: who cites us across which surfaces?
MATCH (c:Citer)-[r]->(e:Entity {name: 'OurBrand'})
RETURN c.name, type(r) AS surface, count(*) AS citations
ORDER BY citations DESC;
```

Why This Works for eCommerce Specifically
eCommerce brands live in entity-rich categories. A product has a name, category, attributes, competitors, and a price. The graph captures all of it, and generative engines retrieve eCommerce queries heavily by entity. A site that surfaces its products as first-class graph entities wins the citation battle versus a site that only has keyword pages.
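The product-as-entity idea can be sketched as a small helper that turns a catalog row into parameterized Cypher. The field names (`url`, `name`, `category`, `price`, `competitors`) are illustrative assumptions, not a Scavio or Neo4j convention:

```python
# Turn one catalog row into parameterized Cypher statements that model the
# product and its competitors as first-class Entity nodes in the graph.
def product_to_cypher(product):
    stmts = [(
        "MERGE (p:Entity {url: $url}) "
        "SET p.name = $name, p.category = $category, p.price = $price",
        {"url": product["url"], "name": product["name"],
         "category": product["category"], "price": product["price"]},
    )]
    for comp_url in product.get("competitors", []):
        stmts.append((
            "MATCH (p:Entity {url: $url}) "
            "MERGE (c:Entity {url: $comp}) "
            "MERGE (p)-[:COMPETES_WITH]->(c)",
            {"url": product["url"], "comp": comp_url},
        ))
    return stmts

stmts = product_to_cypher({
    "url": "https://shop.example/trail-runner",
    "name": "Trail Runner X", "category": "running shoes",
    "price": 129.0, "competitors": ["https://rival.example/peak-shoe"],
})
```

Each `(query, params)` pair feeds straight into `sess.run(query, **params)` inside the same session pattern used by the ingestion functions above.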
Operational Cost
For a 500-product catalog with 20 related queries each, the daily enrichment volume is roughly 10,000 Scavio queries, which fits the $30/mo plan with room to spare. Neo4j AuraDB free tier hosts the graph. The full stack lands under $50/mo before the team adds an LLM for composition work.
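The arithmetic behind that estimate, spelled out (the figures are the post's own, not verified Scavio or Neo4j pricing):

```python
# Cost math from the paragraph above.
products = 500
queries_per_product = 20
daily_queries = products * queries_per_product   # 10,000 queries/day
monthly_queries = daily_queries * 30             # 300,000 queries/month
stack_cost = 30 + 0  # Scavio plan ($30/mo) + Neo4j AuraDB free tier ($0)
```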
When picking the ingestion layer, pair this post with the "best API for Neo4j GEO pipelines" comparison.