Microsoft's original GraphRAG paper showed a 40% reduction in hallucination rate on multi-hop questions compared to naive vector RAG on the same corpus. That gap comes from how the two approaches are structured: vector RAG retrieves chunks by similarity, while GraphRAG extracts relationships, builds a knowledge graph, and can answer questions that span multiple documents. The trade-off is indexing cost. Every source chunk requires 4-6 LLM calls for entity and claim extraction, making GraphRAG indexing 20-100x more expensive than embedding alone.
This guide covers the full deployment pipeline on GPU cloud infrastructure: what to provision, how to configure entity extraction with vLLM, where community detection fits, and what graph storage looks like in production. For the broader RAG infrastructure context and GPU memory planning, the agentic RAG GPU infrastructure guide covers stack co-location in depth. For a cost baseline comparing self-hosted vs managed RAG APIs, see the RAG pipeline bare metal case study.
The key infrastructure insight: entity extraction, not retrieval, is where the GPU work sits. Standard vector RAG has one GPU-intensive step (embedding the corpus). GraphRAG has three: entity extraction (one 70B LLM call per chunk), hierarchical summarization (another LLM pass per community), and query-time generation. This changes how you size and cost the infrastructure compared to a standard RAG stack.
GraphRAG vs Vector RAG vs Agentic RAG
| Architecture | Query Strength | Indexing Cost | Infra Complexity | Best For |
|---|---|---|---|---|
| Vector RAG | Per-chunk similarity | Low (embedding only) | Low | Document search, Q&A on specific passages |
| GraphRAG | Multi-hop, global summarization | High (LLM per chunk) | Medium | Cross-document reasoning, thematic queries |
| Agentic RAG | Dynamic, tool-augmented | Medium (retrieval + agent loops) | High | Complex workflows, tool use, multi-step reasoning |
Agentic RAG covers GPU memory planning, vector search co-location, and sub-200ms TTFT targets in full.
GraphRAG Pipeline Anatomy
Entity and Claim Extraction
This is the GPU bottleneck. For each chunk in your source corpus, GraphRAG sends it to an LLM with a prompt asking it to extract named entities (people, places, organizations, concepts) and the relationships between them. The default configuration uses 4-6 LLM calls per 512-token chunk: one pass for entity extraction, one for claim extraction, and additional passes for disambiguation and relationship classification.
At Llama 3.1 70B FP8, each call processes 1000-2000 input tokens. For a 1M token corpus split into 2000 chunks, you're looking at 8,000-12,000 LLM calls. At 3000 tok/sec throughput on a single H100 SXM5, that runs in 22-33 minutes of compute time.
Graph Construction
After extraction, GraphRAG converts entity/relationship pairs into a property graph. Each entity becomes a node with attributes (type, description, source chunk). Each relationship becomes a directed edge with a weight derived from co-occurrence frequency. The resulting graph has between 5x and 20x more nodes than source chunks, depending on entity density in your corpus.
Community Detection
Community detection is CPU-bound. GraphRAG uses the Leiden algorithm (implemented in the graspologic library) to cluster nodes into hierarchically nested communities: small communities of closely related entities nest inside larger topical clusters. The output is a tree of communities at multiple granularities, from fine-grained entity clusters to high-level topic groups.
For a corpus of 1M tokens, Leiden typically runs in 2-10 minutes on a 32-core CPU node. The output feeds directly into the next stage.
Hierarchical Summarization
For each community at each level of the hierarchy, GraphRAG runs another LLM pass to generate a natural language summary. These summaries are what enable global search queries. A query like "what are the main themes in this document collection?" retrieves community summaries at the top level of the hierarchy, bypassing the per-chunk index entirely.
This stage consumes roughly 30-50% as much GPU compute as entity extraction, since community count is much lower than chunk count.
Retrieval: Global vs Local Search
GraphRAG has two query modes. Global search operates on community summaries: it identifies the most relevant communities for a query, retrieves their summaries, and generates a response. This is the mode that enables cross-document reasoning and thematic questions.
Local search operates more like standard RAG: it retrieves the most relevant entities and their local neighborhood from the knowledge graph, combines them with source text chunks, and generates a focused answer. Local search is faster and cheaper per query; global search is more thorough but burns more LLM tokens.
GPU Compute Breakdown by Pipeline Stage
| Stage | Compute Type | VRAM Needed | Recommended Spheron SKU | Notes |
|---|---|---|---|---|
| Entity Extraction | GPU (LLM) | 70-140GB | H100 SXM5 (80GB, FP8) or H200 (141GB, BF16) | Main bottleneck, parallelizable |
| Embedding | GPU | 4-16GB | L40S, RTX PRO 6000, or A100 PCIe | Lightweight vs extraction |
| Community Detection | CPU | N/A | CPU-only node (16+ cores) | Leiden via graspologic |
| Hierarchical Summarization | GPU (LLM) | Same as extraction | Same instance as extraction | 30-50% of extraction cost |
| Retrieval Index Lookup | CPU/GPU | Depends on graph DB | Neo4j: CPU; Kuzu: embedded | Query-time, not indexing |
Hardware Requirements: Entity Extraction at Scale
VRAM requirements for the extraction LLM:
| Model | Precision | VRAM | Single GPU Option |
|---|---|---|---|
| Llama 3.1 70B | BF16 | ~140GB | H200 SXM5 (141GB) |
| Llama 3.1 70B | FP8 | ~70GB | H100 SXM5 (80GB) |
| Llama 3.1 8B | BF16 | ~16GB | A100 80GB PCIe |
| Llama 3.1 8B | FP8 | ~8GB | RTX 4090 (24GB) |
At 70B FP8 on a single H100 SXM5, expect 2500-3500 tokens/sec throughput for extraction prompts. At 8B BF16, throughput jumps to 10,000+ tokens/sec, at the cost of lower extraction quality.
The decision is straightforward: use 70B for production corpora where multi-hop accuracy matters. Use 8B for prototyping or very high-volume batch jobs where speed matters more than extraction precision.
For 2x H100 with tensor parallelism (tp=2), throughput roughly doubles and VRAM headroom opens up for BF16 70B. The --tensor-parallel-size 2 flag in vLLM handles the sharding automatically. For detailed VRAM planning across model sizes, the GPU memory requirements guide has a complete calculator.
For a single-GPU BF16 70B deployment without tensor parallelism, on-demand H200 instances on Spheron (141GB HBM3e) are the clean fit.
Alternative GraphRAG Stacks
Not all GraphRAG deployments use Microsoft's reference implementation. Several alternative frameworks have emerged with different trade-offs:
| Stack | Indexing Speed | Graph DB Support | Production Readiness | Best For |
|---|---|---|---|---|
| Microsoft GraphRAG | Slower (reference impl) | Parquet/CSV (native), Neo4j via plugin | High, well-documented | Standard deployments |
| LightRAG | Fast (vector+graph hybrid) | Neo4j, Milvus, Qdrant | Medium | Hybrid retrieval |
| nano-graphrag | Fast (minimal impl) | In-memory, pluggable | Low | Self-hosted/custom |
| neo4j-graphrag | Medium | Neo4j native | High | Cypher-first pipelines |
| LlamaIndex PropertyGraphIndex | Medium | Multiple backends | High | LlamaIndex workflows |
LightRAG
LightRAG combines a graph structure with vector retrieval, using a simpler entity extraction pass than Microsoft GraphRAG. Indexing is faster because it runs fewer LLM calls per chunk. The trade-off is that the graph structure is less rich: LightRAG builds entity-relationship pairs but does not run full community detection. For corpora where global summarization matters less than fast indexing and hybrid search, LightRAG is worth benchmarking against the Microsoft implementation.
nano-graphrag
A minimal Python reimplementation of the GraphRAG pipeline in a single file. No external dependencies beyond an LLM API. Storage is in-memory by default (switchable to disk). Not production-ready for large corpora, but ideal for understanding the pipeline mechanics or building a custom implementation on top of.
neo4j-graphrag
The official Neo4j Python package for GraphRAG workflows. It provides a production-grade Cypher query layer on top of the knowledge graph, full Neo4j ACID transaction support, and integration with existing Neo4j deployments. If your organization already runs Neo4j, this is the lowest-friction path.
LlamaIndex PropertyGraphIndex
LlamaIndex's GraphRAG implementation integrates directly into LlamaIndex pipelines and supports multiple graph backends including Neo4j, Nebula, and in-memory graphs. If you're already using LlamaIndex for other RAG components, PropertyGraphIndex lets you add graph retrieval without a separate framework.
Step-by-Step: Deploy GraphRAG on Spheron
Step 1: Provision GPU Instances
You need at least two instances for a minimal GraphRAG deployment:
- Extraction GPU (for entity extraction and hierarchical summarization): Spheron H100 instances at the H100 SXM5 80GB tier handle Llama 3.1 70B FP8 on a single GPU. For BF16 or for higher concurrency, use H200 SXM5 or 2x H100 with tensor parallelism.
- Embedding GPU (for document and query embedding): A100 80GB PCIe or RTX PRO 6000. The embedding model is much smaller than the extraction LLM, so any GPU with 16GB+ VRAM works.
- Graph storage node (optional, CPU only): Neo4j or Kuzu run on a separate CPU instance. Kuzu is embedded, so it can run on the same instance as the extraction LLM if VRAM is not a constraint.
Note that H100 and H200 instances on Spheron are on-demand only. If you're running 8B-class extraction on the A100 80GB PCIe, spot pricing is available and GraphRAG's per-document checkpointing means a spot interruption only loses the in-flight batch.
Step 2: Deploy vLLM for Entity Extraction
SSH into the H100 or H200 instance:
pip install vllmFor a single H100 SXM5 with FP8 quantization:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--dtype fp8 \
--tensor-parallel-size 1 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85For 2x H100 with tensor parallelism (BF16 or higher throughput):
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--dtype bfloat16 \
--tensor-parallel-size 2 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.85For multi-GPU tensor parallelism tuning and production vLLM configuration, see the vLLM production deployment guide.
The --max-model-len 8192 is intentional. Entity extraction prompts are short (1-2K input tokens), and setting a lower max model length frees VRAM for the KV cache, letting you run more concurrent extraction requests.
Step 3: Deploy the Embedding Model
On the embedding GPU instance, deploy with Text Embeddings Inference:
docker run --gpus all \
-p 8001:80 \
ghcr.io/huggingface/text-embeddings-inference:latest \
--model-id BAAI/bge-large-en-v1.5 \
--dtype float16Or with vLLM in embedding mode:
vllm serve BAAI/bge-large-en-v1.5 \
--task embed \
--port 8001For a full comparison of TEI vs vLLM embedding serving and model selection guidance, see the self-hosted embeddings and rerankers guide.
Step 4: Set Up Graph Storage
Neo4j:
docker run \
-p 7474:7474 \
-p 7687:7687 \
-e NEO4J_AUTH=neo4j/yourpassword \
-v $(pwd)/neo4j/data:/data \
neo4j:latestKuzu (embedded, no server):
pip install kuzuKuzu stores the graph on disk in a directory you specify in settings.yaml. No server process required. For single-node deployments where you don't need Cypher or graph traversal at scale, Kuzu is the simpler option.
Step 5: Configure Microsoft GraphRAG
Install and initialize (pin to a specific release to avoid schema drift - see edge cases below):
pip install graphrag==0.5.0
graphrag init --root ./ragproject0.5.0 is the last release with the settings.yaml schema this guide targets; 1.x and later restructure config. Check the upstream changelog before upgrading.
Edit ./ragproject/settings.yaml:
llm:
api_base: http://<h100-ip>:8000/v1
model: meta-llama/Llama-3.1-70B-Instruct
api_key: placeholder # vLLM ignores this, but the field is required
concurrent_requests: 16 # increase from default 4 for H100 throughput
embeddings:
llm:
api_base: http://<embedding-ip>:8001/v1
model: BAAI/bge-large-en-v1.5
api_key: placeholder
chunks:
size: 512
overlap: 100
storage:
type: file # or neo4j / kuzu depending on your setupThe settings.yaml schema changes between GraphRAG releases. Pin to graphrag==0.5.0 or your tested version and check the upstream changelog before upgrading.
Step 6: Run Indexing
Place your source documents in ./ragproject/input/ (plain text or Markdown files). Then run:
graphrag index --root ./ragprojectMonitor GPU utilization on the extraction instance:
nvidia-smi -l 1You should see GPU utilization above 80% during entity extraction. If it's low, increase llm.concurrent_requests in settings.yaml and restart the indexing run.
GraphRAG checkpoints per document. If the run is interrupted (spot instance reclaimed, OOM, etc.), restart with the same command. Already-processed documents are skipped.
Step 7: Query with Global and Local Search
Global search (thematic, cross-document questions):
graphrag query \
--root ./ragproject \
--method global \
--query "What are the main topics covered in this document collection?"Local search (entity-specific, factual questions):
graphrag query \
--root ./ragproject \
--method local \
--query "What did the board say about the Q3 budget?"For production applications, use the Python API instead of the CLI:
import asyncio
from graphrag.query.api import global_search
async def main():
results = await global_search(
config=your_config,
nodes=community_nodes_df,
entities=entity_df,
community_reports=community_report_df,
community_level=2,
response_type="multiple paragraphs",
query="What are the main themes?",
)
print(results)
asyncio.run(main())Performance Tuning
Batching entity extraction: The default llm.concurrent_requests in GraphRAG settings is 4. On an H100 with FP8 quantization, you can push this to 16-32 before the extraction LLM becomes the bottleneck. Add this to settings.yaml:
llm:
concurrent_requests: 16
request_timeout: 180.0
max_retries: 3KV cache for extraction: Extraction prompts are short. Lowering --max-model-len in vLLM frees VRAM for the KV cache, allowing more concurrent requests. --max-model-len 8192 is a good default for extraction workloads. Drop to 4096 if you need more concurrency and your prompts are consistently short.
Parallelizing community detection: The graspologic Leiden implementation uses Python multiprocessing. Set LOKY_MAX_CPU_COUNT to your CPU node's core count before running indexing:
export LOKY_MAX_CPU_COUNT=32
graphrag index --root ./ragprojectCheckpointing: GraphRAG writes a checkpoint file after each successfully processed document. Safe to kill and restart without full re-index. For spot instances, implement a simple wrapper script that monitors for instance preemption and restarts indexing automatically after the instance comes back up.
Cost Analysis: GraphRAG Indexing and Query
Indexing Cost per Million Source Tokens
| Model | GPU SKU | On-Demand Price | Throughput | LLM tokens / source token | Time | Estimated Cost |
|---|---|---|---|---|---|---|
| Llama 3.1 70B FP8 | H100 SXM5 | $4.34/hr | ~3,000 tok/sec | 4-6x | ~22-33 min | ~$1.60-$2.41 |
| Llama 3.1 8B BF16 | A100 80GB PCIe | $1.69/hr | ~10,000 tok/sec | 2-4x | ~5-10 min | ~$0.14-$0.28 |
Community detection (Leiden, CPU) and hierarchical summarization add roughly 30-50% to the extraction cost. For a 1M source token corpus with 70B FP8 extraction, total indexing cost (extraction + summarization) is roughly $2.50-$3.60 on H100 on-demand.
The 8B model indexes 4-5x faster and costs 10-15x less, but extraction quality drops noticeably on ambiguous entity mentions and implicit relationships. Use 8B for prototyping and corpus exploration; 70B for production knowledge bases where accuracy matters.
Per-Query Inference Cost
Global search retrieves 3-5 community summaries per query (each 200-500 tokens) and generates a synthesis response. At H100 on-demand rates ($4.34/hr) and 3000 tok/sec throughput, a global search query consuming 2000 total tokens costs roughly $0.0008. At high query volume (1M queries/month), that is about $800/month in H100 compute time.
Local search is cheaper: it retrieves 5-10 entity neighborhood chunks and runs a single generation pass. Cost is similar to standard RAG at the same GPU tier.
For broader inference economics and unit cost comparisons across GPU tiers, see AI inference cost economics 2026.
Real-Time vs Batch GraphRAG
Batch (pre-indexed): Build the full knowledge graph offline, query against the static index. Best for static corpora: compliance documents, product manuals, research papers, legal contracts. The index is stable; rebuild it when the corpus changes.
Incremental updates: GraphRAG 0.5.0 and later supports adding new documents to an existing index without a full re-index. Enable incremental mode in settings.yaml:
update_index_storage:
type: file
base_dir: ./ragproject/output/incrementalGraph merging is the expensive step: new entities must be matched and linked to existing nodes. For frequently updated corpora, plan for a daily or weekly incremental run rather than real-time updates.
Real-time is not viable at the indexing level for GraphRAG. The entity extraction latency (seconds per document) makes it unsuitable for streaming ingestion. The practical pattern is hybrid: use vector RAG (standard embeddings) for real-time document ingestion, and the GraphRAG knowledge graph for pre-indexed knowledge. Queries that need cross-document reasoning hit the graph index; queries that need recent information hit the vector index.
For spot vs on-demand trade-offs on batch indexing workloads, see the serverless GPU vs on-demand vs reserved guide.
Live Spheron Pricing for GraphRAG Workloads
| GPU | VRAM | On-Demand | Spot | Best For |
|---|---|---|---|---|
| H100 SXM5 | 80GB | $4.34/hr | not available | 70B FP8 entity extraction (primary) |
| H200 SXM5 | 141GB | $2.51/hr | not available | 70B BF16 extraction, single GPU |
| A100 80GB PCIe | 80GB | $1.69/hr | $1.14/hr | Cost-effective 8B extraction, embedding |
| L40S | 48GB | $0.72/hr | not available | Embedding generation, 8B extraction |
Pricing fluctuates based on GPU availability. The prices above are based on 06 May 2026 and may have changed. Check current GPU pricing → for live rates.
GraphRAG's GPU cost sits almost entirely in entity extraction. The pipeline is batch-tolerant and checkpointed per document, so interrupted runs only lose the in-flight batch. H100 and H200 instances on Spheron are on-demand only. For 8B-class extraction jobs using the A100 80GB PCIe, spot pricing at $1.14/hr offers real savings on long indexing runs. On-demand instances handle live query serving.
H100 pricing on Spheron → | On-demand H200 → | View all GPU pricing →
Quick Setup Guide
Entity extraction needs an H100 SXM5 (80GB) or 2x H100 for 70B FP8 models, or a single H200 (141GB) for BF16. Embedding needs an L40S (48GB) or RTX PRO 6000 (96GB). Graph storage (Neo4j or Kuzu) runs on a CPU-only node. H100 and H200 instances are on-demand only; for 8B-class extraction jobs where lower precision is acceptable, the A100 80GB PCIe has spot pricing available for additional cost savings.
SSH into the H100 or H200 instance. Install vLLM: pip install vllm. Launch the entity extraction LLM: vllm serve meta-llama/Llama-3.1-70B-Instruct --dtype fp8 --tensor-parallel-size 1 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.85. For 2x H100 with tensor parallelism: add --tensor-parallel-size 2.
On the embedding GPU instance (L40S or RTX PRO 6000), run TEI: docker run --gpus all -p 8001:80 ghcr.io/huggingface/text-embeddings-inference:latest --model-id BAAI/bge-large-en-v1.5 --dtype float16. Or use vLLM in embedding mode: vllm serve BAAI/bge-large-en-v1.5 --task embed --port 8001.
Neo4j: docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/yourpassword neo4j:latest. Kuzu: pip install kuzu; graph storage is embedded, no server required. Set the corresponding environment variable in GraphRAG's settings.yaml: storage.type: neo4j or storage.type: kuzu.
Install: pip install graphrag. Initialize: graphrag init --root ./ragproject. Edit settings.yaml: set llm.api_base to your vLLM endpoint (http://<h100-ip>:8000/v1), llm.model to meta-llama/Llama-3.1-70B-Instruct, embeddings.llm.api_base to your TEI endpoint (http://<l40s-ip>:8001/v1), embeddings.llm.model to BAAI/bge-large-en-v1.5. Set llm.api_key to any non-empty string (vLLM ignores it).
Place source documents in ./ragproject/input/. Run indexing: graphrag index --root ./ragproject. GraphRAG will chunk text, extract entities and claims (multiple LLM calls per chunk), build the graph, run Leiden community detection, and generate hierarchical summaries. Monitor GPU utilization with nvidia-smi -l 1. For large corpora, increase chunk_size and overlap in settings.yaml and adjust llm.concurrent_requests for throughput.
Global search (for thematic, summary-level questions): graphrag query --root ./ragproject --method global --query 'What are the main topics covered?' Local search (for entity-specific questions): graphrag query --root ./ragproject --method local --query 'What did Alice say about the project budget?'. Wire these endpoints into your application via the graphrag Python API or the built-in CLI.
Frequently Asked Questions
Vector RAG embeds documents into a flat vector index and retrieves chunks by similarity. GraphRAG first extracts entities and relationships from your corpus, builds a knowledge graph, runs community detection (typically the Leiden algorithm), and generates hierarchical summaries. At query time it can answer global questions ('what are the main themes across this corpus?') that vector RAG cannot - because vector search has no concept of document-level structure or cross-document relationships. The trade-off is that GraphRAG indexing is 20-100x more expensive than embedding: every source chunk requires multiple LLM calls for entity extraction.
Entity extraction is the GPU bottleneck. Microsoft GraphRAG's default configuration uses a 70B-class LLM for extraction, which requires ~140GB VRAM in BF16 or ~70GB in FP8. A single H100 SXM5 (80GB) with FP8 quantization handles a 70B extraction model. For BF16 or for higher throughput, two H100s with tensor parallelism (tp=2) or a single H200 (141GB) are the right sizing. The embedding stage is much lighter: an L40S (48GB) or RTX PRO 6000 (96GB) is sufficient for text-embedding-3 class models.
Partially, depending on which GPU you use. GraphRAG indexing is batch-tolerant and checkpointed per document, so a spot interruption only loses the in-flight batch. H100 and H200 spot instances are not currently available on Spheron - those GPUs are on-demand only. If you use the A100 80GB PCIe for 8B-class extraction, spot pricing is available at $1.14/hr vs $1.69/hr on-demand, a meaningful saving for long batch runs. For 70B extraction on H100 or H200, plan for on-demand rates. Run retrieval and generation on on-demand instances to guarantee query availability.
Cost depends on extraction LLM, parallelism, and corpus density. As a baseline: Microsoft GraphRAG with default settings (Llama 3.1 70B FP8 extraction, 300-token chunk overlap) runs roughly 4-6 LLM calls per 512-token chunk - about 2000-3000 input tokens per source chunk including system prompt and extracted context. At ~4x token expansion, indexing 1M source tokens requires processing 4-6M LLM input tokens. On an H100 at roughly 3000 tok/sec throughput (70B FP8, vLLM), that is 22-33 minutes of GPU time, or $1.60-$2.41 at on-demand H100 rates ($4.34/hr). Community detection and summarization add another 30-50% on top.
Neo4j is the most battle-tested option and has an official Python GraphRAG integration (neo4j-graphrag-python). Kuzu is a high-performance embedded graph database that is simpler to operate (no separate server process) and works well for single-node setups. For large-scale production with graph traversal queries across tens of millions of nodes, Neo4j's native Cypher engine is faster. For mid-scale deployments where operational simplicity matters more than query throughput, Kuzu requires less infrastructure. Both run on any Spheron CPU or GPU instance with Docker.
