Deploy GraphRAG on GPU Cloud: Knowledge Graph Construction and LLM Inference Pipeline (2026 Guide)

Microsoft's original GraphRAG paper showed a 40% reduction in hallucination rate on multi-hop questions compared to naive vector RAG on the same corpus. That gap comes from how the two approaches are structured: vector RAG retrieves chunks by similarity, while GraphRAG extracts relationships, builds a knowledge graph, and can answer questions that span multiple documents. The trade-off is indexing cost. Every source chunk requires 4-6 LLM calls for entity and claim extraction, making GraphRAG indexing 20-100x more expensive than embedding alone.

This guide covers the full deployment pipeline on GPU cloud infrastructure: what to provision, how to configure entity extraction with vLLM, where community detection fits, and what graph storage looks like in production. For the broader RAG infrastructure context and GPU memory planning, the agentic RAG GPU infrastructure guide covers stack co-location in depth. For a cost baseline comparing self-hosted vs managed RAG APIs, see the RAG pipeline bare metal case study.

The key infrastructure insight: entity extraction, not retrieval, is where the GPU work sits. Standard vector RAG has one GPU-intensive step (embedding the corpus). GraphRAG has three: entity extraction (4-6 calls to a 70B LLM per chunk), hierarchical summarization (another LLM pass per community), and query-time generation. This changes how you size and cost the infrastructure compared to a standard RAG stack.

GraphRAG vs Vector RAG vs Agentic RAG

Architecture	Query Strength	Indexing Cost	Infra Complexity	Best For
Vector RAG	Per-chunk similarity	Low (embedding only)	Low	Document search, Q&A on specific passages
GraphRAG	Multi-hop, global summarization	High (LLM per chunk)	Medium	Cross-document reasoning, thematic queries
Agentic RAG	Dynamic, tool-augmented	Medium (retrieval + agent loops)	High	Complex workflows, tool use, multi-step reasoning

The agentic RAG GPU infrastructure guide covers GPU memory planning, vector search co-location, and sub-200ms TTFT targets in full.

GraphRAG Pipeline Anatomy

Entity and Claim Extraction

This is the GPU bottleneck. For each chunk in your source corpus, GraphRAG sends it to an LLM with a prompt asking it to extract named entities (people, places, organizations, concepts) and the relationships between them. The default configuration uses 4-6 LLM calls per 512-token chunk: one pass for entity extraction, one for claim extraction, and additional passes for disambiguation and relationship classification.

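With Llama 3.1 70B FP8, each call processes 1000-2000 input tokens. A 1M-token corpus split into roughly 2,000 chunks of 512 tokens means 8,000-12,000 LLM calls. The 22-33 minute figure used throughout this guide assumes 4-6 LLM tokens processed per source token (4-6M tokens in total) at 3,000 tok/sec on a single H100 SXM5. Prompts at the upper end of the 1000-2000 token range push total tokens, and runtime, above that estimate, so benchmark a sample of your corpus before budgeting.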
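The arithmetic behind those figures can be sketched as follows. This is a rough estimator, not a benchmark: it ignores chunk overlap, and it takes the 4-6x LLM-tokens-per-source-token expansion from the cost table later in this guide as its main assumption. The function name and parameters are illustrative.

```python
import math

def extraction_estimate(corpus_tokens, chunk_tokens, calls_per_chunk,
                        llm_tokens_per_source_token, tok_per_sec):
    """Return (chunks, llm_calls, minutes) for one extraction pass.

    Assumes non-overlapping chunks and a steady aggregate throughput.
    """
    chunks = math.ceil(corpus_tokens / chunk_tokens)
    llm_calls = chunks * calls_per_chunk
    llm_tokens = corpus_tokens * llm_tokens_per_source_token
    minutes = llm_tokens / tok_per_sec / 60
    return chunks, llm_calls, minutes

# Low and high ends of the range for a 1M-token corpus on one H100 (70B FP8).
low = extraction_estimate(1_000_000, 512, 4, 4, 3_000)
high = extraction_estimate(1_000_000, 512, 6, 6, 3_000)
print(f"chunks={low[0]}, calls={low[1]}-{high[1]}, "
      f"minutes={low[2]:.1f}-{high[2]:.1f}")
```

Swapping in a measured tokens-per-chunk figure from a sample run gives a far better estimate than the expansion factor alone.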

Graph Construction

After extraction, GraphRAG converts entity/relationship pairs into a property graph. Each entity becomes a node with attributes (type, description, source chunk). Each relationship becomes a directed edge with a weight derived from co-occurrence frequency. The resulting graph has between 5x and 20x more nodes than source chunks, depending on entity density in your corpus.
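A minimal sketch of that conversion, using plain dictionaries rather than GraphRAG's internal data structures. The input shape (a chunk ID with lists of entity and relationship tuples) is an assumption made for illustration; the edge weight is a simple count of how many times a relationship was extracted.

```python
from collections import defaultdict

def build_graph(extractions):
    """Merge per-chunk extractions into nodes and weighted directed edges.

    extractions: iterable of (chunk_id, entities, relationships), where
      entities is a list of (name, type, description) and
      relationships is a list of (source_name, target_name).
    """
    nodes = {}
    edge_weights = defaultdict(int)
    for chunk_id, entities, relationships in extractions:
        for name, entity_type, description in entities:
            node = nodes.setdefault(
                name, {"type": entity_type, "descriptions": [], "source_chunks": []}
            )
            node["descriptions"].append(description)
            node["source_chunks"].append(chunk_id)
        for source, target in relationships:
            # Each repeated mention of the same relationship raises its weight.
            edge_weights[(source, target)] += 1
    return nodes, dict(edge_weights)

sample = [
    ("c1", [("Acme", "organization", "A company"), ("Bob", "person", "CFO")],
     [("Bob", "Acme")]),
    ("c2", [("Acme", "organization", "Acme Corp"), ("Bob", "person", "Finance lead")],
     [("Bob", "Acme")]),
]
nodes, edges = build_graph(sample)
print(len(nodes), edges)
```

Entities seen in several chunks collapse into one node that keeps every description and source chunk, which is the material later stages summarize.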

Community Detection

Community detection is CPU-bound. GraphRAG uses the Leiden algorithm (implemented in the graspologic library) to cluster nodes into hierarchically nested communities: small communities of closely related entities nest inside larger topical clusters. The output is a tree of communities at multiple granularities, from fine-grained entity clusters to high-level topic groups.

For a corpus of 1M tokens, Leiden typically runs in 2-10 minutes on a 32-core CPU node. The output feeds directly into the next stage.
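To make the nested output concrete, the sketch below turns flat cluster assignments into a community tree. The record fields (node, cluster, parent_cluster, level) mirror what hierarchical Leiden implementations typically return, but the exact shape is an assumption here, as are globally unique cluster IDs and level 0 being the coarsest level.

```python
from collections import defaultdict
from typing import NamedTuple, Optional

class ClusterRecord(NamedTuple):
    node: str
    cluster: int
    parent_cluster: Optional[int]  # None for top-level communities
    level: int                     # 0 = coarsest (assumed)

def build_community_tree(records):
    """Group nodes by community and link each community to its parent."""
    members = defaultdict(set)    # cluster id -> member nodes
    children = defaultdict(set)   # parent cluster id -> child cluster ids
    level_of = {}                 # cluster id -> hierarchy level
    for record in records:
        members[record.cluster].add(record.node)
        level_of[record.cluster] = record.level
        if record.parent_cluster is not None:
            children[record.parent_cluster].add(record.cluster)
    return dict(members), dict(children), level_of

records = [
    ClusterRecord("a", 0, None, 0),
    ClusterRecord("b", 0, None, 0),
    ClusterRecord("c", 0, None, 0),
    ClusterRecord("a", 1, 0, 1),
    ClusterRecord("b", 1, 0, 1),
    ClusterRecord("c", 2, 0, 1),
]
members, children, level_of = build_community_tree(records)
print(members, children, level_of)
```

Each key in `members` becomes one summarization job in the next stage, so the number of communities, not the number of chunks, drives that stage's LLM call count.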

Hierarchical Summarization

For each community at each level of the hierarchy, GraphRAG runs another LLM pass to generate a natural language summary. These summaries are what enable global search queries. A query like "what are the main themes in this document collection?" retrieves community summaries at the top level of the hierarchy, bypassing the per-chunk index entirely.

This stage consumes roughly 30-50% as much GPU compute as entity extraction, since community count is much lower than chunk count.

Retrieval: Global vs Local Search

GraphRAG has two query modes. Global search operates on community summaries: it identifies the most relevant communities for a query, retrieves their summaries, and generates a response. This is the mode that enables cross-document reasoning and thematic questions.

Local search operates more like standard RAG: it retrieves the most relevant entities and their local neighborhood from the knowledge graph, combines them with source text chunks, and generates a focused answer. Local search is faster and cheaper per query; global search is more thorough but burns more LLM tokens.
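The difference between the two modes comes down to what gets assembled into the prompt. The sketch below is a simplified illustration, not GraphRAG's implementation: relevance ranking against the query is left out, and the data shapes (a weighted adjacency dict, a chunk index per entity, a list of summary dicts) are assumptions.

```python
def local_search_context(entity, adjacency, chunks_by_entity, max_neighbors=5):
    """Collect an entity, its strongest neighbors, and their source chunks."""
    ranked = sorted(adjacency.get(entity, {}).items(), key=lambda kv: -kv[1])
    names = [entity] + [name for name, _ in ranked[:max_neighbors]]
    chunk_ids = []
    for name in names:
        for chunk_id in chunks_by_entity.get(name, []):
            if chunk_id not in chunk_ids:  # keep order, drop duplicates
                chunk_ids.append(chunk_id)
    return names, chunk_ids

def global_search_context(summaries, level=0):
    """Return the community summaries at one level of the hierarchy."""
    return [s["text"] for s in summaries if s["level"] == level]

adjacency = {"Acme": {"Bob": 3, "Paris": 1, "Q3 budget": 2}}
chunks_by_entity = {"Acme": ["c1"], "Bob": ["c1", "c2"],
                    "Q3 budget": ["c3"], "Paris": ["c4"]}
summaries = [{"level": 0, "text": "Finance themes"},
             {"level": 1, "text": "Q3 budget details"}]

names, chunk_ids = local_search_context("Acme", adjacency, chunks_by_entity,
                                        max_neighbors=2)
print(names, chunk_ids)
print(global_search_context(summaries, level=0))
```

Local context stays small because it is bounded by `max_neighbors`; global context grows with the number of communities at the chosen level, which is why it costs more tokens per query.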

GPU Compute Breakdown by Pipeline Stage

Stage	Compute Type	VRAM Needed	Recommended Spheron SKU	Notes
Entity Extraction	GPU (LLM)	70-140GB	H100 SXM5 (80GB, FP8) or H200 (141GB, BF16)	Main bottleneck, parallelizable
Embedding	GPU	4-16GB	L40S, RTX PRO 6000, or A100 PCIe	Lightweight vs extraction
Community Detection	CPU	N/A	CPU-only node (16+ cores)	Leiden via graspologic
Hierarchical Summarization	GPU (LLM)	Same as extraction	Same instance as extraction	30-50% of extraction cost
Retrieval Index Lookup	CPU/GPU	Depends on graph DB	Neo4j: CPU; Kuzu: embedded	Query-time, not indexing

Hardware Requirements: Entity Extraction at Scale

VRAM requirements for the extraction LLM:

Model	Precision	VRAM	Single GPU Option
Llama 3.1 70B	BF16	~140GB	H200 SXM5 (141GB)
Llama 3.1 70B	FP8	~70GB	H100 SXM5 (80GB)
Llama 3.1 8B	BF16	~16GB	A100 80GB PCIe
Llama 3.1 8B	FP8	~8GB	RTX 4090 (24GB)
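The VRAM column follows from parameter count times bytes per parameter. The sketch below reproduces it for weights only; KV cache, activations, and runtime overhead come on top, which is why a model whose weights nearly fill a card leaves little room for concurrency.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}

def weight_vram_gb(params_billion, precision):
    """Approximate VRAM for model weights alone, in GB (1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]

def weights_fit(params_billion, precision, gpu_vram_gb):
    """True if the weights alone fit; says nothing about KV cache headroom."""
    return weight_vram_gb(params_billion, precision) < gpu_vram_gb

for params, precision in [(70, "bf16"), (70, "fp8"), (8, "bf16"), (8, "fp8")]:
    print(f"{params}B {precision}: ~{weight_vram_gb(params, precision)} GB")
```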

At 70B FP8 on a single H100 SXM5, expect 2500-3500 tokens/sec throughput for extraction prompts. At 8B BF16, throughput jumps to 10,000+ tokens/sec, at the cost of lower extraction quality.

The decision is straightforward: use 70B for production corpora where multi-hop accuracy matters. Use 8B for prototyping or very high-volume batch jobs where speed matters more than extraction precision.

For 2x H100 with tensor parallelism (tp=2), throughput rises, though usually by somewhat less than 2x because of inter-GPU communication, and the combined 160GB makes BF16 70B possible, with limited room left over for KV cache. The --tensor-parallel-size 2 flag in vLLM handles the sharding automatically. For detailed VRAM planning across model sizes, the GPU memory requirements guide has a complete calculator.

For a single-GPU BF16 70B deployment without tensor parallelism, on-demand H200 instances on Spheron (141GB HBM3e) are the clean fit.

Alternative GraphRAG Stacks

Not all GraphRAG deployments use Microsoft's reference implementation. Several alternative frameworks have emerged with different trade-offs:

Stack	Indexing Speed	Graph DB Support	Production Readiness	Best For
Microsoft GraphRAG	Slower (reference impl)	Parquet/CSV (native), Neo4j via plugin	High, well-documented	Standard deployments
LightRAG	Fast (vector+graph hybrid)	Neo4j, Milvus, Qdrant	Medium	Hybrid retrieval
nano-graphrag	Fast (minimal impl)	In-memory, pluggable	Low	Self-hosted/custom
neo4j-graphrag	Medium	Neo4j native	High	Cypher-first pipelines
LlamaIndex PropertyGraphIndex	Medium	Multiple backends	High	LlamaIndex workflows

LightRAG

LightRAG combines a graph structure with vector retrieval, using a simpler entity extraction pass than Microsoft GraphRAG. Indexing is faster because it runs fewer LLM calls per chunk. The trade-off is that the graph structure is less rich: LightRAG builds entity-relationship pairs but does not run full community detection. For corpora where global summarization matters less than fast indexing and hybrid search, LightRAG is worth benchmarking against the Microsoft implementation.

nano-graphrag

A minimal Python reimplementation of the GraphRAG pipeline in a single file. No external dependencies beyond an LLM API. Storage is in-memory by default (switchable to disk). Not production-ready for large corpora, but ideal for understanding the pipeline mechanics or building a custom implementation on top of.

neo4j-graphrag

The official Neo4j Python package for GraphRAG workflows. It provides a production-grade Cypher query layer on top of the knowledge graph, full Neo4j ACID transaction support, and integration with existing Neo4j deployments. If your organization already runs Neo4j, this is the lowest-friction path.

LlamaIndex PropertyGraphIndex

LlamaIndex's GraphRAG implementation integrates directly into LlamaIndex pipelines and supports multiple graph backends including Neo4j, Nebula, and in-memory graphs. If you're already using LlamaIndex for other RAG components, PropertyGraphIndex lets you add graph retrieval without a separate framework.

Step-by-Step: Deploy GraphRAG on Spheron

Step 1: Provision GPU Instances

You need at least two instances for a minimal GraphRAG deployment:

Extraction GPU (for entity extraction and hierarchical summarization): Spheron H100 instances at the H100 SXM5 80GB tier handle Llama 3.1 70B FP8 on a single GPU. For BF16 or for higher concurrency, use H200 SXM5 or 2x H100 with tensor parallelism.
Embedding GPU (for document and query embedding): A100 80GB PCIe or RTX PRO 6000. The embedding model is much smaller than the extraction LLM, so any GPU with 16GB+ VRAM works.
Graph storage node (optional, CPU only): Neo4j or Kuzu run on a separate CPU instance. Kuzu is embedded, so it can run on the same instance as the extraction LLM if VRAM is not a constraint.

Note that H100 and H200 instances on Spheron are on-demand only. If you're running 8B-class extraction on the A100 80GB PCIe, spot pricing is available and GraphRAG's per-document checkpointing means a spot interruption only loses the in-flight batch.

Step 2: Deploy vLLM for Entity Extraction

SSH into the H100 or H200 instance:

bash

pip install vllm

For a single H100 SXM5 with FP8 quantization:

bash

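# FP8 is selected with --quantization; --dtype only accepts float types.
# FP8 70B weights fill most of an 80GB card, so leave vLLM as much as possible for KV cache.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95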

For 2x H100 with tensor parallelism (BF16 or higher throughput):

bash

# BF16 70B weights take most of 2x 80GB; a high utilization setting keeps some room for KV cache.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95

For multi-GPU tensor parallelism tuning and production vLLM configuration, see the vLLM production deployment guide.

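The --max-model-len 8192 setting is intentional. Entity extraction prompts are short (1-2K input tokens), and a lower maximum caps how much KV cache any single sequence can claim, so the same cache serves more concurrent extraction requests. It also stops vLLM from trying to reserve room for Llama 3.1's full 128K context at startup, which would not fit alongside the weights.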
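A back-of-the-envelope calculation shows why sequence length matters so much for concurrency. The sketch below uses Llama 3.1 70B's published architecture (80 layers, 8 KV heads, head dimension 128) and assumes a 16-bit KV cache; the 8 GiB cache budget is a hypothetical figure, since the real budget is whatever vLLM has left after loading weights.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache bytes for one token: keys and values across every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def max_concurrent_sequences(kv_budget_gib, tokens_per_sequence):
    """How many sequences of a given length fit in the KV cache budget."""
    per_sequence = kv_bytes_per_token() * tokens_per_sequence
    return (kv_budget_gib * 1024**3) // per_sequence

print(kv_bytes_per_token())                # bytes per token
print(max_concurrent_sequences(8, 2048))   # typical extraction prompt + output
print(max_concurrent_sequences(8, 8192))   # sequences that use the full window
```

At roughly 320 KiB per token, a 2K-token extraction request needs about 0.6 GiB of cache, so the number of in-flight requests is bounded by cache size long before compute is saturated.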

Step 3: Deploy the Embedding Model

On the embedding GPU instance, deploy with Text Embeddings Inference:

bash

docker run --gpus all \
  -p 8001:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-large-en-v1.5 \
  --dtype float16

Or with vLLM in embedding mode:

bash

vllm serve BAAI/bge-large-en-v1.5 \
  --task embed \
  --port 8001

For a full comparison of TEI vs vLLM embedding serving and model selection guidance, see the self-hosted embeddings and rerankers guide.
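Before pointing GraphRAG at the embedding endpoint, it is worth confirming that it answers OpenAI-style requests. The sketch below builds and sends a request to the /v1/embeddings route using only the standard library. The IP address in the usage line is a placeholder, and the response parsing assumes the usual OpenAI-compatible shape (a `data` list of objects with an `embedding` field).

```python
import json
import urllib.request

def build_embedding_request(base_url, model, texts, api_key="placeholder"):
    """Build a POST request for an OpenAI-compatible /embeddings route."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def embed(base_url, model, texts, timeout=30):
    """Send the request and return one vector per input text."""
    request = build_embedding_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]

# Example (placeholder address): build the request without sending it.
request = build_embedding_request(
    "http://10.0.0.2:8001/v1", "BAAI/bge-large-en-v1.5", ["hello world"]
)
print(request.full_url)
```

Calling `embed(...)` with the same arguments against a live server should return one vector per input string; a failure here is much easier to debug than a failure halfway through an indexing run.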

Step 4: Set Up Graph Storage

Neo4j:

bash

docker run \
  -p 7474:7474 \
  -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/yourpassword \
  -v $(pwd)/neo4j/data:/data \
  neo4j:latest

Kuzu (embedded, no server):

bash

pip install kuzu

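Kuzu stores the graph on disk in a directory you specify in settings.yaml. No server process is required. Kuzu also speaks Cypher, so the trade-off against Neo4j is operational rather than about query language: for single-node deployments that don't need a shared database server or graph traversal at scale, Kuzu is the simpler option.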

Step 5: Configure Microsoft GraphRAG

Install and initialize (pin to a specific release to avoid schema drift - see the note below):

bash

pip install graphrag==0.5.0
graphrag init --root ./ragproject

0.5.0 is the last release with the settings.yaml schema this guide targets; 1.x and later restructure config. Check the upstream changelog before upgrading.

Edit ./ragproject/settings.yaml:

yaml

llm:
  api_base: http://<h100-ip>:8000/v1
  model: meta-llama/Llama-3.1-70B-Instruct
  api_key: placeholder  # vLLM ignores this, but the field is required
  concurrent_requests: 16  # increase from default 4 for H100 throughput

embeddings:
  llm:
    api_base: http://<embedding-ip>:8001/v1
    model: BAAI/bge-large-en-v1.5
    api_key: placeholder

chunks:
  size: 512
  overlap: 100

storage:
  type: file  # native Parquet output; confirm neo4j / kuzu support in your pinned version before switching

The settings.yaml schema changes between GraphRAG releases. Pin to graphrag==0.5.0 or your tested version and check the upstream changelog before upgrading.

Step 6: Run Indexing

Place your source documents in ./ragproject/input/ (plain text or Markdown files). Then run:

bash

graphrag index --root ./ragproject

Monitor GPU utilization on the extraction instance:

bash

nvidia-smi -l 1

You should see GPU utilization above 80% during entity extraction. If it's low, increase llm.concurrent_requests in settings.yaml and restart the indexing run.
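For long runs, the same check can be automated. The sketch below samples utilization through nvidia-smi's CSV query mode and flags when the average falls below the 80% mark mentioned above. The threshold and function names are illustrative; the parsing function is separated out so it can be exercised without a GPU.

```python
import subprocess

def parse_utilization(csv_output):
    """Parse one utilization percentage per line (one line per GPU)."""
    return [int(line.strip()) for line in csv_output.splitlines() if line.strip()]

def read_gpu_utilization():
    """Query current utilization for every GPU on this host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(result.stdout)

def should_raise_concurrency(samples, threshold=80):
    """True when average utilization is below the threshold."""
    return bool(samples) and sum(samples) / len(samples) < threshold

# Works on captured output, so it can be tried without a GPU attached.
samples = parse_utilization("45\n52\n")
print(samples, should_raise_concurrency(samples))
```

On the extraction instance, `should_raise_concurrency(read_gpu_utilization())` gives a quick signal; averaging several samples over a minute is more reliable than a single reading.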

GraphRAG checkpoints per document. If the run is interrupted (spot instance reclaimed, OOM, etc.), restart with the same command. Already-processed documents are skipped.

Step 7: Query with Global and Local Search

Global search (thematic, cross-document questions):

bash

graphrag query \
  --root ./ragproject \
  --method global \
  --query "What are the main topics covered in this document collection?"

Local search (entity-specific, factual questions):

bash

graphrag query \
  --root ./ragproject \
  --method local \
  --query "What did the board say about the Q3 budget?"

For production applications, use the Python API instead of the CLI:

python

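import asyncio

# Import path as used for the release this guide targets; it has moved between
# GraphRAG versions, so check it against your pinned install.
from graphrag.query.api import global_search

# your_config and the three DataFrames below are placeholders: load the project
# config and the index output tables for your project before calling main().

async def main():
    results = await global_search(
        config=your_config,
        nodes=community_nodes_df,
        entities=entity_df,
        community_reports=community_report_df,
        community_level=2,  # depth in the community hierarchy to search
        response_type="multiple paragraphs",
        query="What are the main themes?",
    )
    print(results)

asyncio.run(main())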

Performance Tuning

Batching entity extraction: The default llm.concurrent_requests in GraphRAG settings is 4. On an H100 with FP8 quantization, you can push this to 16-32 before the extraction LLM becomes the bottleneck. Add this to settings.yaml:

yaml

llm:
  concurrent_requests: 16
  request_timeout: 180.0
  max_retries: 3

KV cache for extraction: Extraction prompts are short. Lowering --max-model-len in vLLM frees VRAM for the KV cache, allowing more concurrent requests. --max-model-len 8192 is a good default for extraction workloads. Drop to 4096 if you need more concurrency and your prompts are consistently short.

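Parallelizing community detection: Leiden runs on the CPU through graspologic. Setting LOKY_MAX_CPU_COUNT to your CPU node's core count before indexing sizes any joblib/loky worker pools to the hardware. How much the Leiden stage itself speeds up depends on how your graspologic build is threaded, so time the stage before and after: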

bash

export LOKY_MAX_CPU_COUNT=32
graphrag index --root ./ragproject

Checkpointing: GraphRAG writes a checkpoint file after each successfully processed document, so it is safe to kill and restart a run without a full re-index. For spot instances, implement a simple wrapper script that monitors for instance preemption and restarts indexing automatically after the instance comes back up.
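One possible shape for that wrapper is a retry loop around the indexing command. This is a sketch under assumptions: the attempt count and wait time are arbitrary, and because a preemption kills the wrapper along with the instance, it needs to be started again at boot (for example from a systemd unit or an @reboot cron entry) to resume after the instance returns. The `runner` and `sleep` parameters exist only so the loop can be tested without running GraphRAG.

```python
import subprocess
import time

def run_until_done(cmd, max_attempts=5, wait_seconds=60,
                   runner=subprocess.run, sleep=time.sleep):
    """Re-run cmd until it exits 0; return the number of attempts used.

    Relies on the indexer skipping already-processed documents on restart.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # give the endpoint or instance time to recover
    raise RuntimeError(f"indexing failed after {max_attempts} attempts")

INDEX_CMD = ["graphrag", "index", "--root", "./ragproject"]
# On the indexing host: run_until_done(INDEX_CMD)
```

Bounding the number of attempts matters: a persistent failure such as a bad config should surface as an error rather than loop forever on a paid instance.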

Cost Analysis: GraphRAG Indexing and Query

Indexing Cost per Million Source Tokens

Model	GPU SKU	On-Demand Price	Throughput	LLM tokens / source token	Time	Estimated Cost
Llama 3.1 70B FP8	H100 SXM5	$4.34/hr	~3,000 tok/sec	4-6x	~22-33 min	~$1.60-$2.41
Llama 3.1 8B BF16	A100 80GB PCIe	$1.69/hr	~10,000 tok/sec	2-4x	~3-7 min	~$0.09-$0.19

Community detection (Leiden, CPU) and hierarchical summarization add roughly 30-50% to the extraction cost. For a 1M source token corpus with 70B FP8 extraction, total indexing cost (extraction + summarization) is roughly $2.10-$3.60 on H100 on-demand.

The 8B model indexes roughly 5-7x faster and costs roughly 13-17x less, but extraction quality drops noticeably on ambiguous entity mentions and implicit relationships. Use 8B for prototyping and corpus exploration; 70B for production knowledge bases where accuracy matters.
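The table's 70B row can be reproduced from its own inputs, which makes it easy to re-run with your measured throughput or current prices. The sketch below assumes steady throughput and applies the 30-50% summarization overhead to the low and high ends respectively; the function and constant names are illustrative.

```python
def gpu_cost_usd(llm_tokens, tok_per_sec, price_per_hour):
    """Cost of processing llm_tokens at a steady throughput on one GPU."""
    gpu_hours = llm_tokens / tok_per_sec / 3600
    return gpu_hours * price_per_hour

SOURCE_TOKENS = 1_000_000
H100_PRICE = 4.34    # $/hr on-demand, from the table above
THROUGHPUT = 3_000   # tok/sec, Llama 3.1 70B FP8

# 4-6 LLM tokens processed per source token.
extraction_low = gpu_cost_usd(SOURCE_TOKENS * 4, THROUGHPUT, H100_PRICE)
extraction_high = gpu_cost_usd(SOURCE_TOKENS * 6, THROUGHPUT, H100_PRICE)

# Summarization adds 30-50% on top of extraction.
total_low = extraction_low * 1.30
total_high = extraction_high * 1.50

print(f"extraction: ${extraction_low:.2f}-${extraction_high:.2f}")
print(f"total:      ${total_low:.2f}-${total_high:.2f}")
```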

Per-Query Inference Cost

Global search retrieves 3-5 community summaries per query (each 200-500 tokens) and generates a synthesis response. At H100 on-demand rates ($4.34/hr) and 3000 tok/sec throughput, a global search query consuming 2000 total tokens costs roughly $0.0008. At high query volume (1M queries/month), that is about $800/month in H100 compute time.
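The per-query figure is compute time, not the bill for an always-on instance, and the gap between the two is worth seeing. The sketch below reuses the numbers from the paragraph above; the 30-day month is an assumption.

```python
SECONDS_PER_HOUR = 3600

def query_cost_usd(tokens, tok_per_sec, price_per_hour):
    """GPU cost attributable to one query at steady throughput."""
    return tokens / tok_per_sec / SECONDS_PER_HOUR * price_per_hour

def busy_gpu_hours(tokens, tok_per_sec, queries):
    """Total GPU hours spent serving the given number of queries."""
    return tokens / tok_per_sec * queries / SECONDS_PER_HOUR

per_query = query_cost_usd(2000, 3000, 4.34)
monthly_compute = per_query * 1_000_000
hours_busy = busy_gpu_hours(2000, 3000, 1_000_000)
always_on = 24 * 30 * 4.34  # one H100 running for a 30-day month

print(f"per query:       ${per_query:.4f}")
print(f"monthly compute: ${monthly_compute:.0f} ({hours_busy:.0f} GPU hours)")
print(f"always-on H100:  ${always_on:.2f}")
```

One million queries keep a single H100 busy for only part of the month, so the effective cost per query depends on how much of the remaining time the instance is shared with other work or shut down.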

Local search is cheaper: it retrieves 5-10 entity neighborhood chunks and runs a single generation pass. Cost is similar to standard RAG at the same GPU tier.

For broader inference economics and unit cost comparisons across GPU tiers, see AI inference cost economics 2026.

Real-Time vs Batch GraphRAG

Batch (pre-indexed): Build the full knowledge graph offline, query against the static index. Best for static corpora: compliance documents, product manuals, research papers, legal contracts. The index is stable; rebuild it when the corpus changes.

Incremental updates: GraphRAG 0.5.0 and later supports adding new documents to an existing index without a full re-index. Enable incremental mode in settings.yaml:

yaml

update_index_storage:
  type: file
  base_dir: ./ragproject/output/incremental

Graph merging is the expensive step: new entities must be matched and linked to existing nodes. For frequently updated corpora, plan for a daily or weekly incremental run rather than real-time updates.

Real-time is not viable at the indexing level for GraphRAG. The entity extraction latency (seconds per document) makes it unsuitable for streaming ingestion. The practical pattern is hybrid: use vector RAG (standard embeddings) for real-time document ingestion, and the GraphRAG knowledge graph for pre-indexed knowledge. Queries that need cross-document reasoning hit the graph index; queries that need recent information hit the vector index.

For spot vs on-demand trade-offs on batch indexing workloads, see the serverless GPU vs on-demand vs reserved guide.

Live Spheron Pricing for GraphRAG Workloads

GPU	VRAM	On-Demand	Spot	Best For
H100 SXM5	80GB	$4.34/hr	not available	70B FP8 entity extraction (primary)
H200 SXM5	141GB	$2.51/hr	not available	70B BF16 extraction, single GPU
A100 80GB PCIe	80GB	$1.69/hr	$1.14/hr	Cost-effective 8B extraction, embedding
L40S	48GB	$0.72/hr	not available	Embedding generation, 8B extraction

Pricing fluctuates based on GPU availability. The prices above are as of 06 May 2026 and may have changed. Check the current GPU pricing page for live rates.

GraphRAG's GPU cost sits almost entirely in entity extraction. The pipeline is batch-tolerant and checkpointed per document, so interrupted runs only lose the in-flight batch. H100 and H200 instances on Spheron are on-demand only. For 8B-class extraction jobs using the A100 80GB PCIe, spot pricing at $1.14/hr offers real savings on long indexing runs. On-demand instances handle live query serving.

Quick Setup Guide

Provision GPU instances on Spheron for each pipeline stage
Entity extraction needs a single H100 SXM5 (80GB) for 70B FP8, or 2x H100 or a single H200 (141GB) for BF16. Embedding needs an L40S (48GB) or RTX PRO 6000 (96GB). Graph storage (Neo4j or Kuzu) runs on a CPU-only node. H100 and H200 instances are on-demand only; for 8B-class extraction jobs where lower extraction quality is acceptable, the A100 80GB PCIe has spot pricing available for additional cost savings.
Deploy vLLM for entity extraction
SSH into the H100 or H200 instance. Install vLLM: pip install vllm. Launch the entity extraction LLM: vllm serve meta-llama/Llama-3.1-70B-Instruct --quantization fp8 --tensor-parallel-size 1 --port 8000 --max-model-len 8192 --gpu-memory-utilization 0.95. For 2x H100 with tensor parallelism, set --tensor-parallel-size 2 instead.
Deploy the embedding model with Text Embeddings Inference
On the embedding GPU instance (L40S or RTX PRO 6000), run TEI: docker run --gpus all -p 8001:80 ghcr.io/huggingface/text-embeddings-inference:latest --model-id BAAI/bge-large-en-v1.5 --dtype float16. Or use vLLM in embedding mode: vllm serve BAAI/bge-large-en-v1.5 --task embed --port 8001.
Set up graph storage with Neo4j or Kuzu
Neo4j: docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/yourpassword neo4j:latest. Kuzu: pip install kuzu; graph storage is embedded, no server required. Then set the storage.type key in GraphRAG's settings.yaml to match (neo4j or kuzu), after confirming your pinned GraphRAG version supports that backend.
Configure Microsoft GraphRAG for self-hosted LLM endpoints
Install a pinned release: pip install graphrag==0.5.0. Initialize: graphrag init --root ./ragproject. Edit settings.yaml: set llm.api_base to your vLLM endpoint (http://<h100-ip>:8000/v1), llm.model to meta-llama/Llama-3.1-70B-Instruct, embeddings.llm.api_base to your TEI endpoint (http://<l40s-ip>:8001/v1), and embeddings.llm.model to BAAI/bge-large-en-v1.5. Set llm.api_key to any non-empty string (vLLM ignores it).
Run indexing and community detection
Place source documents in ./ragproject/input/. Run indexing: graphrag index --root ./ragproject. GraphRAG will chunk text, extract entities and claims (multiple LLM calls per chunk), build the graph, run Leiden community detection, and generate hierarchical summaries. Monitor GPU utilization with nvidia-smi -l 1. For large corpora, tune chunks.size and chunks.overlap in settings.yaml (larger chunks mean fewer LLM calls) and adjust llm.concurrent_requests for throughput.
Query with global and local search
Global search (for thematic, summary-level questions): graphrag query --root ./ragproject --method global --query 'What are the main topics covered?'. Local search (for entity-specific questions): graphrag query --root ./ragproject --method local --query 'What did Alice say about the project budget?'. Wire these queries into your application via the graphrag Python API or the CLI.

Frequently Asked Questions

What is GraphRAG and how does it differ from standard vector RAG?

Vector RAG embeds documents into a flat vector index and retrieves chunks by similarity. GraphRAG first extracts entities and relationships from your corpus, builds a knowledge graph, runs community detection (typically the Leiden algorithm), and generates hierarchical summaries. At query time it can answer global questions ('what are the main themes across this corpus?') that vector RAG cannot, because vector search has no concept of document-level structure or cross-document relationships. The trade-off is that GraphRAG indexing is 20-100x more expensive than embedding: every source chunk requires multiple LLM calls for entity extraction.

What GPU do I need for GraphRAG indexing at scale?

Entity extraction is the GPU bottleneck. The configuration in this guide uses a 70B-class LLM for extraction, which requires ~140GB VRAM in BF16 or ~70GB in FP8. A single H100 SXM5 (80GB) with FP8 quantization handles a 70B extraction model. For BF16 or for higher throughput, two H100s with tensor parallelism (tp=2) or a single H200 (141GB) are the right sizing. The embedding stage is much lighter: an L40S (48GB) or RTX PRO 6000 (96GB) is sufficient for BGE-large class embedding models.

Can I run GraphRAG indexing on spot instances?

Partially, depending on which GPU you use. GraphRAG indexing is batch-tolerant and checkpointed per document, so a spot interruption only loses the in-flight batch. H100 and H200 spot instances are not currently available on Spheron - those GPUs are on-demand only. If you use the A100 80GB PCIe for 8B-class extraction, spot pricing is available at $1.14/hr vs $1.69/hr on-demand, a meaningful saving for long batch runs. For 70B extraction on H100 or H200, plan for on-demand rates. Run retrieval and generation on on-demand instances to guarantee query availability.

How much does GraphRAG indexing cost per million tokens of source corpus?

Cost depends on extraction LLM, parallelism, and corpus density. As a baseline: Microsoft GraphRAG with the settings in this guide (Llama 3.1 70B FP8 extraction, 512-token chunks, 100-token overlap) runs roughly 4-6 LLM calls per chunk - about 2000-3000 input tokens per source chunk including system prompt and extracted context. At 4-6x token expansion, indexing 1M source tokens requires processing 4-6M LLM input tokens. On an H100 at roughly 3000 tok/sec throughput (70B FP8, vLLM), that is 22-33 minutes of GPU time, or $1.60-$2.41 at on-demand H100 rates ($4.34/hr). Community detection and summarization add another 30-50% on top.

Which graph database works best with GraphRAG on Spheron?

Neo4j is the most battle-tested option and has an official Python GraphRAG integration (neo4j-graphrag-python). Kuzu is a high-performance embedded graph database that is simpler to operate (no separate server process) and works well for single-node setups. For large-scale production with graph traversal queries across tens of millions of nodes, Neo4j's native Cypher engine is faster. For mid-scale deployments where operational simplicity matters more than query throughput, Kuzu requires less infrastructure. Neo4j runs in Docker on any Spheron CPU or GPU instance; Kuzu installs with pip and needs no container.