At 50 million vectors, the bottleneck in a production search pipeline is not query latency. It is index rebuild time. CPU HNSW on a 32-core machine takes 8-12 hours to rebuild a 50M-vector index. That window determines how stale your search results are. For the broader context on co-locating vector search with your LLM stack, the agentic RAG infrastructure guide covers GPU memory planning and stack colocation in depth.
NVIDIA cuVS changes the rebuild economics. CAGRA, the graph-based ANN algorithm inside cuVS, builds the same index in under 30 minutes on a single H100. The reason is straightforward: GPU parallelism across thousands of CUDA cores lets CAGRA parallelize the graph construction step that CPU HNSW must serialize node by node. This post covers what cuVS and CAGRA actually are, which databases have integrated them, how to run a CAGRA index build on a Spheron GPU instance, and when GPU indexing is worth the added infrastructure step.
What cuVS and CAGRA Are
cuVS (CUDA Vector Search) is NVIDIA's open-source library for GPU-accelerated approximate nearest neighbor algorithms. It ships under the RAPIDS umbrella and provides GPU implementations of several ANN algorithms, including brute-force flat search, IVF-PQ, and CAGRA. The library is the underlying engine that Faiss, Milvus, and OpenSearch call when GPU indexing is enabled.
CAGRA (CUDA ANN GRAph-based algorithm) is cuVS's graph-based ANN algorithm. The concept is similar to HNSW: build a graph where each vector points to its nearest neighbors, and at query time traverse that graph to find approximate nearest neighbors quickly. The difference is where the work happens. HNSW construction is inherently sequential: you insert nodes one at a time, each insertion requiring a graph traversal to find the right place. CAGRA builds the graph in parallel across CUDA cores, compressing the most expensive part of index construction from hours to minutes.
Key parameters:
- graph_degree: the number of neighbors each node points to. Equivalent to HNSW's M parameter. Higher values improve recall but increase VRAM and build time. 64 is a reasonable default.
- intermediate_graph_degree: affects build quality. Set to 2x graph_degree as a starting point (128 if graph_degree=64).
- build_algo: the construction algorithm. IVF_PQ is the lower-memory choice for large corpora that approach the VRAM limit during construction; NN_DESCENT is faster for smaller corpora.
CAGRA has no CPU fallback during construction. The entire dataset must fit in GPU VRAM. At query time, you can run CAGRA search on GPU for maximum throughput, or export the index to a CPU-readable HNSW format for lower-cost serving.
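To make the graph idea concrete, here is a minimal CPU sketch in NumPy. It is not how cuVS implements CAGRA: it builds an exact k-NN graph by brute force and then answers a query with a best-first traversal. The names `build_knn_graph`, `greedy_search`, `entry`, and `beam` are illustrative assumptions, not cuVS or Faiss APIs.

```python
import numpy as np


def build_knn_graph(vectors: np.ndarray, graph_degree: int) -> np.ndarray:
    """Exact k-NN graph by brute force: row i holds the ids of i's nearest neighbors."""
    sq = (vectors ** 2).sum(axis=1)
    # Squared L2 distance between every pair of vectors.
    dists = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
    np.fill_diagonal(dists, np.inf)  # a node is never its own neighbor
    return np.argsort(dists, axis=1)[:, :graph_degree]


def greedy_search(vectors, graph, query, k, entry=0, beam=32):
    """Best-first traversal: keep expanding the closest unexpanded candidate."""

    def dist(i):
        return float(((vectors[i] - query) ** 2).sum())

    visited = {entry}
    expanded = set()
    candidates = [(dist(entry), entry)]  # sorted by distance, at most `beam` long
    while True:
        frontier = [c for c in candidates if c[1] not in expanded]
        if not frontier:
            break  # every surviving candidate has been expanded
        _, node = frontier[0]
        expanded.add(node)
        for nb in graph[node]:
            nb = int(nb)
            if nb not in visited:
                visited.add(nb)
                candidates.append((dist(nb), nb))
        candidates = sorted(candidates)[:beam]
    return [i for _, i in candidates[:k]]
```

The brute-force graph build here is quadratic in the number of vectors, which is exactly the cost that parallel construction on GPU is meant to hide. A larger `beam` explores more of the graph per query, trading latency for recall.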
The 2026 Ecosystem: Where cuVS Is Integrated
| System | cuVS Integration | Build Acceleration | Notes |
|---|---|---|---|
| Faiss | faiss-gpu (CUDA build) | CAGRA build, GPU-flat search | Requires faiss-gpu-cu12 |
| Milvus 2.5/2.6 | Native GPU_CAGRA, GPU_IVF_PQ index types | 12x vs CPU HNSW at 10M vectors | GPU image: milvusdb/milvus:v2.5.27-gpu |
| OpenSearch 3.0 | GPU index build plugin | GPU-accelerated build, CPU serving | Available in nightly as of Q1 2026; check OpenSearch release notes for GA status |
| Elasticsearch | GPU build acceleration via cuVS (experimental) | Reduces force-merge latency (exact figures unverified; check Elastic release notes) | Requires ES Platinum/Enterprise tier |
| KDB.AI | Announced cuVS integration | In-process CAGRA for time-series + vector | GA roadmap TBD |
Qdrant ships its own GPU-accelerated HNSW indexing as of Qdrant 1.13, using its own acceleration path rather than cuVS. Weaviate remains CPU-only for ANN construction as of mid-2026. If your stack is Weaviate, the GPU accelerates embedding throughput (co-locate a TEI server on the same node) but not the index build itself. For full deployment of self-hosted embedding models on GPU, see self-hosted embeddings with TEI.
Why Index Build Time Is the Real Bottleneck at Scale
Query latency is already fast on CPU for most workloads. A 10M-vector HNSW index on a 32-core CPU returns results in 3-8ms p50, which is acceptable for most RAG and search applications. The problem is not search speed.
The problem is freshness. RAG pipelines and search indexes are not static. New documents arrive continuously. At weekly or daily re-indexing frequency, the gap between what the index contains and what the document store contains determines how stale your search results are. Most teams run full re-indexes rather than incremental builds because incremental HNSW insertion at scale is unreliable and support is fragmented across databases.
At 50M vectors on a 32-core CPU, HNSW build takes 8-12 hours. That is most of the day. At 50M vectors with CAGRA on an H100 SXM5, the same build takes under 30 minutes. That is a different architecture decision: a 30-minute rebuild window means you can run it twice a day without affecting search quality.
The query speed advantage of CAGRA over CPU HNSW (sub-1ms vs 3-8ms) matters for high-QPS workloads, but it is secondary. The primary reason to use GPU indexing is the build time reduction, and that pays off specifically for teams that need frequent re-indexing of large corpora.
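The link between build time and freshness can be put into rough numbers. The sketch below is a simplified model, assuming builds run back to back and that each build snapshots the corpus when it starts; both function names are illustrative.

```python
def max_rebuilds_per_day(build_hours: float) -> int:
    """How many full rebuilds fit into 24 hours if they run back to back."""
    return int(24 // build_hours)


def worst_case_staleness_hours(rebuild_interval_hours: float, build_hours: float) -> float:
    """A document arriving just after a snapshot waits one interval plus one build."""
    return rebuild_interval_hours + build_hours


# CPU HNSW at the midpoint of the 8-12 hour range vs a 30-minute CAGRA build
print(max_rebuilds_per_day(10), max_rebuilds_per_day(0.5))
print(worst_case_staleness_hours(24, 10), worst_case_staleness_hours(12, 0.5))
```

Under these assumptions, a nightly 10-hour CPU build leaves a new document unsearchable for up to 34 hours, while a twice-daily 30-minute GPU build caps that at 12.5 hours.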
Hardware Sizing: VRAM Per Billion Vectors
CAGRA requires the full dataset in GPU VRAM during construction. At float32 precision, 1M vectors at 1536 dimensions require approximately 6 GB of raw data; the CAGRA graph index adds 20-30% overhead on top.
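The sizing arithmetic is vectors x dimensions x bytes per value, plus the graph overhead. A small sketch, assuming the 20-30% overhead quoted above (25% as the default) and reporting in GiB; the function names and the 80 GiB default are illustrative.

```python
GIB = 1024 ** 3


def raw_gib(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector payload: 4 bytes per value for float32, 1 for int8."""
    return n_vectors * dim * bytes_per_value / GIB


def build_gib(n_vectors: int, dim: int, bytes_per_value: int = 4,
              graph_overhead: float = 0.25) -> float:
    """Raw payload plus the assumed CAGRA graph overhead."""
    return raw_gib(n_vectors, dim, bytes_per_value) * (1 + graph_overhead)


def fits_on_gpu(n_vectors: int, dim: int, vram_gib: float = 80.0, **kwargs) -> bool:
    """True if the estimated build footprint is below the given VRAM budget."""
    return build_gib(n_vectors, dim, **kwargs) < vram_gib


for n in (1_000_000, 10_000_000, 30_000_000, 100_000_000):
    print(f"{n:>11,} vectors: float32 {raw_gib(n, 1536):7.1f} GiB, "
          f"int8 {raw_gib(n, 1536, 1):6.1f} GiB, "
          f"fits on 80 GiB: {fits_on_gpu(n, 1536)}")
```

The table below rounds these figures, so expect the printed values to differ from it by a few percent. Note how little headroom 10M float32 vectors leave on an 80GB card once graph overhead is included.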
| Corpus Size | Dim | Raw float32 | int8 compressed | Recommended GPU | On-Demand | Spot |
|---|---|---|---|---|---|---|
| 1M vectors | 1536 | ~6 GB | ~1.5 GB | H100 SXM5 80GB | $2.54/hr | $2.91/hr* |
| 10M vectors | 1536 | ~58 GB | ~14 GB | H100 SXM5 80GB (float32) | $2.54/hr | $2.91/hr* |
| 30M vectors | 1536 | ~174 GB | ~44 GB | B200 SXM6 192GB | N/A | $5.34/hr† |
| 100M vectors | 1536 | ~580 GB | ~145 GB | Sharded H100 cluster | check pricing | N/A |
Pricing fluctuates based on GPU availability. The prices above are based on 29 Jun 2026 and may have changed. Check current GPU pricing for live rates.
For the B200 SXM6, only spot instances are currently available in the Spheron marketplace; on-demand pricing is not listed at the time of writing. The B200 spot rate is $5.34/hr per GPU (†spot instances can be reclaimed without notice). The H100 SXM5 on-demand rate ($2.54/hr) comes from an 8-GPU dedicated bundle. The H100 spot rate ($2.91/hr*) is currently elevated above the bundle on-demand rate due to constrained spot capacity for H100, so dedicated on-demand bundles are the more cost-effective choice for sustained index builds at this time.
For corpora beyond what fits on a single GPU, the practical path is either int8 quantization (reduces VRAM 4x with modest recall loss) or GPU-IVF-PQ (builds in blocks, does not require the full dataset in VRAM simultaneously).
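To show where the 4x figure comes from, here is a minimal sketch of symmetric per-dimension scalar quantization in NumPy. This is a generic scheme chosen for illustration; it is an assumption that your database's int8 path behaves similarly, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
import numpy as np


def quantize_int8(vectors: np.ndarray):
    """Map each dimension onto [-127, 127] using its largest absolute value."""
    scale = np.abs(vectors).max(axis=0) / 127.0
    scale[scale == 0] = 1.0  # avoid dividing by zero on constant-zero dimensions
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)


def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float32 vectors."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
vectors = rng.uniform(-1.0, 1.0, size=(1000, 64)).astype(np.float32)
q, scale = quantize_int8(vectors)
print(f"float32: {vectors.nbytes:,} bytes, int8: {q.nbytes:,} bytes")
print(f"max abs error: {np.abs(dequantize(q, scale) - vectors).max():.5f}")
```

Each value shrinks from 4 bytes to 1, and the reconstruction error per value is bounded by half a quantization step. How much recall that costs depends on the embedding model, so measure it on your own data.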
Step-by-Step: Build a CAGRA Index on a Spheron GPU
1. Provision and Connect
Rent an H100 SXM5 instance on Spheron via the console or CLI. For the trade-offs between spot and dedicated instances (preemption risk, SLA guarantees, and when each fits), see the Spheron instance types guide. SSH in and verify the GPU is visible:
# Confirm GPU availability
nvidia-smi
# Install dependencies (CUDA 12 environment)
pip install 'faiss-gpu-cu12>=1.10.0' numpy h5py
Pin faiss to 1.10.0 or later. GpuIndexCagra and the CAGRA-to-HNSW conversion were introduced in Faiss 1.10.0; earlier versions will fail with an AttributeError on these classes. The 1.11.x series is current as of mid-2026.
2. Build with Faiss-GPU (CAGRA Backend)
import faiss
import numpy as np
import time
DIM = 1536
N = 10_000_000
# Load vectors (replace with your actual data source).
# Generating float32 directly avoids a temporary float64 copy of the whole array.
vectors = np.random.default_rng(0).random((N, DIM), dtype=np.float32)
# Initialize GPU resources
res = faiss.StandardGpuResources()
# Configure CAGRA index
config = faiss.GpuIndexCagraConfig()
config.intermediate_graph_degree = 128
config.graph_degree = 64
config.build_algo = faiss.graph_build_algo_IVF_PQ  # for large corpora
# Build the index on GPU 0. GpuIndexCagra constructs the graph in train(); add() is not supported.
index = faiss.GpuIndexCagra(res, DIM, faiss.METRIC_INNER_PRODUCT, config)
t0 = time.time()
index.train(vectors)
elapsed = time.time() - t0
print(f"Built {N:,} vectors in {elapsed:.1f}s ({N/elapsed:,.0f} vectors/sec)")
# Serialize as CPU index for portability
cpu_index = faiss.index_gpu_to_cpu(index)
faiss.write_index(cpu_index, "cagra.index")
On an H100 SXM5 with 10M float32 vectors at 1536 dimensions, expect roughly 45-60 seconds for the build. The throughput should read somewhere around 165,000-220,000 vectors per second.
3. Build with Milvus GPU_CAGRA
Use the GPU-specific Milvus Docker image. The plain milvusdb/milvus image does not include the CUDA runtime required for CAGRA.
# docker-compose.yml (GPU Milvus stack)
version: '3.8'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.18
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
    command: etcd -advertise-client-urls=http://etcd:2379 -listen-client-urls http://0.0.0.0:2379
  minio:
    image: minio/minio:RELEASE.2024-05-28T17-19-04Z
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: minio server /minio_data
  milvus:
    image: milvusdb/milvus:v2.5.27-gpu
    runtime: nvidia
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      KNOWHERE_GPU_MEM_POOL_SIZE: 4096
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      MINIO_ACCESS_KEY_ID: minioadmin
      MINIO_SECRET_ACCESS_KEY: minioadmin
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
    depends_on:
      - etcd
      - minio
Then create the collection and build the CAGRA index:
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType, utility
import numpy as np
connections.connect(host="localhost", port="19530")
N, DIM = 10_000_000, 1536
vectors = np.random.rand(N, DIM).astype("float32")
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields=fields, description="CAGRA vector index")
col = Collection("docs", schema)
# Insert in batches
BATCH_SIZE = 10_000
for i in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[i:i + BATCH_SIZE].tolist()
    col.insert([batch])
col.flush()  # seal growing segments before indexing
# Build GPU_CAGRA index
index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "IP",
    "params": {
        "intermediate_graph_degree": 128,
        "graph_degree": 64,
        "build_algo": "IVF_PQ",
    },
}
col.create_index("embedding", index_params)
col.load()
# Query with CAGRA search params
search_params = {"metric_type": "IP", "params": {"itopk_size": 128, "search_width": 4}}
results = col.search([[0.1] * 1536], "embedding", search_params, limit=10)
Monitor GPU utilization during the build step with nvidia-smi -l 1. You should see utilization above 80% during construction.
4. Export for CPU Serving
Once the index is built on GPU, convert it to a format that CPU hosts can serve without GPU dependencies:
# Faiss: convert CAGRA graph to HNSW for CPU serving using IndexHNSWCagra.
cpu_index = faiss.IndexHNSWCagra()
cpu_index.base_level_only = True
index.copyTo(cpu_index) # transfers CAGRA graph into HNSW base layer
faiss.write_index(cpu_index, "serving.index")
In Milvus 2.6, the hybrid mode handles this automatically. Set adapt_for_cpu: true under knowhere.GPU_CAGRA.load in milvus.yaml and Milvus will build on GPU but route queries through CPU at serve time. The build_algo parameter only controls the graph construction algorithm (IVF_PQ vs NN_DESCENT) and does not affect the serving path. For Faiss, transfer serving.index to your CPU serving host via scp or object storage, then load it with:
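# On the CPU serving host (no GPU required)
import faiss
import numpy as np
index = faiss.read_index("serving.index")
# Placeholder queries; replace with real embeddings of shape (n_queries, 1536), dtype float32
query_vectors = np.random.rand(4, 1536).astype("float32")
D, I = index.search(query_vectors, k=10)
Benchmark Walkthrough: 12x Faster Indexing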
The following numbers come from the NVIDIA cuVS repository and Milvus 2.5 documentation. Actual results vary based on dataset characteristics, vector dimension, hardware configuration, and the specific CAGRA parameter settings used.
| GPU | Dataset | Index Type | Build Time | Build Speedup | p50 Query | Recall@10 |
|---|---|---|---|---|---|---|
| H100 SXM5 | 10M x 1536 float32 | CAGRA | ~45s | 30x vs CPU HNSW | 0.6ms | 0.97 |
| CPU 32-core | 10M x 1536 float32 | HNSW | ~22 min | 1x baseline | 3.8ms | 0.95 |
| CPU 32-core | 10M x 1536 int8 | HNSW | ~14 min | 1x baseline | 4.1ms | 0.94 |
| Elasticsearch (GPU build, experimental) | Force-merge benchmark | cuVS | N/A | Reduced merge latency vs CPU (exact figures unverified) | N/A | N/A |
The table's "30x" is a derived figure from the test configuration above (22 min CPU baseline vs ~45s CAGRA build). The "12x faster" headline in the post title comes from Milvus 2.5 documentation, which uses a different CPU server baseline than the 32-core machine in the table. Both figures apply to the 10M float32 vector scale; the different CPU hardware baselines produce different ratios. The Milvus 12x figure is the officially cited number; the 30x figure reflects the specific hardware configuration in the table. At different scales, the speedup also varies: smaller datasets show lower relative speedup (shorter absolute times, less opportunity for GPU parallelism), and larger datasets can show higher relative speedup.
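Both derived columns in the table can be reproduced with a few lines. The sketch below recomputes the build speedup from the table's timings and shows one common way to compute Recall@k against exact neighbors; `speedup` and `recall_at_k` are illustrative helpers, and the id lists are made-up toy data, not benchmark output.

```python
import numpy as np


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of baseline build time to accelerated build time."""
    return baseline_seconds / accelerated_seconds


def recall_at_k(approx_ids, exact_ids) -> float:
    """Mean fraction of the exact top-k neighbors that the approximate search returned."""
    approx_ids = np.asarray(approx_ids)
    exact_ids = np.asarray(exact_ids)
    k = exact_ids.shape[1]
    hits = sum(len(set(a) & set(e))
               for a, e in zip(approx_ids.tolist(), exact_ids.tolist()))
    return hits / (len(exact_ids) * k)


# 22 minutes on CPU vs roughly 45 seconds on GPU, as in the table
print(f"build speedup: {speedup(22 * 60, 45):.1f}x")

# Toy example: two queries, k=3, one missed neighbor in the first query
print(f"recall@3: {recall_at_k([[1, 2, 3], [4, 5, 6]], [[1, 2, 9], [4, 5, 6]]):.3f}")
```

The ratio comes out slightly under 30, which the table rounds up. To measure recall on your own data, take `exact_ids` from a brute-force (flat) index over a sample of queries and `approx_ids` from the CAGRA or HNSW index under test.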
Cost Model: Burst GPU vs Always-On
There are two ways to approach GPU indexing from a cost perspective.
Burst model (recommended for most teams): Provision an H100 or spot GPU only during the re-indexing window, then terminate the instance. A nightly re-index of 50M vectors with CAGRA completes in under 30 minutes. At $2.54/hr on-demand for the H100 SXM5, that's roughly $1.27 per nightly rebuild. Monthly cost: under $40. The instance is terminated after the index is pushed to object storage and hot-swapped into the serving layer.
Always-on GPU model: Necessary only if you need sub-1ms query latency at high QPS (5,000+ queries per second) and can justify a dedicated GPU instance for serving. Most RAG workloads are fine with CPU serving after a CAGRA-built index; the H100 SXM5 is only needed during the build window.
| Rebuild Frequency | Corpus | GPU Time / Build | Monthly GPU Spend | Notes |
|---|---|---|---|---|
| Nightly | 10M vectors | ~5 min | ~$6.35 | H100 SXM5 on-demand at $2.54/hr, 30 builds |
| Nightly | 50M vectors | ~25 min | ~$31.75 | H100 SXM5 on-demand, 30 builds |
| Weekly | 50M vectors | ~25 min | ~$4.23 | H100 SXM5 on-demand, 4 builds/month |
| CI/CD triggered | Any | Minutes | Pay-per-build | Terminate immediately after upload |
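The monthly figures in the table follow from one formula: GPU minutes per build, converted to hours, times the hourly rate, times the number of builds. A minimal sketch, assuming the $2.54/hr H100 SXM5 on-demand rate and ignoring storage, egress, and any billing granularity; `monthly_gpu_spend` is an illustrative name.

```python
def monthly_gpu_spend(minutes_per_build: float, builds_per_month: int,
                      hourly_rate: float = 2.54) -> float:
    """GPU cost per month for a burst rebuild schedule, in dollars."""
    return minutes_per_build / 60 * hourly_rate * builds_per_month


for label, minutes, builds in [
    ("Nightly, 10M vectors", 5, 30),
    ("Nightly, 50M vectors", 25, 30),
    ("Weekly, 50M vectors", 25, 4),
]:
    print(f"{label}: ${monthly_gpu_spend(minutes, builds):.2f}")
```

Swap in the live rate before budgeting, and add provisioning and upload time to `minutes_per_build` if the instance stays up while the index is transferred.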
For spot pricing strategy on batch GPU workloads, see spot instance bidding strategies for how to structure preemption-tolerant pipelines. The broader picture of GPU cost levers is in the GPU cost optimization playbook.
Integration Recipes
Faiss Pipeline (Standalone)
Full end-to-end pipeline: build on GPU, export for CPU serving.
import faiss
import numpy as np
import h5py
import time
# 1. Load corpus vectors
with h5py.File("corpus.h5", "r") as f:
    vectors = f["embeddings"][:]  # shape: (N, DIM), dtype float32
N, DIM = vectors.shape
print(f"Loaded {N:,} vectors, dim={DIM}")
# 2. Build CAGRA on GPU (GpuIndexCagra constructs the graph in train(); add() is not supported)
res = faiss.StandardGpuResources()
config = faiss.GpuIndexCagraConfig()
config.intermediate_graph_degree = 128
config.graph_degree = 64
config.build_algo = faiss.graph_build_algo_IVF_PQ
gpu_index = faiss.GpuIndexCagra(res, DIM, faiss.METRIC_INNER_PRODUCT, config)
t0 = time.time()
gpu_index.train(vectors)
print(f"CAGRA build: {time.time() - t0:.1f}s")
# 3. Convert to HNSW for CPU serving
cpu_index = faiss.IndexHNSWCagra()
cpu_index.base_level_only = True
gpu_index.copyTo(cpu_index)
# 4. Save index
faiss.write_index(cpu_index, "serving.index")
print("Index saved to serving.index")
On the CPU serving host (no GPU required):
import faiss
import numpy as np
index = faiss.read_index("serving.index")
# Set HNSW search parameters
faiss.downcast_index(index).hnsw.efSearch = 128
query = np.random.rand(1, 1536).astype("float32")
D, I = index.search(query, k=10)
print("Top-10 neighbors:", I[0])
Milvus 2.5 Pipeline
Using pymilvus for a full insert-index-query workflow. For full Milvus production deployment including compose file, sharding, and replica configuration, see the self-hosted vector database deployment guide.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType
import numpy as np
connections.connect(host="localhost", port="19530")
# Schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields=fields)
col = Collection("documents", schema)
# Batch insert (10k chunks to avoid memory pressure)
DIM = 1536
N = 1_000_000
BATCH = 10_000
for i in range(0, N, BATCH):
    batch_size = min(BATCH, N - i)
    texts = [f"doc_{i+j}" for j in range(batch_size)]
    embeddings = np.random.rand(batch_size, DIM).astype("float32").tolist()
    col.insert([texts, embeddings])
col.flush()  # seal growing segments before indexing
# Build GPU_CAGRA index
col.create_index(
    "embedding",
    {
        "index_type": "GPU_CAGRA",
        "metric_type": "IP",
        "params": {"intermediate_graph_degree": 128, "graph_degree": 64, "build_algo": "IVF_PQ"},
    },
)
col.load()
print("Index built and loaded")
# Search
search_params = {"metric_type": "IP", "params": {"itopk_size": 128, "search_width": 4}}
query = np.random.rand(1, DIM).astype("float32").tolist()
results = col.search(query, "embedding", search_params, limit=10, output_fields=["text"])
for hit in results[0]:
    print(f" id={hit.id}, score={hit.distance:.4f}")
OpenSearch 3.0 / Elasticsearch 9.3 (Brief)
For OpenSearch 3.0, the GPU index build plugin hooks into k-NN index creation. Add "engine": "faiss" and "space_type": "l2" to the k-NN field mapping, and the GPU plugin handles acceleration transparently on GPU-equipped nodes. The plugin is available in OpenSearch 3.0 nightly builds as of Q1 2026; check the OpenSearch release notes for GA status before using in production.
For Elasticsearch, GPU build acceleration targets the force-merge step that consolidates HNSW segments, running segment merge on GPU instead of CPU via the cuVS Java API. This is experimental as of mid-2026; specific version numbers and latency figures vary by release. Check the Elastic release notes for current status and tier requirements before using in production.
When GPU Indexing Is NOT Worth It
GPU indexing adds operational complexity: you need to provision a GPU instance, manage the build pipeline, transfer the index file, and terminate the instance. That overhead is worth it at the right scale. It is not worth it when:
- Your corpus is under 5M vectors and CPU HNSW builds in under 10 minutes anyway.
- Re-indexing frequency is weekly or lower, and the CPU build window already fits comfortably.
- Your team does not have a pipeline to provision and terminate GPU instances automatically. Manual GPU management for a nightly job is not a sustainable workflow.
- You are using Qdrant or Weaviate. Neither supports cuVS as of mid-2026, so CAGRA is not an option regardless of your hardware. Qdrant has its own GPU-accelerated HNSW indexing path (1.13+); Weaviate builds on CPU only.
- Your vectors are already int8-quantized and the compressed HNSW index fits comfortably within CPU build time targets.
For GraphRAG pipelines that co-locate a vector index alongside entity extraction, rebuilding the index on GPU is faster, but the entity extraction LLM is usually the primary GPU cost. The GraphRAG deployment guide covers GPU budget allocation for those workflows.
The honest summary: GPU indexing is a build-time optimization for large, frequently updated corpora. If your corpus is static or small, stick with CPU HNSW. If you are running nightly re-indexes of 10M+ vectors and the build window is already tight, CAGRA on a short-lived H100 instance is a straightforward fix.
Running CAGRA index builds on GPU for 10M+ vector corpora cuts nightly re-indexing from hours to minutes. Spheron provides on-demand and spot H100 instances where you can spin up, build, and tear down, paying only for the build window.
Quick Setup Guide
Log into app.spheron.ai and select an on-demand or spot GPU instance. H100 SXM5 80GB for large corpora (10M+ float32 vectors at 1536 dimensions). B200 192GB for 30M+ vector builds in a single pass. Choose on-demand for production rebuild pipelines where predictable scheduling matters, or spot for lower priority or development builds. SSH in and verify: nvidia-smi.
Install via pip (CUDA 12 build): pip install 'faiss-gpu-cu12>=1.10.0' cupy-cuda12x. GpuIndexCagra was introduced in Faiss 1.10.0; the 1.11.x series is current. To use the cuVS Python API directly, install the cuVS wheel: pip install cuvs-cu12 --extra-index-url=https://pypi.nvidia.com. Confirm the GPU backend is available: python -c 'import faiss; print(faiss.get_compile_options())' - the output should include GPU, and cuVS-enabled builds also report a cuVS marker such as NVIDIA_CUVS.
Load your float32 vectors into a NumPy array. Create a GPU resource: res = faiss.StandardGpuResources(). Build a flat GPU index as a reference, or build CAGRA directly using GpuIndexCagraConfig with graph_degree=64 and intermediate_graph_degree=128. Build the graph with gpu_index.train(vectors) and time it with time.time(). Save with faiss.write_index(faiss.index_gpu_to_cpu(gpu_index), 'cagra.index').
Use the GPU-enabled Milvus image: milvusdb/milvus:v2.5.27-gpu (not the plain milvus image). Create a collection and set index type to GPU_CAGRA: index_params = {'index_type': 'GPU_CAGRA', 'metric_type': 'IP', 'params': {'intermediate_graph_degree': 128, 'graph_degree': 64, 'build_algo': 'IVF_PQ'}}. Insert vectors and call collection.create_index(). Monitor build with nvidia-smi during the build step. At query time set itopk_size=128 and search_width=4 for balanced recall/latency.
In Faiss, convert CAGRA to HNSW for CPU serving using the IndexHNSWCagra class: create cpu_index = faiss.IndexHNSWCagra(), set cpu_index.base_level_only = True, then call gpu_index.copyTo(cpu_index). Save with faiss.write_index(cpu_index, 'serving.index'). In Milvus 2.6, enable CPU serving by setting adapt_for_cpu: true under knowhere.GPU_CAGRA.load in milvus.yaml (note: build_algo only controls graph construction, not the serving path). Transfer the index file to a CPU serving host via scp or object storage. Stop the GPU instance once the transfer is complete.
Write a Python script that: (1) pulls new vectors from your document store or pipeline, (2) provisions a Spheron GPU instance via the Spheron API or CLI, (3) runs the CAGRA build, (4) uploads the index to S3-compatible object storage, (5) signals the serving layer to hot-swap the index, (6) terminates the GPU instance. Schedule with cron or Airflow. Use spot pricing for development runs and on-demand for production nightly jobs where completion time is guaranteed.
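One way to structure that script is a small orchestrator that takes each step as a callable and always terminates the instance, even when the build fails. This is a sketch under stated assumptions: every callable here is a hypothetical placeholder you would implement against your own document store, the Spheron API or CLI, and your object storage; no real Spheron API calls are shown.

```python
from typing import Any, Callable


def run_rebuild(
    pull_vectors: Callable[[], Any],       # (1) fetch vectors from the document store
    provision: Callable[[], Any],          # (2) start a GPU instance, return a handle
    build: Callable[[Any, Any], str],      # (3) run the CAGRA build, return the index path
    upload: Callable[[str], str],          # (4) push the index to object storage, return its URI
    hot_swap: Callable[[str], None],       # (5) tell the serving layer to load the new index
    terminate: Callable[[Any], None],      # (6) stop the GPU instance
) -> str:
    """Run one rebuild cycle and return the URI of the uploaded index."""
    vectors = pull_vectors()
    instance = provision()
    try:
        index_path = build(instance, vectors)
        index_uri = upload(index_path)
        hot_swap(index_uri)
        return index_uri
    finally:
        # Runs on success and on failure, so a crashed build never leaves a GPU billing.
        terminate(instance)
```

Passing the steps in as callables keeps the control flow testable without a GPU: swap in fakes for a dry run, then wire in the real implementations for the scheduled job.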
Frequently Asked Questions
cuVS (CUDA Vector Search) is NVIDIA's open-source library for GPU-accelerated approximate nearest neighbor (ANN) algorithms. CAGRA (CUDA ANN GRAph-based algorithm) is its graph-based ANN algorithm for index construction and search on CUDA. CAGRA builds a compressed graph of proximity neighbors entirely on GPU, making index construction 10-30x faster than CPU HNSW for the same recall targets. cuVS is the backend powering GPU indexing in Faiss, Milvus 2.5+, and OpenSearch 3.0+.
As of mid-2026, cuVS is integrated into four major ecosystems: Faiss (via the faiss-gpu build), Milvus 2.5/2.6 (native CAGRA and GPU-IVF-PQ index types), OpenSearch 3.0 (GPU index build plugin), and Elasticsearch (GPU build acceleration in the vector search tier, experimental). KDB.AI has announced a cuVS integration for its in-process vector store. Qdrant ships its own GPU-accelerated HNSW indexing (Qdrant 1.13+, not cuVS-based). Weaviate remains CPU-only for ANN construction as of this writing.
CAGRA requires the full vector dataset to fit in GPU VRAM during index construction. At 1536-dimensional float32, 1 billion vectors require about 6 TB of raw data - far beyond a single GPU. In practice, single-GPU capacity depends on dimension and precision: an H100 SXM5 with 80GB fits roughly 13M float32 1536-dim vectors before CAGRA graph overhead, and the 100M-200M vectors per GPU range is reachable only with lower-dimensional or quantized vectors. Use int8 quantization or GPU-IVF-PQ for larger corpora. For billion-scale datasets, the practical path is DiskANN (SSD-backed) or sharded CAGRA across multiple GPUs.
Building on GPU and serving on CPU is the recommended pattern for cost-sensitive workloads. Faiss supports exporting a CAGRA-built graph to a CPU-readable HNSW layout using the IndexHNSWCagra class: create cpu_index = faiss.IndexHNSWCagra(), set cpu_index.base_level_only = True, then call gpu_index.copyTo(cpu_index) to copy the CAGRA graph into the HNSW base layer. Milvus 2.6's hybrid CAGRA mode builds on GPU and serves on CPU automatically. The workflow is: rent a GPU for the nightly or weekly re-index job, build the CAGRA index, export to CPU-compatible format, drop the GPU instance, and serve from CPU. GPU costs are only incurred during the build window.
For a corpus of around 1M vectors, GPU indexing is probably not worth it. CPU HNSW on a modern 32-core server builds a 1M x 768-dimension index in under 3 minutes. CAGRA on an H100 takes roughly 8 seconds for the same task - a big relative speedup, but the absolute time difference is small enough that the GPU instance cost does not break even unless you are rebuilding several times per day. The crossover point where GPU indexing pays off is roughly 5-10M vectors, where CPU HNSW build times exceed 30-60 minutes and nightly re-indexing windows become tight.
