Tutorial

Self-Host Vector Databases on GPU Cloud: Qdrant, Milvus, and Weaviate Production Deployment (2026)

Written by Mitrasish, Co-founder | Apr 29, 2026
Tags: Self-Host Vector Database GPU, Qdrant vs Milvus vs Weaviate, Qdrant Milvus Weaviate Deployment, GPU Accelerated Vector Search, Vector Database GPU Cloud, Milvus CAGRA, HNSW Tuning, Milvus 2.5, Open Source Vector Database, DiskANN, GPU Cloud, Qdrant Deployment, Weaviate Deployment, RAG Pipeline GPU

Most production RAG pipelines split their vector database and LLM inference across two separate services. Every query crosses two network boundaries: one to the vector database (30-250ms p99) and one to the LLM API (100-600ms p99). Self-hosting both on one GPU node removes those boundaries entirely. The RAG pipeline bare metal case study shows a concrete example: one team cut p99 latency from 1.8 seconds to 190ms by co-locating all three components on the same GPU server. For the broader infrastructure context, the agentic RAG GPU infrastructure guide covers GPU memory planning and stack co-location in depth.

This post covers production deployment of three open-source vector databases: Qdrant, Milvus 2.5 (with NVIDIA CAGRA), and Weaviate on Spheron GPU cloud. Topics include CAGRA configuration, HNSW tuning parameters, sharding and replica strategy, co-location with vLLM, and cost economics.

Why Vector Search Needs GPUs in 2026

There are three distinct ways GPUs accelerate vector search workflows. They apply to different databases and different parts of the pipeline.

GPU Vectorization

Embedding throughput on GPU is not a minor improvement. BGE-M3 on an A100 processes roughly 60,000 tokens per second (per TEI benchmark data); H100 SXM5 exceeds this. On CPU, the same model tops out at around 600 tokens/sec. If your RAG pipeline re-encodes queries at every retrieval step, CPU-based embedding is a hard bottleneck at any meaningful concurrency. This benefit applies to all three databases: Qdrant, Milvus, and Weaviate all accept pre-computed vectors, so you can co-locate any GPU embedding server alongside them. For full deployment details on self-hosting embedding models, see self-hosted embeddings with TEI.
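
To make this concrete, here is a minimal sketch of calling a co-located TEI server's /embed route. It assumes the TEI container from the docker-compose examples later in this post, serving BGE-M3 on localhost:8080.

python
import requests

# Batch-embed on the local GPU via TEI's /embed route.
# Assumes the TEI container defined later in this post (localhost:8080).
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is CAGRA?", "How does HNSW build its graph?"]},
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one vector per input; BGE-M3 emits 1024 dimensions
print(len(embeddings), len(embeddings[0]))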

GPU-Accelerated ANN Indexing: CAGRA

NVIDIA CAGRA (CUDA ANNS GRAph-based) is a graph-based approximate nearest neighbor algorithm that runs entirely on CUDA. Index build on GPU is dramatically faster than CPU HNSW. For a 10M-vector, 1536-dimension dataset, CAGRA on an H100 SXM5 builds the index in roughly 45 seconds. CPU HNSW takes 18-22 minutes for the same corpus. At query time, CAGRA returns results in under 1ms on GPU versus 3-8ms on CPU HNSW. These numbers come from the NVIDIA cuVS benchmarks and the Milvus 2.5 documentation. Milvus is currently the only open-source vector database with native CAGRA support.

GPU-IVF and GPU-Flat Search

For smaller corpora or situations where exact search is required, GPU-flat (brute-force) search on an H100 returns results in about 2ms for 10M vectors. CPU takes around 150ms for the same operation. GPU-flat is useful for generating recall ground truth and for small corpora where approximate search introduces unacceptable recall loss.
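
In Milvus this is exposed as the GPU_BRUTE_FORCE index type. A hedged sketch for generating recall ground truth, reusing the collection and query_embedding from the Milvus section later in this post:

python
# Exact GPU search for recall ground truth. Assumes the GPU-enabled
# Milvus image and the "documents" collection from the Milvus section.
index_params = {
    "index_type": "GPU_BRUTE_FORCE",  # exact search, no graph build
    "metric_type": "IP",
    "params": {},
}
collection.create_index("embedding", index_params)
collection.load()

ground_truth = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {}},
    limit=100,  # deep top-k to evaluate recall@k of approximate indexes
)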

CAGRA vs HNSW vs DiskANN

HNSW is the standard CPU graph-based ANN algorithm. Qdrant, Weaviate, and Milvus all use HNSW as their CPU path. It performs well for datasets under 50M vectors and has broad tooling support. Build time scales roughly O(n log n), which becomes painful above 50M vectors.

CAGRA runs on GPU and is Milvus-specific. It builds the same style of graph as HNSW but on CUDA, 10-50x faster. Queries run on GPU with sub-millisecond p50 latency. The constraint is that the entire index must fit in GPU VRAM. At 1536-dimension float32 vectors, 10M vectors require about 58GB of VRAM for the raw data; the CAGRA graph index adds 20-30% overhead on top of that. This makes CAGRA practical for corpora up to roughly 50M vectors on a single GPU, and larger corpora with multi-GPU sharding.

DiskANN is an NVMe-backed algorithm where the index graph lives on SSD rather than RAM or VRAM. Query latency is higher than CAGRA or in-memory HNSW (5-20ms p50) but memory requirements are orders of magnitude smaller. Milvus 2.5 supports DiskANN natively, making it the practical path for billion-scale corpora that cannot fit in GPU memory.
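
In Milvus, DiskANN is its own index type. A hedged sketch, again reusing the collection from the Milvus section below; note that DiskANN must be enabled on the query nodes and backed by local NVMe:

python
# DiskANN index: the graph lives on NVMe, so memory stays small even at
# billion scale. search_list (>= limit) is the recall/latency knob.
index_params = {
    "index_type": "DISKANN",
    "metric_type": "IP",
    "params": {},
}
collection.create_index("embedding", index_params)
collection.load()

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"search_list": 100}},
    limit=10,
)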

Choosing Between Qdrant, Milvus, and Weaviate

Feature | Qdrant | Milvus 2.5 | Weaviate
GPU ANN index | No (CPU HNSW only) | Yes (CAGRA, GPU-IVF) | No (CPU HNSW/flat)
Built-in vectorizer | No | No (external) | Yes (GPU modules)
Payload filtering | Excellent | Good | Good
Sharding model | Automatic, custom | Manual shards | Multi-tenancy/shards
Kubernetes operator | Community | Official | Official
License | Apache 2.0 | Apache 2.0 | BSD-3
Server VRAM footprint | ~200MB | ~2-4GB | ~500MB
Recommended tier | RTX PRO 6000 / L40S | H100 SXM5 | L40S / RTX PRO 6000

The decision rule is straightforward. If you need GPU-accelerated ANN search and your corpus is between 1M and 100M vectors, use Milvus with CAGRA. If you need strong payload filtering, simple operations, and don't need GPU ANN search, use Qdrant. If you want GPU vectorization handled by the database itself without managing a separate embedding server, use Weaviate with the text2vec-transformers module.

GPU and VRAM Sizing for Vector Search Workloads

Corpus Size | Dimensions | Raw float32 | PQ-compressed | Recommended GPU | On-Demand Price
1M vectors | 1536 | ~6 GB | ~0.8 GB | L40S 48GB | $1.80/hr
10M vectors | 1536 | ~58 GB | ~8 GB | RTX PRO 6000 96GB | $1.70/hr
100M vectors | 1536 | ~580 GB | ~80 GB | 8x H100 SXM5 sharded | ~$23.20/hr
1B vectors | 768 | ~3 TB | ~370 GB (PQ) | DiskANN + NVMe | N/A (disk-backed)

The 100M+ vector case requires either product quantization (which drops recall by 2-5%) or multi-GPU sharding. For billion-scale corpora, DiskANN on NVMe with GPU-cached index graph layers is the practical path. The GPU memory requirements guide covers VRAM math in depth for LLM co-location planning alongside these numbers.
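
The arithmetic behind the table is simple enough to script. A small sketch; the flat 25% graph overhead is an assumption drawn from the 20-30% CAGRA range quoted earlier:

python
# VRAM estimate in GiB for an in-GPU vector index.
def index_vram_gib(num_vectors: int, dim: int,
                   bytes_per_component: int = 4,       # 4 = float32, 1 = int8
                   graph_overhead: float = 0.25) -> float:  # assumed CAGRA overhead
    raw = num_vectors * dim * bytes_per_component
    return raw * (1 + graph_overhead) / 2**30

print(index_vram_gib(10_000_000, 1536))                         # ~71.5 GiB: 96GB-class GPU
print(index_vram_gib(10_000_000, 1536, bytes_per_component=1))  # ~17.9 GiB: fits an L40S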

Pricing fluctuates based on GPU availability. The prices above are based on 29 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Qdrant Production Deployment on GPU Cloud

Qdrant uses CPU-based HNSW for all ANN search. The GPU benefit for Qdrant comes from co-locating a GPU embedding server on the same node. The GPU handles vectorization; Qdrant handles storage and retrieval.

Docker Deployment and Persistent Storage

bash
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  --restart unless-stopped \
  qdrant/qdrant

Verify the node is healthy:

bash
curl localhost:6333/healthz

HNSW Tuning for Production

Three parameters control HNSW behavior in Qdrant:

  • m (graph connectivity): 16-32 for production. Higher m means more edges per node and better recall, but a larger index: each edge is stored as a 4-byte ID, so link storage grows by roughly m * 4 bytes per vector across the collection.
  • ef_construct (build recall): 100-400. Controls how many candidates are evaluated during index construction. Higher values improve recall at the cost of build time.
  • hnsw_ef (query recall): 64-512. Set at query time via search params. This is the main knob for the recall/latency tradeoff.

Configuration | m | ef_construct | hnsw_ef | Use case
Throughput-optimized | 16 | 100 | 64 | High QPS, recall > 0.93 acceptable
Balanced | 16 | 200 | 128 | General production
Recall-optimized | 32 | 400 | 256 | When recall > 0.99 is required

Create a collection with production HNSW settings:

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, OptimizersConfigDiff

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200, on_disk=False),
    optimizers_config=OptimizersConfigDiff(memmap_threshold=100000)
)
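
hnsw_ef is then set per request. A hedged sketch; query_vector is assumed to come from your embedding server:

python
from qdrant_client import models

hits = client.search(
    collection_name="documents",
    query_vector=query_vector,  # assumed: a 1536-dim vector from your embedder
    limit=10,
    search_params=models.SearchParams(hnsw_ef=128),  # the "Balanced" profile above
)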

Scalar and Binary Quantization

Enable scalar quantization to reduce memory usage by 4x with minimal recall loss:

python
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
    )
)

With rescore=true in search params, recall loss is typically 0-3%. Binary quantization gives 32x memory reduction at higher recall loss (2-8%), suitable for very large corpora where approximate results are acceptable.
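
A hedged sketch of a rescored query against the quantized collection; oversampling fetches extra int8 candidates before the full-precision rescore pass:

python
from qdrant_client import models

hits = client.search(
    collection_name="documents_quantized",
    query_vector=query_vector,
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the quantized index for candidate generation
            rescore=True,      # re-rank candidates with original float32 vectors
            oversampling=2.0,  # fetch 2x candidates before rescoring
        )
    ),
)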

Sharding for Large Collections

Qdrant supports distributed mode with configurable shards. Create a collection with explicit sharding:

python
client.create_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    shard_number=4,           # 4 shards for 10M+ vector collections
    replication_factor=2      # 2 replicas for production availability
)

Rule of thumb: 1 shard per 5M vectors. For collections above 10M vectors, use Qdrant's distributed mode with at least 2 nodes.

Production docker-compose with TEI Co-location

yaml
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./qdrant_storage:/qdrant/storage
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 32G

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    command: --model-id BAAI/bge-m3 --port 80
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
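
Wiring the two services together is a loopback round trip: embed on the GPU via TEI, then upsert into Qdrant. A sketch, assuming the collection was created with size=1024 to match BGE-M3's output dimension:

python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)
docs = ["first document", "second document"]

# Embed on the local GPU (TEI container from the compose file above)
vectors = requests.post(
    "http://localhost:8080/embed", json={"inputs": docs}, timeout=60
).json()

# Store vectors plus payload in Qdrant; the collection dimension must
# match the embedder (1024 for BGE-M3)
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=vec, payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(docs, vectors))
    ],
)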

Milvus 2.5 with NVIDIA CAGRA

This guide targets Milvus 2.5 because it preserves the pure-GPU CAGRA query path. Milvus 2.6 introduced a hybrid CAGRA architecture (GPU for graph construction, CPU for query serving) which scales differently and may be preferable if you have spare CPU cores but limited GPU VRAM. The 2.5 examples below port directly to 2.6 with the same parameters.

Milvus is the only open-source vector database with native GPU ANN support, via NVIDIA cuVS. CAGRA is the graph-based algorithm; GPU-IVF-PQ is the quantization variant for larger datasets.

Prerequisites

  • CUDA 12.x and NVIDIA drivers 535 or later
  • The GPU-enabled Milvus image: milvusdb/milvus:v2.5.27-gpu (not milvusdb/milvus:v2.5)

Using the non-GPU image causes CAGRA to silently fall back to CPU HNSW with no error. Confirm you have the right image with:

bash
docker run --rm milvusdb/milvus:v2.5.27-gpu cat /milvus/configs/milvus.yaml | grep -i cuda

docker-compose Configuration

yaml
version: "3.8"
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    volumes:
      - etcd:/etcd

  minio:
    image: minio/minio:RELEASE.2023-03-13T19-46-17Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    command: minio server /minio_data --console-address ":9001"
    volumes:
      - minio:/minio_data

  milvus:
    image: milvusdb/milvus:v2.5.27-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      KNOWHERE_GPU_MEM_POOL_SIZE: "4096"
      CUDA_VISIBLE_DEVICES: "0"
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus:/var/lib/milvus
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  etcd:
  minio:
  milvus:

Creating a CAGRA Index

python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
]
schema = CollectionSchema(fields, description="Document corpus")
collection = Collection("documents", schema)

# Insert your vectors
collection.insert([embeddings, texts])
collection.flush()

# Build the CAGRA index
index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "IP",
    "params": {
        "intermediate_graph_degree": 128,
        "graph_degree": 64,
        "build_algo": "IVF_PQ",    # faster builds; use NN_DESCENT for higher recall
    }
}
collection.create_index("embedding", index_params)
collection.load()

Search Parameters for CAGRA

python
search_params = {
    "metric_type": "IP",
    "params": {
        "itopk_size": 128,    # 64 for throughput, 256 for maximum recall
        "search_width": 4,
    }
}

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
    output_fields=["text"]
)

itopk_size is the primary recall/latency lever. At 64, expect p50 latency under 0.5ms and recall around 0.94. At 256, latency increases to 2-3ms but recall reaches 0.99.

GPU-IVF-PQ for Large Datasets

When the CAGRA index won't fit in GPU VRAM (roughly above 50M vectors at 1536 dimensions), switch to GPU-IVF-PQ:

python
index_params = {
    "index_type": "GPU_IVF_PQ",
    "metric_type": "IP",
    "params": {
        "nlist": 4096,
        "m": 32,
        "nbits": 8,
    }
}

Search with nprobe=64 for balanced recall/throughput. GPU-IVF-PQ stores compressed vectors in VRAM, reducing the VRAM footprint by roughly 4-8x at the cost of 3-5% recall loss.
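
The query-side counterpart, with nprobe as the recall/throughput knob:

python
search_params = {
    "metric_type": "IP",
    "params": {"nprobe": 64},  # IVF clusters probed per query; raise for recall
}
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
)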

Sharding and Replica Strategy

python
# At collection creation time
collection = Collection(
    "large_corpus",
    schema,
    shards_num=4,         # 1 shard per 50M vectors is a reasonable starting point
)

# Load with replicas for query node redundancy
collection.load(replica_number=2)

For 100M+ vector collections, distribute across 2-4 GPU nodes using Milvus Distributed mode with a dedicated query node pool.

Weaviate Production Deployment

Weaviate's GPU acceleration comes through the vectorizer module tier, not ANN search. The text2vec-transformers module runs on GPU; the HNSW index runs on CPU. This is the opposite of Milvus: Weaviate offloads vectorization to GPU, Milvus offloads indexing and search.

Docker Compose with GPU Vectorizer

Pin to a specific minor version in production; track the Weaviate releases page for upgrades.

yaml
version: "3.8"
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.32
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: "25"
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: "text2vec-transformers"
      TRANSFORMERS_INFERENCE_API: "http://t2v-transformers:8080"
      DEFAULT_VECTORIZER_MODULE: "text2vec-transformers"
      CLUSTER_HOSTNAME: "node1"
    volumes:
      - weaviate:/var/lib/weaviate

  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-mpnet-base-v2
    environment:
      ENABLE_CUDA: "1"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  weaviate:

Choosing the Vectorizer Model

Model | Params | Latency on GPU | VRAM | Notes
all-MiniLM-L6-v2 | 22M | ~6ms/batch | ~150MB | Lowest VRAM, good for dev
all-mpnet-base-v2 | 109M | ~10ms/batch | ~450MB | Balanced quality and speed
Qwen3-Embedding-0.6B (custom) | 600M | ~3ms/batch on H100 | ~1.2GB | Strong multilingual recall

For higher-quality embeddings not available as Weaviate modules, run a separate TEI server and pass pre-computed vectors directly. See self-host embedding and reranker with TEI for the TEI setup.
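
A hedged sketch of that bring-your-own-vectors pattern with the v4 client: the vectorizer is disabled and embeddings are supplied at insert time (the collection name and tei_embedding are illustrative):

python
import weaviate
import weaviate.classes.config as wvc

client = weaviate.connect_to_local(host="localhost", port=8080)

# No vectorizer module: vectors come from the external TEI server
client.collections.create(
    name="DocumentBYOV",
    vectorizer_config=wvc.Configure.Vectorizer.none(),
    properties=[wvc.Property(name="content", data_type=wvc.DataType.TEXT)],
)

collection = client.collections.get("DocumentBYOV")
collection.data.insert(
    properties={"content": "some document text"},
    vector=tei_embedding,  # pre-computed vector from your embedding server
)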

Hybrid Search Configuration

Weaviate hybrid search combines vector similarity (ANN) with BM25 keyword search. Configure the schema class with both enabled:

python
import weaviate
import weaviate.classes.config as wvc

client = weaviate.connect_to_local(host="localhost", port=8080)

client.collections.create(
    name="Document",
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(),
    properties=[
        wvc.Property(
            name="content",
            data_type=wvc.DataType.TEXT,
        )
    ],
    replication_config=wvc.Configure.replication(factor=2),
    inverted_index_config=wvc.Configure.inverted_index(bm25_b=0.75, bm25_k1=1.2),
)

Query with hybrid search using alpha=0.75 to weight vector similarity over BM25:

python
collection = client.collections.get("Document")
results = collection.query.hybrid(
    query="your search query",
    alpha=0.75,
    limit=10,
)

HNSW Index Parameters

Weaviate uses the same HNSW algorithm as Qdrant. Production settings:

python
client.collections.create(
    name="Document",
    vector_index_config=wvc.Configure.VectorIndex.hnsw(
        ef=256,
        ef_construction=200,
        max_connections=64,
        vector_cache_max_objects=2000000,
    ),
)

ef is the query-time parameter (equivalent to Qdrant's hnsw_ef). Increase from the default 100 to 256 for production recall targets above 0.97.

Benchmark Results: Index Build Time and Query Latency

Results below are based on reported benchmarks from the Milvus and cuVS projects. Actual performance varies by dataset characteristics and host configuration.

GPU | Corpus | Index Type | Build Time | p50 Latency | p99 Latency | QPS (top-10) | Recall@10
H100 SXM5 | 10M x 1536 | CAGRA | ~45s | 0.6ms | 1.2ms | 12,000 | 0.97
RTX PRO 6000 | 10M x 1536 | CAGRA | ~80s | 0.9ms | 1.8ms | 8,000 | 0.97
L40S (int8) | 10M x 1536 | CAGRA | ~95s | 1.1ms | 2.2ms | 6,500 | 0.96
CPU (32-core) | 10M x 1536 | HNSW | ~22min | 3.8ms | 8.5ms | 1,200 | 0.95

At 10M x 1536 float32, the raw data is ~58GB plus CAGRA graph overhead of 20-30%, totaling roughly 70-75GB. The L40S (48GB) cannot fit this in VRAM at float32, so the L40S results above use int8 quantization (reducing VRAM to ~14-15GB). The RTX PRO 6000 (96GB) and H100 SXM5 (80GB) both fit the float32 index with headroom for co-located workloads.


Co-locating Vector DB with LLM Inference for Sub-50ms RAG

Managed RAG pipelines (Pinecone + OpenAI API) make two external service calls per query: one to the vector database and one to the LLM. Each adds 30-250ms of network latency. Self-hosting both on one GPU server turns both calls into local loopback requests, removing the network round trips entirely.

Architecture Overview

A single-node co-located stack runs:

  • vLLM on port 8000 for LLM inference
  • Milvus on port 19530 (gRPC) and 9091 (metrics)
  • TEI embedding server on port 8080

All services communicate over localhost. Network latency between components is effectively 0ms.
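
A single query through this stack looks like the sketch below: every hop is a loopback call. It assumes the Milvus collection and CAGRA index from earlier, with the collection dimension matching the embedder's output.

python
import requests
from openai import OpenAI
from pymilvus import connections, Collection

connections.connect("default", host="localhost", port="19530")
collection = Collection("documents")
collection.load()

# 1. Embed the query on the GPU via TEI
query = "How does CAGRA differ from HNSW?"
vec = requests.post("http://localhost:8080/embed",
                    json={"inputs": [query]}, timeout=30).json()[0]

# 2. GPU vector search in Milvus (CAGRA)
hits = collection.search(
    data=[vec], anns_field="embedding",
    param={"metric_type": "IP", "params": {"itopk_size": 128}},
    limit=5, output_fields=["text"],
)
context = "\n".join(hit.entity.get("text") for hit in hits[0])

# 3. Generate with vLLM's OpenAI-compatible endpoint
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
answer = llm.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {query}"}],
)
print(answer.choices[0].message.content)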

VRAM Budget on H100 SXM5 (80GB)

Component | Model/Index | VRAM
LLM (vLLM) | Llama 3.3 70B FP8 | ~70 GB
Vector index (Milvus CAGRA) | 1M x 1536 float32 | ~6 GB
Embedding server (TEI) | BGE-M3 | ~1 GB
Total | | ~77 GB

This leaves about 3GB for CUDA kernel overhead. For larger indexes, use RTX PRO 6000 (96GB) or switch to GPU-IVF-PQ with int8 quantization to reduce index VRAM.

Latency Math for Sub-50ms RAG

Breaking down the 50ms budget:

Step | Local (same GPU node) | Remote (separate services)
Query embedding | 1-3ms | 100-400ms p99
Vector search | 0.5-2ms | 30-250ms p99
LLM TTFT (H100 SXM5) | 15-25ms | 100-600ms p99
Network overhead | 0ms | 30-250ms
Total | 17-30ms | 260-1500ms

The full local round-trip is 17-30ms. The same query routed through external APIs is 260-1500ms. The difference comes entirely from eliminating network hops.

Single docker-compose for the Full Stack

yaml
version: "3.8"
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    command: --model-id BAAI/bge-m3 --port 80
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  milvus:
    image: milvusdb/milvus:v2.5.27-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      KNOWHERE_GPU_MEM_POOL_SIZE: "1024"
      CUDA_VISIBLE_DEVICES: "0"
    ports:
      - "19530:19530"
    volumes:
      - milvus:/var/lib/milvus
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --quantization fp8
      --gpu-memory-utilization 0.85
      --port 8000
    ports:
      - "8000:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    volumes:
      - etcd:/etcd

  minio:
    image: minio/minio:RELEASE.2023-03-13T19-46-17Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    command: minio server /minio_data
    volumes:
      - minio:/minio_data

volumes:
  etcd:
  minio:
  milvus:

Set --gpu-memory-utilization 0.85 in vLLM to leave VRAM headroom for the Milvus CAGRA index. The KNOWHERE_GPU_MEM_POOL_SIZE of 1024MB reserves 1GB for CAGRA; increase this to match your index size.

For agent-specific memory (Mem0, Zep) on the same node, see agent memory infrastructure guide. For vLLM configuration details, see vLLM production deployment 2026.

Cost Economics: On-Demand vs Reserved Pricing for Vector DB Workloads

Three representative deployment scenarios, using live prices where available.

Scenario | GPU | On-Demand Price | Monthly (730hr) | Managed alternative
Dev / 1M vectors | L40S 48GB | $1.80/hr | ~$1,314/mo | Pinecone Starter: free, then $70+/mo
Production RAG / 10M vectors | RTX PRO 6000 96GB | $1.70/hr | ~$1,241/mo | Pinecone Standard: ~$500-2000/mo
Enterprise / H100 SXM5 | H100 SXM5 80GB | $2.90/hr on-demand, $0.80/hr spot | ~$2,117/mo on-demand, ~$584/mo spot | OpenAI Embeddings + Pinecone: $2000-8000/mo

The H100 SXM5 on-demand rate is $2.90/hr per GPU as of 29 Apr 2026, with spot pricing at $0.80/hr per GPU. For always-on vector database instances, reserved pricing reduces costs further. The vector DB itself runs on CPU/NVMe; the GPU cost covers the co-located embedding server and (optionally) the LLM inference node.

Beyond the raw hardware cost, managed vector databases add per-query charges ($0.008-0.04 per 1K queries on Pinecone) and data transfer fees that accumulate quickly at scale.

For broader GPU cost strategies, see the GPU cost optimization playbook. For production reliability patterns, checkpointing, health monitoring, and failover, see production GPU cloud architecture.

Co-locating Qdrant, Milvus, or Weaviate with your LLM inference stack on the same GPU node removes the biggest source of RAG latency: the network round trip. Spheron provides on-demand and reserved H100, B200, RTX PRO 6000, and L40S instances where you can run the full vector search and inference stack without split-cloud egress costs.

Rent H100 on Spheron | Rent RTX PRO 6000 | View all GPU pricing
