Tutorial

Self-Host Vector Databases on GPU Cloud: Qdrant, Milvus, and Weaviate Production Deployment (2026)

Written by Mitrasish, Co-founder | Apr 29, 2026
Tags: Self-Host Vector Database GPU, Qdrant vs Milvus vs Weaviate, Qdrant Milvus Weaviate Deployment, GPU Accelerated Vector Search, Vector Database GPU Cloud, Milvus CAGRA, HNSW Tuning, Milvus 2.5, Open Source Vector Database, DiskANN, GPU Cloud, Qdrant Deployment, Weaviate Deployment, RAG Pipeline GPU

Most production RAG pipelines split their vector database and LLM inference across two separate services. Every query crosses two network boundaries: one to the vector database (30-250ms p99) and one to the LLM API (100-600ms p99). Self-hosting both on one GPU node removes those boundaries entirely. The RAG pipeline bare metal case study shows a concrete example: one team cut p99 latency from 1.8 seconds to 190ms by co-locating all three components on the same GPU server. For the broader infrastructure context, the agentic RAG GPU infrastructure guide covers GPU memory planning and stack co-location in depth.

This post covers production deployment of three open-source vector databases: Qdrant, Milvus 2.5 (with NVIDIA CAGRA), and Weaviate on Spheron GPU cloud. Topics include CAGRA configuration, HNSW tuning parameters, sharding and replica strategy, co-location with vLLM, and cost economics.

Why Vector Search Needs GPUs in 2026

There are three distinct ways GPUs accelerate vector search workflows. They apply to different databases and different parts of the pipeline.

GPU Vectorization

Embedding throughput on GPU is not a minor improvement. BGE-M3 on an A100 processes roughly 60,000 tokens per second (per TEI benchmark data); H100 SXM5 exceeds this. On CPU, the same model tops out at around 600 tokens/sec. If your RAG pipeline re-encodes queries at every retrieval step, CPU-based embedding is a hard bottleneck at any meaningful concurrency. This benefit applies to all three databases: Qdrant, Milvus, and Weaviate all accept pre-computed vectors, so you can co-locate any GPU embedding server alongside them. For full deployment details on self-hosting embedding models, see self-hosted embeddings with TEI.
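
To make this concrete, here is a minimal sketch of calling a co-located TEI server's /embed route. It assumes the TEI container from the docker-compose examples later in this post, serving BGE-M3 on localhost:8080.

python
import requests

# Batch-embed on the local GPU via TEI's /embed route.
# Assumes the TEI container defined later in this post (localhost:8080).
resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["What is CAGRA?", "How does HNSW build its graph?"]},
    timeout=30,
)
resp.raise_for_status()
embeddings = resp.json()  # one vector per input; BGE-M3 emits 1024 dimensions
print(len(embeddings), len(embeddings[0]))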

GPU-Accelerated ANN Indexing: CAGRA

NVIDIA CAGRA (CUDA ANNS GRAph-based) is a graph-based approximate nearest neighbor algorithm that runs entirely on CUDA. Index build on GPU is dramatically faster than CPU HNSW. For a 10M-vector, 1536-dimension dataset, CAGRA on an H100 SXM5 builds the index in roughly 45 seconds. CPU HNSW takes 18-22 minutes for the same corpus. At query time, CAGRA returns results in under 1ms on GPU versus 3-8ms on CPU HNSW. These numbers come from the NVIDIA cuVS benchmarks and the Milvus 2.5 documentation. Milvus is currently the only open-source vector database with native CAGRA support.

GPU-IVF and GPU-Flat Search

For smaller corpora or situations where exact search is required, GPU-flat (brute-force) search on an H100 returns results in about 2ms for 10M vectors. CPU takes around 150ms for the same operation. GPU-flat is useful for generating recall ground truth and for small corpora where approximate search introduces unacceptable recall loss.
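
In Milvus this is exposed as the GPU_BRUTE_FORCE index type. A hedged sketch for generating recall ground truth, reusing the collection and query_embedding from the Milvus section later in this post:

python
# Exact GPU search for recall ground truth. Assumes the GPU-enabled
# Milvus image and the "documents" collection from the Milvus section.
index_params = {
    "index_type": "GPU_BRUTE_FORCE",  # exact search, no graph build
    "metric_type": "IP",
    "params": {},
}
collection.create_index("embedding", index_params)
collection.load()

ground_truth = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {}},
    limit=100,  # deep top-k to evaluate recall@k of approximate indexes
)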

CAGRA vs HNSW vs DiskANN

HNSW is the standard CPU graph-based ANN algorithm. Qdrant, Weaviate, and Milvus all use HNSW as their CPU path. It performs well for datasets under 50M vectors and has broad tooling support. Build time scales roughly O(n log n), which becomes painful above 50M vectors.

CAGRA runs on GPU and is Milvus-specific. It builds the same style of graph as HNSW but on CUDA, 10-50x faster. Queries run on GPU with sub-millisecond p50 latency. The constraint is that the entire index must fit in GPU VRAM. At 1536-dimension float32 vectors, 10M vectors require about 58GB of VRAM for the raw data; the CAGRA graph index adds 20-30% overhead on top of that. This makes CAGRA practical for corpora up to roughly 50M vectors on a single GPU, and larger corpora with multi-GPU sharding.

DiskANN is an NVMe-backed algorithm where the index graph lives on SSD rather than RAM or VRAM. Query latency is higher than CAGRA or in-memory HNSW (5-20ms p50) but memory requirements are orders of magnitude smaller. Milvus 2.5 supports DiskANN natively, making it the practical path for billion-scale corpora that cannot fit in GPU memory.
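
In Milvus, DiskANN is its own index type. A hedged sketch, again reusing the collection from the Milvus section below; note that DiskANN must be enabled on the query nodes and backed by local NVMe:

python
# DiskANN index: the graph lives on NVMe, so memory stays small even at
# billion scale. search_list (>= limit) is the recall/latency knob.
index_params = {
    "index_type": "DISKANN",
    "metric_type": "IP",
    "params": {},
}
collection.create_index("embedding", index_params)
collection.load()

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"search_list": 100}},
    limit=10,
)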

Choosing Between Qdrant, Milvus, and Weaviate

Feature | Qdrant | Milvus 2.5 | Weaviate
GPU ANN index | No (CPU HNSW only) | Yes (CAGRA, GPU-IVF) | No (CPU HNSW/flat)
Built-in vectorizer | No | No (external) | Yes (GPU modules)
Payload filtering | Excellent | Good | Good
Sharding model | Automatic, custom | Manual shards | Multi-tenancy/shards
Kubernetes operator | Community | Official | Official
License | Apache 2.0 | Apache 2.0 | BSD-3
Server VRAM footprint | ~200MB | ~2-4GB | ~500MB
Recommended tier | RTX PRO 6000 / L40S | H100 SXM5 | L40S / RTX PRO 6000

The decision rule is straightforward. If you need GPU-accelerated ANN search and your corpus is between 1M and 100M vectors, use Milvus with CAGRA. If you need strong payload filtering, simple operations, and don't need GPU ANN search, use Qdrant. If you want GPU vectorization handled by the database itself without managing a separate embedding server, use Weaviate with the text2vec-transformers module.

GPU and VRAM Sizing for Vector Search Workloads

Corpus Size | Dimensions | Raw float32 | PQ-compressed | Recommended GPU | On-Demand Price
1M vectors | 1536 | ~6 GB | ~0.8 GB | L40S 48GB | $1.80/hr
10M vectors | 1536 | ~58 GB | ~8 GB | RTX PRO 6000 96GB | $1.70/hr
100M vectors | 1536 | ~580 GB | ~80 GB | 8x H100 SXM5 sharded | ~$23.20/hr
1B vectors | 768 | ~3 TB | ~370 GB (PQ) | DiskANN + NVMe | N/A (disk-backed)

The 100M+ vector case requires either product quantization (which drops recall by 2-5%) or multi-GPU sharding. For billion-scale corpora, DiskANN on NVMe with GPU-cached index graph layers is the practical path. The GPU memory requirements guide covers VRAM math in depth for LLM co-location planning alongside these numbers.
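
The arithmetic behind the table is simple enough to script. A small sketch; the flat 25% graph overhead is an assumption drawn from the 20-30% CAGRA range quoted earlier:

python
# VRAM estimate in GiB for an in-GPU vector index.
def index_vram_gib(num_vectors: int, dim: int,
                   bytes_per_component: int = 4,       # 4 = float32, 1 = int8
                   graph_overhead: float = 0.25) -> float:  # assumed CAGRA overhead
    raw = num_vectors * dim * bytes_per_component
    return raw * (1 + graph_overhead) / 2**30

print(index_vram_gib(10_000_000, 1536))                         # ~71.5 GiB: 96GB-class GPU
print(index_vram_gib(10_000_000, 1536, bytes_per_component=1))  # ~17.9 GiB: fits an L40S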

Pricing fluctuates based on GPU availability. The prices above are based on 29 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Qdrant Production Deployment on GPU Cloud

Qdrant uses CPU-based HNSW for all ANN search. The GPU benefit for Qdrant comes from co-locating a GPU embedding server on the same node. The GPU handles vectorization; Qdrant handles storage and retrieval.

Docker Deployment and Persistent Storage

bash
docker run -d \
  --name qdrant \
  -p 6333:6333 \
  -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  --restart unless-stopped \
  qdrant/qdrant

Verify the node is healthy:

bash
curl localhost:6333/healthz

HNSW Tuning for Production

Three parameters control HNSW behavior in Qdrant:

  • m (graph connectivity): 16-32 for production. Higher m means more edges per node and better recall, but a larger index: each edge is stored as a 4-byte ID, so link storage grows by roughly m * 4 bytes per vector across the collection.
  • ef_construct (build recall): 100-400. Controls how many candidates are evaluated during index construction. Higher values improve recall at the cost of build time.
  • hnsw_ef (query recall): 64-512. Set at query time via search params. This is the main knob for the recall/latency tradeoff.

Configuration | m | ef_construct | hnsw_ef | Use case
Throughput-optimized | 16 | 100 | 64 | High QPS, recall > 0.93 acceptable
Balanced | 16 | 200 | 128 | General production
Recall-optimized | 32 | 400 | 256 | When recall > 0.99 is required

Create a collection with production HNSW settings:

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, OptimizersConfigDiff

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200, on_disk=False),
    optimizers_config=OptimizersConfigDiff(memmap_threshold=100000)
)
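
hnsw_ef is then set per request. A hedged sketch; query_vector is assumed to come from your embedding server:

python
from qdrant_client import models

hits = client.search(
    collection_name="documents",
    query_vector=query_vector,  # assumed: a 1536-dim vector from your embedder
    limit=10,
    search_params=models.SearchParams(hnsw_ef=128),  # the "Balanced" profile above
)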

Scalar and Binary Quantization

Enable scalar quantization to reduce memory usage by 4x with minimal recall loss:

python
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True)
    )
)

With rescore=true in search params, recall loss is typically 0-3%. Binary quantization gives 32x memory reduction at higher recall loss (2-8%), suitable for very large corpora where approximate results are acceptable.
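
A hedged sketch of a rescored query against the quantized collection; oversampling fetches extra int8 candidates before the full-precision rescore pass:

python
from qdrant_client import models

hits = client.search(
    collection_name="documents_quantized",
    query_vector=query_vector,
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,      # use the quantized index for candidate generation
            rescore=True,      # re-rank candidates with original float32 vectors
            oversampling=2.0,  # fetch 2x candidates before rescoring
        )
    ),
)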

Sharding for Large Collections

Qdrant supports distributed mode with configurable shards. Create a collection with explicit sharding:

python
client.create_collection(
    collection_name="large_corpus",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    shard_number=4,           # 4 shards for 10M+ vector collections
    replication_factor=2      # 2 replicas for production availability
)

Rule of thumb: 1 shard per 5M vectors. For collections above 10M vectors, use Qdrant's distributed mode with at least 2 nodes.

Production docker-compose with TEI Co-location

yaml
version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - ./qdrant_storage:/qdrant/storage
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 32G

  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    command: --model-id BAAI/bge-m3 --port 80
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
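
Wiring the two services together is a loopback round trip: embed on the GPU via TEI, then upsert into Qdrant. A sketch, assuming the collection was created with size=1024 to match BGE-M3's output dimension:

python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)
docs = ["first document", "second document"]

# Embed on the local GPU (TEI container from the compose file above)
vectors = requests.post(
    "http://localhost:8080/embed", json={"inputs": docs}, timeout=60
).json()

# Store vectors plus payload in Qdrant; the collection dimension must
# match the embedder (1024 for BGE-M3)
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(id=i, vector=vec, payload={"text": doc})
        for i, (doc, vec) in enumerate(zip(docs, vectors))
    ],
)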

Milvus 2.5 with NVIDIA CAGRA

This guide targets Milvus 2.5 because it preserves the pure-GPU CAGRA query path. Milvus 2.6 introduced a hybrid CAGRA architecture (GPU for graph construction, CPU for query serving) which scales differently and may be preferable if you have spare CPU cores but limited GPU VRAM. The 2.5 examples below port directly to 2.6 with the same parameters.

Milvus is the only open-source vector database with native GPU ANN support, via NVIDIA cuVS. CAGRA is the graph-based algorithm; GPU-IVF-PQ is the quantization variant for larger datasets.

Prerequisites

  • CUDA 12.x and NVIDIA drivers 535 or later
  • The GPU-enabled Milvus image: milvusdb/milvus:v2.5.27-gpu (not milvusdb/milvus:v2.5)

Using the non-GPU image causes CAGRA to silently fall back to CPU HNSW with no error. Confirm you have the right image with:

bash
docker run --rm milvusdb/milvus:v2.5.27-gpu cat /milvus/configs/milvus.yaml | grep -i cuda

docker-compose Configuration

yaml
version: "3.8"
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    volumes:
      - etcd:/etcd

  minio:
    image: minio/minio:RELEASE.2023-03-13T19-46-17Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    command: minio server /minio_data --console-address ":9001"
    volumes:
      - minio:/minio_data

  milvus:
    image: milvusdb/milvus:v2.5.27-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      KNOWHERE_GPU_MEM_POOL_SIZE: "4096"
      CUDA_VISIBLE_DEVICES: "0"
    ports:
      - "19530:19530"
      - "9091:9091"
    volumes:
      - milvus:/var/lib/milvus
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  etcd:
  minio:
  milvus:

Creating a CAGRA Index

python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
]
schema = CollectionSchema(fields, description="Document corpus")
collection = Collection("documents", schema)

# Insert your vectors
collection.insert([embeddings, texts])
collection.flush()

# Build the CAGRA index
index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "IP",
    "params": {
        "intermediate_graph_degree": 128,
        "graph_degree": 64,
        "build_algo": "IVF_PQ",    # faster builds; use NN_DESCENT for higher recall
    }
}
collection.create_index("embedding", index_params)
collection.load()

Search Parameters for CAGRA

python
search_params = {
    "metric_type": "IP",
    "params": {
        "itopk_size": 128,    # 64 for throughput, 256 for maximum recall
        "search_width": 4,
    }
}

results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
    output_fields=["text"]
)

itopk_size is the primary recall/latency lever. At 64, expect p50 latency under 0.5ms and recall around 0.94. At 256, latency increases to 2-3ms but recall reaches 0.99.

GPU-IVF-PQ for Large Datasets

When the CAGRA index won't fit in GPU VRAM (roughly above 50M vectors at 1536 dimensions), switch to GPU-IVF-PQ:

python
index_params = {
    "index_type": "GPU_IVF_PQ",
    "metric_type": "IP",
    "params": {
        "nlist": 4096,
        "m": 32,
        "nbits": 8,
    }
}

Search with nprobe=64 for balanced recall/throughput. GPU-IVF-PQ stores compressed vectors in VRAM, reducing the VRAM footprint by roughly 4-8x at the cost of 3-5% recall loss.
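
The query-side counterpart, with nprobe as the recall/throughput knob:

python
search_params = {
    "metric_type": "IP",
    "params": {"nprobe": 64},  # IVF clusters probed per query; raise for recall
}
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=10,
)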

Sharding and Replica Strategy

python
# At collection creation time
collection = Collection(
    "large_corpus",
    schema,
    shards_num=4,         # 1 shard per 50M vectors is a reasonable starting point
)

# Load with replicas for query node redundancy
collection.load(replica_number=2)

For 100M+ vector collections, distribute across 2-4 GPU nodes using Milvus Distributed mode with a dedicated query node pool.

Weaviate Production Deployment

Weaviate's GPU acceleration comes through the vectorizer module tier, not ANN search. The text2vec-transformers module runs on GPU; the HNSW index runs on CPU. This is the opposite of Milvus: Weaviate offloads vectorization to GPU, Milvus offloads indexing and search.

Docker Compose with GPU Vectorizer

Pin to a specific minor version in production; track the Weaviate releases page for upgrades.

yaml
version: "3.8"
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.32
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: "25"
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      ENABLE_MODULES: "text2vec-transformers"
      TRANSFORMERS_INFERENCE_API: "http://t2v-transformers:8080"
      DEFAULT_VECTORIZER_MODULE: "text2vec-transformers"
      CLUSTER_HOSTNAME: "node1"
    volumes:
      - weaviate:/var/lib/weaviate

  t2v-transformers:
    image: cr.weaviate.io/semitechnologies/transformers-inference:sentence-transformers-all-mpnet-base-v2
    environment:
      ENABLE_CUDA: "1"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  weaviate:

Choosing the Vectorizer Model

Model | Params | Latency on GPU | VRAM | Notes
all-MiniLM-L6-v2 | 22M | ~6ms/batch | ~150MB | Lowest VRAM, good for dev
all-mpnet-base-v2 | 109M | ~10ms/batch | ~450MB | Balanced quality and speed
Qwen3-Embedding-0.6B (custom) | 600M | ~3ms/batch on H100 | ~1.2GB | Strong multilingual recall

For higher-quality embeddings not available as Weaviate modules, run a separate TEI server and pass pre-computed vectors directly. See self-host embedding and reranker with TEI for the TEI setup.
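
A hedged sketch of that bring-your-own-vectors pattern with the v4 client: the vectorizer is disabled and embeddings are supplied at insert time (the collection name and tei_embedding are illustrative):

python
import weaviate
import weaviate.classes.config as wvc

client = weaviate.connect_to_local(host="localhost", port=8080)

# No vectorizer module: vectors come from the external TEI server
client.collections.create(
    name="DocumentBYOV",
    vectorizer_config=wvc.Configure.Vectorizer.none(),
    properties=[wvc.Property(name="content", data_type=wvc.DataType.TEXT)],
)

collection = client.collections.get("DocumentBYOV")
collection.data.insert(
    properties={"content": "some document text"},
    vector=tei_embedding,  # pre-computed vector from your embedding server
)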

Hybrid Search Configuration

Weaviate hybrid search combines vector similarity (ANN) with BM25 keyword search. Configure the schema class with both enabled:

python
import weaviate
import weaviate.classes.config as wvc

client = weaviate.connect_to_local(host="localhost", port=8080)

client.collections.create(
    name="Document",
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_transformers(),
    properties=[
        wvc.Property(
            name="content",
            data_type=wvc.DataType.TEXT,
        )
    ],
    replication_config=wvc.Configure.replication(factor=2),
    inverted_index_config=wvc.Configure.inverted_index(bm25_b=0.75, bm25_k1=1.2),
)

Query with hybrid search using alpha=0.75 to weight vector similarity over BM25:

python
collection = client.collections.get("Document")
results = collection.query.hybrid(
    query="your search query",
    alpha=0.75,
    limit=10,
)

HNSW Index Parameters

Weaviate uses the same HNSW algorithm as Qdrant. Production settings:

python
client.collections.create(
    name="Document",
    vector_index_config=wvc.Configure.VectorIndex.hnsw(
        ef=256,
        ef_construction=200,
        max_connections=64,
        vector_cache_max_objects=2000000,
    ),
)

ef is the query-time parameter (equivalent to Qdrant's hnsw_ef). Increase from the default 100 to 256 for production recall targets above 0.97.

Benchmark Results: Index Build Time and Query Latency

Results below are based on reported benchmarks from the Milvus and cuVS projects. Actual performance varies by dataset characteristics and host configuration.

GPU | Corpus | Index Type | Build Time | p50 Latency | p99 Latency | QPS (top-10) | Recall@10
H100 SXM5 | 10M x 1536 | CAGRA | ~45s | 0.6ms | 1.2ms | 12,000 | 0.97
RTX PRO 6000 | 10M x 1536 | CAGRA | ~80s | 0.9ms | 1.8ms | 8,000 | 0.97
L40S (int8) | 10M x 1536 | CAGRA | ~95s | 1.1ms | 2.2ms | 6,500 | 0.96
CPU (32-core) | 10M x 1536 | HNSW | ~22min | 3.8ms | 8.5ms | 1,200 | 0.95

At 10M x 1536 float32, the raw data is ~58GB plus CAGRA graph overhead of 20-30%, totaling roughly 70-75GB. The L40S (48GB) cannot fit this in VRAM at float32, so the L40S results above use int8 quantization (reducing VRAM to ~14-15GB). The RTX PRO 6000 (96GB) and H100 SXM5 (80GB) both fit the float32 index with headroom for co-located workloads.


Co-locating Vector DB with LLM Inference for Sub-50ms RAG

Managed RAG pipelines (Pinecone + OpenAI API) make two external service calls per query: one to the vector database and one to the LLM. Each adds 30-250ms of network latency. Self-hosting both on one GPU server turns both calls into local loopback requests, removing the network round trips entirely.

Architecture Overview

A single-node co-located stack runs:

  • vLLM on port 8000 for LLM inference
  • Milvus on port 19530 (gRPC) and 9091 (metrics)
  • TEI embedding server on port 8080

All services communicate over localhost. Network latency between components is effectively 0ms.
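
A single query through this stack looks like the sketch below: every hop is a loopback call. It assumes the Milvus collection and CAGRA index from earlier, with the collection dimension matching the embedder's output.

python
import requests
from openai import OpenAI
from pymilvus import connections, Collection

connections.connect("default", host="localhost", port="19530")
collection = Collection("documents")
collection.load()

# 1. Embed the query on the GPU via TEI
query = "How does CAGRA differ from HNSW?"
vec = requests.post("http://localhost:8080/embed",
                    json={"inputs": [query]}, timeout=30).json()[0]

# 2. GPU vector search in Milvus (CAGRA)
hits = collection.search(
    data=[vec], anns_field="embedding",
    param={"metric_type": "IP", "params": {"itopk_size": 128}},
    limit=5, output_fields=["text"],
)
context = "\n".join(hit.entity.get("text") for hit in hits[0])

# 3. Generate with vLLM's OpenAI-compatible endpoint
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
answer = llm.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {query}"}],
)
print(answer.choices[0].message.content)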

VRAM Budget on H100 SXM5 (80GB)

Component | Model/Index | VRAM
LLM (vLLM) | Llama 3.3 70B FP8 | ~70 GB
Vector index (Milvus CAGRA) | 1M x 1536 float32 | ~6 GB
Embedding server (TEI) | BGE-M3 | ~1 GB
Total | | ~77 GB

This leaves about 3GB for CUDA kernel overhead. For larger indexes, use RTX PRO 6000 (96GB) or switch to GPU-IVF-PQ with int8 quantization to reduce index VRAM.

Latency Math for Sub-50ms RAG

Breaking down the 50ms budget:

Step | Local (same GPU node) | Remote (separate services)
Query embedding | 1-3ms | 100-400ms p99
Vector search | 0.5-2ms | 30-250ms p99
LLM TTFT (H100 SXM5) | 15-25ms | 100-600ms p99
Network overhead | 0ms | 30-250ms
Total | 17-30ms | 260-1500ms

The full local round-trip is 17-30ms. The same query routed through external APIs is 260-1500ms. The difference comes entirely from eliminating network hops.

Single docker-compose for the Full Stack

yaml
version: "3.8"
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:1.9
    command: --model-id BAAI/bge-m3 --port 80
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  milvus:
    image: milvusdb/milvus:v2.5.27-gpu
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
      KNOWHERE_GPU_MEM_POOL_SIZE: "1024"
      CUDA_VISIBLE_DEVICES: "0"
    ports:
      - "19530:19530"
    volumes:
      - milvus:/var/lib/milvus
    depends_on:
      - etcd
      - minio
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.3-70B-Instruct
      --quantization fp8
      --gpu-memory-utilization 0.85
      --port 8000
    ports:
      - "8000:8000"
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    volumes:
      - etcd:/etcd

  minio:
    image: minio/minio:RELEASE.2023-03-13T19-46-17Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    command: minio server /minio_data
    volumes:
      - minio:/minio_data

volumes:
  etcd:
  minio:
  milvus:

Set --gpu-memory-utilization 0.85 in vLLM to leave VRAM headroom for the Milvus CAGRA index. The KNOWHERE_GPU_MEM_POOL_SIZE of 1024MB reserves 1GB for CAGRA; increase this to match your index size.

For agent-specific memory (Mem0, Zep) on the same node, see agent memory infrastructure guide. For vLLM configuration details, see vLLM production deployment 2026.

Cost Economics: On-Demand vs Reserved Pricing for Vector DB Workloads

Three representative deployment scenarios, using live prices where available.

Scenario | GPU | On-Demand Price | Monthly (730hr) | Managed alternative
Dev / 1M vectors | L40S 48GB | $1.80/hr | ~$1,314/mo | Pinecone Starter: free, then $70+/mo
Production RAG / 10M vectors | RTX PRO 6000 96GB | $1.70/hr | ~$1,241/mo | Pinecone Standard: ~$500-2000/mo
Enterprise / H100 SXM5 | H100 SXM5 80GB | $2.90/hr on-demand, $0.80/hr spot | ~$2,117/mo on-demand, ~$584/mo spot | OpenAI Embeddings + Pinecone: $2000-8000/mo

The H100 SXM5 on-demand rate is $2.90/hr per GPU as of 29 Apr 2026, with spot pricing at $0.80/hr per GPU. For always-on vector database instances, reserved pricing reduces costs further. The vector DB itself runs on CPU/NVMe; the GPU cost covers the co-located embedding server and (optionally) the LLM inference node.

Beyond the raw hardware cost, managed vector databases add per-query charges ($0.008-0.04 per 1K queries on Pinecone) and data transfer fees that accumulate quickly at scale.

For broader GPU cost strategies, see the GPU cost optimization playbook. For production reliability patterns, checkpointing, health monitoring, and failover, see production GPU cloud architecture.

Co-locating Qdrant, Milvus, or Weaviate with your LLM inference stack on the same GPU node removes the biggest source of RAG latency: the network round trip. Spheron provides on-demand and reserved H100, B200, RTX PRO 6000, and L40S instances where you can run the full vector search and inference stack without split-cloud egress costs.

Rent H100 on Spheron | Rent RTX PRO 6000 | View all GPU pricing
