Engineering

Agentic RAG on GPU Cloud: Deploy Embedding, Vector Search, and LLM on One Stack (2026)

Written by Mitrasish, Co-founder · Apr 10, 2026

Tags: Agentic RAG · RAG · GPU Cloud · Vector Search · LLM Inference · AI Agents · Embedding Models · H100 · L40S · GPU Infrastructure

Most agentic RAG latency problems aren't in the retrieval logic. They're in the three network round-trips between your embedding API, your vector database, and your LLM API. Fix the hardware layer and the latency problem largely solves itself. The RAG pipeline bare metal case study shows exactly this: one company cut p99 latency from 1.8 seconds to 190ms by colocating all three components on the same GPU server. For the VRAM math behind agent workloads specifically, the GPU infrastructure for AI agents guide covers memory planning in detail.

This guide focuses on the infrastructure layer: how to plan GPU memory, what components to colocate, and how to get the full embedding, vector search, and LLM stack running on one node.

What Is Agentic RAG and Why It Needs Dedicated GPU Infrastructure

Standard RAG does one thing: retrieve relevant documents, inject them into context, generate a response. One retrieval pass. One LLM call.

Agentic RAG is different. The model decides what to retrieve, evaluates the results, and may retrieve again if the context is insufficient. A typical agentic RAG loop runs 3-7 LLM calls per user turn: initial planning, retrieval, re-ranking, reflection, and possibly a follow-up retrieval pass before generating the final answer.

At 300ms per LLM call via a managed API, three calls is 900ms before you've generated a single output token. Seven calls is 2.1 seconds. That's before network latency to your embedding API (100-400ms p99) and your vector database (30-250ms p99).

GPU colocation solves this. When your embedding model, vector index, and LLM all live on the same GPU server:

  • Query encoding drops from 100-400ms to 2-5ms
  • Vector search drops from 30-250ms to 1-3ms
  • LLM TTFT drops from 600-1500ms to 30-80ms

The round-trip overhead disappears entirely because there are no round trips. Components communicate over local memory, not the network.

For multi-agent systems where many agents share the same retrieval infrastructure, see multi-agent AI system GPU infrastructure for orchestration patterns.

The GPU-Accelerated RAG Stack: Three Components, One Node

Embedding Model

Embedding throughput on GPU is not a minor improvement over CPU. stella_en_1.5B_v5 on an H100 encodes tens of thousands of sentences per second; on CPU, that number is closer to 500. If your agentic loop re-encodes the query at each retrieval step, CPU-based embedding becomes a hard bottleneck at any meaningful concurrency.

Model options by VRAM requirement:

| Model | Params | VRAM (FP16) | Notes |
| --- | --- | --- | --- |
| BGE-M3 | 0.5B | ~1 GB | Multilingual, good for hybrid search |
| stella_en_1.5B_v5 | 1.5B | ~3 GB | Strong benchmark scores, low VRAM |
| E5-mistral-7b-instruct | 7B | ~14 GB | Best quality, expensive in VRAM |

VRAM figures above are model weight storage estimates. Actual inference VRAM will be higher due to KV cache, activations, and framework overhead.

Since vLLM v0.6.4, you can serve embedding models directly through vLLM using --task embed. This gives you a unified serving stack for both embeddings and generation, using the same OpenAI-compatible API surface.

GPU-Accelerated Vector Search

Two main options:

FAISS-GPU (faiss-gpu-cu12): in-memory approximate nearest neighbor search on CUDA. For 10M vectors at 1536 dimensions, FAISS-GPU on an H100 returns results in single-digit milliseconds. The same search on CPU takes around 80ms. FAISS is purely in-memory, so it's the fastest option for real-time serving.

VRAM for FAISS is deterministic: vectors × dimensions × 4 bytes. For 1M vectors at 768 dimensions: 1,000,000 × 768 × 4 = 3 GB. Add roughly 20% for GPU runtime overhead when sizing your instance.
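Since the sizing rule is deterministic, it can be wrapped in a small helper; this is a sketch, where the 20% `overhead` default is the runtime margin mentioned above, not a measured constant:

```python
def faiss_flat_vram_gb(n_vectors: int, dim: int, overhead: float = 0.20) -> float:
    """Estimate VRAM for a flat (uncompressed) FAISS index: vectors x dims x 4 bytes."""
    raw_bytes = n_vectors * dim * 4  # float32 storage
    return raw_bytes * (1 + overhead) / 1e9

# 1M vectors at 768-dim: ~3 GB raw, ~3.69 GB with the runtime margin
print(round(faiss_flat_vram_gb(1_000_000, 768), 2))  # → 3.69
```

Running the same calculation for the corpora discussed below (5M and 10M vectors at 1536-dim) reproduces the ~31 GB and ~61 GB raw figures used in the configurations.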

Milvus with GPU indexing: persistent vector database with optional GPU-accelerated index building. Index build on GPU is 10-50x faster than CPU. Good for multi-tenant workloads or corpora that update frequently, at the cost of higher operational complexity.

For most agentic RAG deployments with a single corpus under 50M vectors, FAISS-GPU in-memory is the right choice. For multi-tenant or frequently-updated corpora, Milvus with GPU indexing.

LLM Inference Server

vLLM is the default: OpenAI-compatible API, continuous batching, PagedAttention for KV cache management. For agentic RAG specifically, you need a model with reliable structured output and function-calling support. Llama 3.3 70B Instruct and Qwen3-8B both work well. Mistral 7B v0.3 is a solid choice if VRAM is tight.

For the serving setup details, see LLM serving optimization: continuous batching and PagedAttention. For VRAM sizing by model, see the GPU memory requirements for LLMs guide.

GPU Memory Planning: Fitting the Full Stack on One Node

This is where most agentic RAG deployments get into trouble. Each component looks manageable in isolation. Together, they compete for the same VRAM pool.

Full VRAM breakdown by component:

| Component | Model Example | VRAM (FP16) | VRAM (FP8/INT8) |
| --- | --- | --- | --- |
| Embedding | stella_en_1.5B_v5 | 3 GB | 1.5 GB |
| Embedding | E5-mistral-7b | 14 GB | 7 GB |
| Vector Index | FAISS-GPU, 1M vectors (768-dim) | ~3 GB | N/A |
| Vector Index | FAISS-GPU, 5M vectors (1536-dim) | ~31 GB | N/A |
| Vector Index | FAISS-GPU, 10M vectors (1536-dim) | ~61 GB | N/A |
| LLM | Llama 3.1 8B FP8 | 8 GB | 8 GB |
| LLM | Mistral 7B v0.3 | 14 GB | 7 GB |
| LLM | Llama 3.3 70B FP8 | ~70 GB | ~70 GB |
| KV Cache buffer | 50 concurrent agentic sessions | 10-20 GB | varies |

VRAM figures for embedding and LLM rows are model weight storage estimates. Actual inference VRAM is higher due to KV cache, activations, and framework overhead.

Three practical configurations:

Option 1: L40S 48GB, budget stack

  • Embedding: stella_en_1.5B_v5 FP16 (3 GB)
  • Vector index: FAISS-GPU, up to 1M vectors at 768-dim (~3 GB)
  • LLM: Llama 3.1 8B FP8 (8 GB)
  • Committed: ~14 GB. Remaining ~34 GB for KV cache.
  • Best for: up to 20 concurrent agents, small-to-mid corpora (under 1M documents)

Option 2: H100 80GB, single GPU standard stack

  • Embedding: stella_en_1.5B_v5 FP16 (3 GB)
  • Vector index: FAISS-GPU, up to 5M vectors at 1536-dim (~31 GB)
  • LLM: Llama 3.1 8B FP8 (8 GB)
  • Committed: ~42 GB. Remaining ~38 GB for KV cache.
  • Best for: up to 50 concurrent agents, enterprise corpora (under 5M documents)

Option 3: 4x H100 80GB, production stack with large corpus

  • Embedding: E5-mistral-7b FP16 on GPU-0 (14 GB)
  • Vector index: FAISS-GPU, 10M vectors at 1536-dim on GPU-1 (~61 GB, dedicated GPU)
  • LLM: Llama 3.3 70B FP8, tensor-parallel across GPU-2 and GPU-3 (~35 GB per GPU)
  • GPU-0 committed: ~14 GB. Remaining ~66 GB for activations and embedding KV cache.
  • GPU-1 committed: ~61 GB. Remaining ~19 GB covers FAISS runtime overhead (~20%).
  • GPU-2/GPU-3 committed: ~35 GB each. Remaining ~45 GB per GPU for LLM KV cache.
  • Best for: high-concurrency production, large corpora (up to 10M documents)

FAISS is isolated to its own GPU because 10M vectors at 1536-dim (~61 GB) plus E5-mistral-7b FP16 (14 GB) would total ~75 GB before FAISS runtime overhead — leaving insufficient headroom on a single 80 GB card.
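The per-GPU budgeting above reduces to simple arithmetic. A sketch, where the component figures are the weight estimates from the table and the 10 GB default reserve mirrors the KV cache buffer row:

```python
def fits_on_gpu(gpu_vram_gb: float, components_gb: dict, kv_reserve_gb: float = 10.0) -> bool:
    """True if component weights plus a minimum KV-cache reserve fit in VRAM."""
    committed = sum(components_gb.values())
    free = gpu_vram_gb - committed
    print(f"committed {committed:.0f} GB, free {free:.0f} GB")
    return free >= kv_reserve_gb

# Option 2 above: H100 80GB with embedding + 5M-vector index + 8B LLM → fits
fits_on_gpu(80, {"stella_fp16": 3, "faiss_5m_1536d": 31, "llama_8b_fp8": 8})
```

The same check against an L40S 48GB with E5-mistral-7b and the 5M-vector index comes back negative, which is exactly why Option 3 dedicates a GPU to the index.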

NVIDIA Agentic RAG Toolkit: What It Is and How to Use It

NVIDIA released an official agentic RAG reference architecture as part of their AI Blueprint program in 2025-2026. The toolkit bundles NVIDIA NIM microservices for LLM and embedding inference, NVIDIA cuVS (a GPU-accelerated vector search library integrated into FAISS since v1.10) for vector search, and LangChain/LlamaIndex integration layers.

The architecture: NIM serves both the embedding model and LLM via OpenAI-compatible APIs. cuVS handles vector search on the GPU. The integration layer wires them together with agent orchestration.

To run this on Spheron, pull the NIM containers, provision an H100 instance, and configure cuVS as the vector backend. See the NVIDIA NIM self-hosted deployment guide for the step-by-step NIM setup.

One important note: NVIDIA NIM requires NVIDIA AI Enterprise licensing for production use. For teams running open models without enterprise licensing, the open-source stack below (vLLM + FAISS-GPU) avoids that overhead entirely and matches or exceeds NIM performance on most workloads.

Step-by-Step: Deploy an Agentic RAG Pipeline on Spheron

Step 1: Provision a GPU Instance

Choose your GPU based on the VRAM table above. H100 PCIe for a standard stack, L40S for budget. Provision via the Spheron console at app.spheron.ai or the CLI.

```bash
# H100 PCIe gives you 80GB VRAM and full root access
# L40S gives you 48GB VRAM at roughly 3x lower hourly cost
```

Step 2: Install the Stack

```bash
pip install vllm faiss-gpu-cu12 sentence-transformers langchain-community
```

Note: faiss-gpu-cu12 is the correct package for CUDA 12.x. The generic faiss-gpu package may silently install CPU-only FAISS depending on your environment. Check your CUDA version with nvcc --version before installing.

Step 3: Start the Embedding Server

```bash
vllm serve dunzhang/stella_en_1.5B_v5 \
  --task embed \
  --port 8001 \
  --dtype bfloat16
```

This exposes an OpenAI-compatible embeddings endpoint at http://localhost:8001/v1/embeddings. You can test it immediately:

```bash
curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "dunzhang/stella_en_1.5B_v5", "input": "test query"}'
```

Step 4: Build the FAISS-GPU Index

```python
import faiss
import numpy as np

d = 1536  # embedding dimension
res = faiss.StandardGpuResources()

# IndexFlatIP for cosine similarity (use normalized vectors)
index_flat = faiss.IndexFlatIP(d)
gpu_index = faiss.index_cpu_to_gpu(res, 0, index_flat)

# Add document vectors (replace with your actual embeddings)
vectors = np.load("document_embeddings.npy").astype("float32")
faiss.normalize_L2(vectors)
gpu_index.add(vectors)

print(f"Index ready: {gpu_index.ntotal} vectors on GPU")
```

For persistence between restarts, serialize the index to CPU, save to disk, and reload:

```python
# Save
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
faiss.write_index(cpu_index, "vectors.index")

# Load (on next start)
cpu_index = faiss.read_index("vectors.index")
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
```

Step 5: Start the LLM Server

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype fp8 \
  --port 8000 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.90
```

--enable-auto-tool-choice and --tool-call-parser llama3_json enable structured tool calls, which agentic multi-hop retrieval depends on. --enable-prefix-caching reuses KV cache for shared system prompts across agent turns, cutting prefill cost significantly for multi-hop queries.

Step 6: Wire the Agentic Layer

A minimal Python example using LangChain:

```python
import requests
import faiss
import numpy as np
from langchain_openai import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

EMBED_URL = "http://localhost:8001/v1/embeddings"
LLM_URL = "http://localhost:8000/v1"
EMBED_MODEL = "dunzhang/stella_en_1.5B_v5"
LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct"

def encode_query(text: str) -> np.ndarray:
    resp = requests.post(
        EMBED_URL,
        json={"model": EMBED_MODEL, "input": text},
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    vec = np.array(resp.json()["data"][0]["embedding"], dtype="float32")
    faiss.normalize_L2(vec.reshape(1, -1))
    return vec.reshape(1, -1)

def retrieve(query: str, gpu_index, doc_texts: list, k: int = 5) -> list[str]:
    vec = encode_query(query)
    _, indices = gpu_index.search(vec, k)
    return [doc_texts[i] for i in indices[0] if i != -1]

llm = ChatOpenAI(
    base_url=LLM_URL,
    model=LLM_MODEL,
    api_key="ignored",  # vLLM does not require auth by default
)

def agentic_rag(user_query: str, gpu_index, doc_texts: list) -> str:
    # First retrieval pass
    chunks = retrieve(user_query, gpu_index, doc_texts)
    context = "\n\n".join(chunks)

    messages = [
        SystemMessage(content="You are a helpful assistant. Use the provided context to answer questions. If the context is insufficient, say so."),
        HumanMessage(content=f"Context:\n{context}\n\nQuestion: {user_query}"),
    ]
    return llm.invoke(messages).content
```

For multi-hop retrieval, extend this to loop: after the first LLM call, check if the model signals insufficient context (via a tool call or a structured flag), then retrieve again with a refined query.
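That loop might look like the following sketch, with the retriever and LLM injected as callables so the control flow stays visible. The `INSUFFICIENT:` prefix is an illustrative convention, not a vLLM feature; a production agent would signal via the structured tool calls enabled in Step 5:

```python
def multi_hop_rag(user_query: str, retrieve_fn, llm_fn, max_hops: int = 3) -> str:
    """Retrieve, generate, and retrieve again while the model flags missing context.

    retrieve_fn(query) -> list[str]; llm_fn(context, question) -> str.
    A reply starting with "INSUFFICIENT:" carries a refined query after the colon.
    """
    query = user_query
    answer = ""
    for _ in range(max_hops):
        context = "\n\n".join(retrieve_fn(query))
        answer = llm_fn(context, user_query)
        if not answer.startswith("INSUFFICIENT:"):
            return answer
        query = answer.split(":", 1)[1].strip()  # refined query for the next hop
    return answer  # best effort after max_hops
```

To wire it up, pass a closure over `retrieve` for `retrieve_fn` and a thin wrapper around `llm.invoke` for `llm_fn`.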

Step 7: Verify Latency Baseline

```bash
# Time a single query end-to-end
time curl -s -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "stream": false
  }'
```

Target benchmarks for a healthy colocated stack:

  • Embedding latency: under 5ms per query
  • FAISS search (1M vectors): under 3ms
  • LLM TTFT (70B FP8, short prompt): under 80ms
  • Full single-hop pipeline: under 120ms
  • Full two-hop agentic pipeline: under 200ms
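One way to check each stage against these targets is a small timing helper; a sketch, where `encode_query` and `gpu_index` refer to the objects defined in Steps 4 and 6:

```python
import time

def stage_latency_ms(fn, *args, repeats: int = 20) -> float:
    """Median wall-clock latency of one pipeline stage, in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000)
    samples.sort()
    return samples[len(samples) // 2]

# e.g. stage_latency_ms(encode_query, "test query") should land under 5ms,
# and a 1M-vector gpu_index.search call under 3ms on a healthy colocated stack
```

Using the median rather than the mean keeps one cold-start outlier from skewing the baseline.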

Latency Optimization: Hitting Sub-200ms Time to First Token

Here's why colocated GPU beats managed APIs on latency at every percentile:

| Component | Managed API (p99) | Colocated GPU (p99) |
| --- | --- | --- |
| Query embedding | 200-400ms | 2-5ms |
| Vector search | 50-250ms | 1-3ms |
| LLM TTFT (8B FP8) | 600-1500ms | 30-80ms |
| Total (sequential) | 850-2150ms | 33-88ms |

Four optimizations that push you further:

1. Async embedding and retrieval. Start FAISS search as soon as the embedding result is ready. Don't wait on an unrelated pipeline step. Python asyncio with httpx handles this cleanly.

2. Prefill batching. vLLM's --max-num-batched-tokens controls how many tokens are processed per scheduler step. Set to 8192 or 16384 for high-throughput workloads where multiple agent turns arrive simultaneously.

3. KV cache prefix reuse. For agentic workloads where many requests share the same system prompt, --enable-prefix-caching in vLLM eliminates repeated prefill computation. On a typical 70B model with a 2K-token system prompt, this cuts TTFT by 30-50% for every subsequent request after the first.

4. Speculative decoding. For agentic structured output patterns (JSON tool calls, fixed-format responses), a smaller draft model accelerates generation. vLLM supports this natively.
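Optimization 1 can be sketched with stub coroutines; in real code the `embed` stand-in would become an `httpx.AsyncClient` POST to the vLLM embeddings endpoint, and the synchronous FAISS call is pushed to a thread so it never blocks the event loop:

```python
import asyncio

async def embed(query: str) -> list[float]:
    # stand-in for an async HTTP POST to the embedding server
    await asyncio.sleep(0.002)
    return [0.0] * 1536

async def search(vec: list[float]) -> list[str]:
    # FAISS search is synchronous, so run it off the event loop
    return await asyncio.to_thread(lambda: ["chunk-1", "chunk-2"])

async def retrieve_async(query: str) -> list[str]:
    vec = await embed(query)  # search starts the moment the embedding returns
    return await search(vec)

async def main():
    # several agent turns retrieve concurrently without blocking each other
    return await asyncio.gather(retrieve_async("q1"), retrieve_async("q2"))

results = asyncio.run(main())
```

The gain comes from overlap: while one agent's embedding request is in flight, another agent's FAISS search can already be running.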

For a deep dive on these vLLM parameters, see continuous batching and PagedAttention optimization.

Scaling Patterns: Multi-Agent RAG with Concurrent GPU Sessions

VRAM limits concurrent sessions through KV cache saturation, not compute capacity. The formula:

max_sessions = available_VRAM / (2 × n_layers × n_kv_heads × head_dim × context_length × precision_bytes)

For Llama 3.3 70B FP8 on 2x H100 with ~53GB KV cache headroom: roughly 80-120 concurrent sessions before KV cache eviction kicks in. Llama 3.3 70B uses grouped-query attention (GQA) with n_kv_heads=8 (not the 64 query heads), so use n_kv_heads in this formula.
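The formula, worked for Llama 3.3 70B (80 layers, 8 KV heads under GQA, head_dim 128) with an assumed 4K-token average context and FP8 KV cache; longer contexts shrink the session count proportionally:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, precision_bytes: int) -> float:
    """Per-session KV cache: 2 (K and V) x layers x KV heads x head_dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * precision_bytes / 1e9

# Llama 3.3 70B, FP8 KV cache (1 byte), 4K average context
per_session = kv_cache_gb(80, 8, 128, 4096, 1)  # ~0.67 GB per session
print(int(53 / per_session))                    # sessions in ~53 GB of headroom → 78
```

At a 4K average context this lands at the bottom of the 80-120 range; agent turns with shorter effective contexts push toward the top of it.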

Three scaling patterns:

Single large GPU. One H100 serves all components. Simple to operate, one failure point. Right choice for under 50 concurrent agents.

GPU per component. Dedicated GPU for embedding, dedicated GPU for FAISS, dedicated GPU(s) for the LLM. Each component scales independently. Embedding and FAISS are IO-bound, LLM is compute-bound, so they benefit from independent scaling levers.

Horizontal LLM scaling. Multiple vLLM instances behind a load balancer, each serving the same model. Embedding and FAISS remain on a single shared GPU since their VRAM footprint is modest and they're fast. The LLM tier scales horizontally to match agent concurrency.

For the orchestration layer on top of any of these patterns, see multi-agent AI system GPU infrastructure.

Cost Analysis: Colocated GPU Stack vs Separate Managed APIs

Managed API baseline per 1 million RAG queries (assuming 1K tokens of retrieved context, 256-token query encoding, 500-token LLM output):

| Service | Basis | Cost per 1M queries |
| --- | --- | --- |
| Embedding (OpenAI text-embedding-3-small) | $0.02/1M tokens, 256 tokens/query | $5 |
| Vector DB (Pinecone serverless) | ~$0.08/1M reads | $80 |
| LLM (GPT-4o-mini, 1K input + 500 output) | $0.15/$0.60 per M tokens | ~$525 |
| Total | | ~$610 |

Spheron colocated GPU stack:

At 10 queries/second sustained throughput (a conservative number for a single H100 running an 8B model):

| GPU | Hourly rate | Queries/hr | Cost per 1M queries |
| --- | --- | --- | --- |
| L40S PCIe (on-demand) | $0.72 | ~18,000 | ~$40 |
| H100 PCIe (on-demand) | $2.11 | ~36,000 | ~$59 |
| H100 PCIe (on-demand, 70B FP8) | $2.11 | ~10,800 | ~$195 |

The 70B row is the most apples-to-apples comparison against GPT-4o-mini-equivalent quality. Even at that lower throughput, colocated GPU is roughly 3x cheaper per query than managed APIs. At higher concurrency (50+ agents), the gap widens further, because managed APIs price per query while your GPU cost stays fixed.
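The per-query figures in the table follow from amortizing the hourly rate over sustained throughput; a quick sketch using the table's own throughput assumptions:

```python
def gpu_cost_per_1m_queries(hourly_rate: float, qps: float) -> float:
    """Fixed GPU cost amortized per million queries at sustained throughput."""
    return hourly_rate / (qps * 3600) * 1_000_000

print(round(gpu_cost_per_1m_queries(2.11, 10)))  # H100, 8B model at 10 qps → 59
print(round(gpu_cost_per_1m_queries(2.11, 3)))   # H100, 70B FP8 at ~3 qps → 195
```

The managed-API column has no equivalent lever: its per-query cost is flat regardless of how hard you drive it.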

Pricing fluctuates with GPU availability. The rates above are current as of Apr 10, 2026 and may have changed since. Check current GPU pricing → for live rates.

For a broader cost analysis framework, see AI inference cost economics 2026 and serverless GPU vs on-demand vs reserved.


Running a RAG pipeline at scale means paying per query to three separate APIs, or owning the full stack on dedicated GPU hardware. Spheron's bare-metal H100 and L40S instances let you colocate your embedding model, vector index, and LLM on one node, cutting round-trip latency by up to 20x and cost by 60-70% compared to managed services.

Rent H100 → | Rent L40S → | View all GPU pricing →

Deploy your RAG stack on Spheron →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.