Deploy NVIDIA NeMo Retriever on GPU Cloud: Enterprise RAG with Multilingual Embeddings and Citation-Aware Reranking (2026 Guide)

Written by Mitrasish, Co-founder, Jun 3, 2026

A generic TEI deployment gives you fast embeddings. NeMo Retriever gives you fast embeddings with TensorRT-LLM-compiled kernels, a citation-aware reranker, and multilingual support across 26 languages, all in two Docker containers covered by NVIDIA AI Enterprise support entitlements. If your RAG system needs to cite sources at the passage level for compliance, handle non-English documents, or exceed what a PyTorch-backed server can deliver on H100 hardware, NeMo Retriever is the stack to run. This guide covers the full deployment on Spheron bare-metal GPU cloud. For context on deploying other NIM containers, the NVIDIA NIM deployment guide covers the broader NIM catalog including LLM NIMs.

One naming note before diving in: NeMo Retriever is specifically the retrieval NIM family (embedding + reranker). It is distinct from NeMo Curator (data curation), NeMo Guardrails (safety rails), and the broader NeMo training framework. This post covers the retrieval NIM stack only.

What NeMo Retriever Is (and What It Is Not)

NeMo Retriever ships two NIM containers designed to work together:

Embedding NIMs:

  • nv-embedqa-e5-v5 for English-only workloads. Faster, smaller, optimized for English BEIR benchmarks.
  • llama-3.2-nv-embedqa-1b-v2 for multilingual workloads. 26 languages, up to 2048-dimensional vectors (Matryoshka-configurable to 384/512/768/1024/2048), built on a multilingual Llama checkpoint.

Reranker NIM:

  • nv-rerankqa-mistral-4b-v3: a 4B parameter cross-encoder reranker fine-tuned for passage relevance scoring. Returns each passage's index in the request alongside its relevance score, which is what lets you attribute every ranked result to a source passage.

What NeMo Retriever is NOT: it is not a vector database, not a document parser, and not a full LLM. It sits in the middle of your RAG pipeline, between your document chunker and your vector store on the indexing side, and between your ANN retrieval and your LLM on the query side.

The core difference from a TEI embedding and reranker deployment is the backend. TEI uses Flash Attention 2 and PyTorch. NeMo Retriever uses TensorRT-LLM compiled kernels, which are GPU-architecture specific and compiled at container start time. On an H100 SXM5, that compilation gives you roughly 1.8x embedding throughput over TEI with the same model class. The tradeoff is flexibility: TEI runs any Hugging Face embedding model; NeMo Retriever runs NVIDIA's curated set.

NeMo Retriever Architecture

The pipeline has two phases:

Indexing phase:

Document corpus
  → Chunker / Document Intelligence layer
  → Embedding NIM (nv-embedqa or multilingual NIM) → vector store (Milvus/Qdrant/Weaviate)

Query phase:

Query
  → Embedding NIM → ANN retrieval → top-K candidates
  → Reranker NIM → top-N ranked passages with attribution
  → LLM (any OpenAI-compatible endpoint)

The vector store is the component you deploy separately. NeMo Retriever does not bundle a vector database. For a full deployment guide covering Milvus CAGRA, HNSW tuning, and DiskANN, see the self-host vector database guide.
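The query phase above can be sketched as a small orchestration function. This is a minimal illustration, not NVIDIA's reference pipeline: `embed`, `search`, and `rerank` are hypothetical callables standing in for your embedding client, vector store client, and reranker client, and the only assumption about the reranker is that it returns items with `index` and `logit` fields.

```python
from typing import Callable, Sequence

# Hypothetical stand-ins for your own clients (names are assumptions):
#   embed(query)            -> query vector
#   search(vector, k)       -> up to k candidate passages from the vector store
#   rerank(query, passages) -> list of {"index": int, "logit": float}
def retrieve(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list],
    rerank: Callable[[str, list], list],
    top_k: int = 50,
    top_n: int = 5,
) -> list:
    """Embed the query, fetch top_k ANN candidates, rerank, keep top_n."""
    vector = embed(query)
    candidates = search(vector, top_k)
    if not candidates:
        return []
    rankings = rerank(query, candidates)
    # Sort by logit (highest first) instead of trusting the response order.
    ordered = sorted(rankings, key=lambda r: r["logit"], reverse=True)
    # "index" points into the candidate list that was sent to the reranker.
    return [(candidates[r["index"]], r["logit"]) for r in ordered[:top_n]]
```

Keeping the three stages behind plain callables means the vector store can be swapped without touching the embedding or reranking code. The `(passage, logit)` pairs it returns are what you would pass to the LLM as context.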

Hardware Requirements and GPU Sizing

NIM containers use TensorRT-LLM profiles compiled for specific GPU architectures. Running an H100 NIM on an A100 will fall back to a less-optimized profile; it works, but you lose some throughput. Match your GPU to the target deployment tier from the start.

| Use case | Minimum GPU | Recommended GPU | Spheron on-demand | Notes |
| --- | --- | --- | --- | --- |
| Dev / small corpus (<5M chunks) | L40S 48GB | L40S 48GB | $0.96/hr | Both NIMs fit on one GPU |
| Medium corpus (5-50M chunks) | A100 80GB | A100 80GB PCIe | $1.48/hr | Separate GPU for reranker at scale |
| Large corpus (>50M chunks) | H100 SXM5 | H100 SXM5 | $5.07/hr | Parallel indexing across 2-4 GPUs |
| High-throughput online serving | H100 SXM5 | H200 SXM5 | $5.07/hr (H100) / $5.92/hr (H200) | HBM3e bandwidth advantage for embedding |

Pricing fluctuates with GPU availability. The prices above were recorded on 03 Jun 2026 and may have changed. Check current GPU pricing for live rates.

The reranker NIM (nv-rerankqa-mistral-4b-v3) needs at least 24GB VRAM. On an L40S (48GB), you can colocate both embedding and reranker NIMs on one GPU for dev workloads. For production, separate GPUs prevent the reranker's memory footprint from competing with embedding batch processing.

Deploying NeMo Retriever on Spheron: Step by Step

Prerequisites

  • NGC API key from ngc.nvidia.com (NVIDIA Developer Program membership is free and grants evaluation access on up to 16 GPUs)
  • NVIDIA Container Toolkit installed on your instance
  • Docker and Docker Compose
  • A Spheron bare-metal GPU instance: Spheron H100 instances for production, A100 for medium workloads

NVAIE licensing note: NVIDIA AI Enterprise (NVAIE) is required for production use beyond the Developer Program tier. Self-hosting on Spheron is fully supported under NVAIE. The license is per GPU regardless of which cloud provider the GPU runs on.

Docker Compose Configuration

The cleanest way to run both NIMs is with Docker Compose. Assign each NIM to its own GPU using device_ids.

yaml
version: "3.9"
services:
  nemo-embedding:
    image: nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:latest
    # Pin to a specific digest for production stability; :latest changes without notice
    runtime: nvidia
    environment:
      - NGC_API_KEY=${NGC_API_KEY}
      - NIM_MAX_BATCH_SIZE=64
    volumes:
      # NIM containers read their model cache from /opt/nim/.cache by default
      - /nim-cache:/opt/nim/.cache
    ports:
      - "8080:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  nemo-reranker:
    image: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:latest
    # Pin to a specific digest for production stability; :latest changes without notice
    runtime: nvidia
    environment:
      - NGC_API_KEY=${NGC_API_KEY}
      - NIM_MAX_BATCH_SIZE=16
    volumes:
      # NIM containers read their model cache from /opt/nim/.cache by default
      - /nim-cache:/opt/nim/.cache
    ports:
      - "8081:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

Start both containers:

bash
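export NGC_API_KEY="<your-key>"
# nvcr.io requires a login before images can be pulled; the username is the literal string $oauthtoken
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
docker compose up -d
# Compose prefixes container names with the project name, so follow logs by service name
docker compose logs -f nemo-embedding  # watch for "NIM ready to serve"
docker compose logs -f nemo-reranker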

First start downloads model weights from NGC (~2-7GB per NIM). With a warm cache volume at /nim-cache, subsequent starts skip the download.
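Rather than watching logs by hand, a script can poll each container until it reports ready. The sketch below assumes the `/v1/health/ready` endpoint listed in the production checklist returns HTTP 200 once the NIM can serve. The function names, the 15-minute default timeout, and the 5-second interval are illustrative choices, not NIM defaults.

```python
import time
import urllib.request
from typing import Callable

def http_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError/HTTPError (connection refused, 503 while loading) are OSError subclasses.
        return False

def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ready,
    max_wait_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe(url) until it succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

With the Compose port mapping above, you would call `wait_until_ready("http://localhost:8080/v1/health/ready")` for the embedding NIM and the same path on port 8081 for the reranker before sending traffic.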

Verification

Test the embedding endpoint:

bash
curl -s -X POST http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input": "What is the GPU memory bandwidth of the H100?", "input_type": "query"}' \
  | jq '.data[0].embedding | length'
# Expect: 2048

Test the reranker endpoint:

bash
curl -s -X POST http://localhost:8081/v1/ranking \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nv-rerankqa-mistral-4b-v3",
    "query": {"text": "H100 memory bandwidth"},
    "passages": [
      {"text": "The NVIDIA H100 SXM5 delivers 3.35 TB/s HBM3 bandwidth."},
      {"text": "GPU VRAM determines model capacity."}
    ]
  }' | jq '.rankings'
# Expect: array with index and logit score per passage

Helm / Kubernetes Deployment

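For Kubernetes, NVIDIA publishes Helm charts on NGC at https://helm.ngc.nvidia.com/nvidia/charts/. The NGC Helm repository requires authentication with your NGC API key. The command below fetches the nim-llm chart as an example; check the NGC catalog for the chart name and version that match your embedding NIM:

bash
# The username is the literal string $oauthtoken; the password is your NGC API key
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/nim-llm-1.0.0.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"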

Override values.yaml to set your GPU resource request and NGC API key secret:

yaml
resources:
  requests:
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

ngcAPIKey:
  secretName: ngc-api-key-secret
  secretKey: NGC_API_KEY

For the full Kubernetes GPU setup including cluster networking and node selectors, see the Kubernetes GPU orchestration guide.

Multilingual Embedding Pipelines

The llama-3.2-nv-embedqa-1b-v2 NIM is the multilingual variant. It handles 26 languages (English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish) and supports cross-lingual retrieval: you can index a French corpus and query it in English without any translation layer. The English-only nv-embedqa-e5-v5 is approximately 20% faster but only handles English text.

When to use each:

  • English-only corpus, latency-sensitive serving: use nv-embedqa-e5-v5
  • Any non-English documents or mixed-language corpus: use llama-3.2-nv-embedqa-1b-v2

Testing multilingual embedding and verifying the vector dimensions:

python
import requests

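resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": "¿Qué es el aprendizaje automático?",
        "input_type": "query",  # retrieval NIMs distinguish "query" from "passage" inputs
    },
)
resp.raise_for_status()  # fail loudly on a 4xx/5xx instead of a KeyError below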
embedding = resp.json()["data"][0]["embedding"]
print(f"Dimensions: {len(embedding)}")  # expect 2048 (default; Matryoshka truncation configurable to 384/512/768/1024)

The two embedding models produce different vector dimensions: nv-embedqa-e5-v5 outputs 1024-dimensional vectors, while llama-3.2-nv-embedqa-1b-v2 outputs 2048-dimensional vectors by default and can be Matryoshka-truncated to 384, 512, 768, or 1024 when index footprint matters. Higher dimensionality improves retrieval quality on complex technical passages at the cost of a larger vector store index.
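To make the dimension tradeoff concrete, here is a sketch of client-side Matryoshka truncation. It keeps the leading dimensions and re-normalizes to unit length, which is the usual way to shorten a Matryoshka-trained embedding. It only makes sense for models trained this way, and if your NIM version offers a server-side dimension setting, that is likely the better route. `truncate_embedding` and `index_size_gib` are illustrative helpers, and the size estimate covers raw float32 vectors only, not index overhead.

```python
import math
from typing import Sequence

def truncate_embedding(vec: Sequence[float], dim: int) -> list:
    """Keep the first dim values of a Matryoshka embedding and L2-normalize."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]

def index_size_gib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GiB (float32 by default), excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 2**30

# Compare raw storage for a 50M-chunk corpus at full and truncated width.
for d in (2048, 1024, 384):
    print(f"{d:>4} dims: {index_size_gib(50_000_000, d):.1f} GiB")
```

All vectors in one index must share the same width, so pick the dimension before indexing and apply the same truncation to query embeddings.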

For a broader look at embedding model selection across vision and text modalities, the multimodal embedding guide covers SigLIP-2, JinaCLIP-v2, and Cohere Embed-v4.

Citation-Aware Reranking: Why It Matters for Enterprise

Standard cross-encoder rerankers score (query, passage) pairs and return a relevance ranking. That is useful but not sufficient for regulated industries.

Citation-aware reranking adds passage-level source attribution to the relevance scores. Each ranked result from the NeMo reranker NIM includes an index (matching back to your original passage array) and a logit relevance score. When you store the original document ID and passage byte offset alongside each chunk before indexing, the reranker output becomes a fully traceable citation chain: from query to passage to source document.

This matters for HIPAA, EU AI Act Article 13, and financial services compliance requirements where the LLM's output must be traceable to a specific, citable passage in a source document.

Contrast this with common alternatives:

  • ms-marco-MiniLM cross-encoders: no built-in attribution, document-level only
  • Cohere Rerank API: black-box, no self-hosted option, no passage-level offsets
  • Custom cross-encoders via TEI: passage attribution requires custom pipeline wiring

The NeMo reranker NIM handles this out of the box.

Example reranker API call:

python
import requests

query = "What is the GPU memory bandwidth of the H100?"
passages = [
    "The NVIDIA H100 SXM5 delivers 3.35 TB/s of HBM3 memory bandwidth.",
    "GPU memory bandwidth is critical for embedding throughput.",
    "VRAM capacity determines how large a model you can load."
]

resp = requests.post(
    "http://localhost:8081/v1/ranking",
    json={
        "model": "nvidia/nv-rerankqa-mistral-4b-v3",
        "query": {"text": query},
        "passages": [{"text": p} for p in passages]
    }
)
rankings = resp.json()["rankings"]
for r in rankings:
    print(f"Passage {r['index']}: score {r['logit']:.3f}")

The index field maps back to your original passages array, so you can reconstruct the full source reference from your indexing metadata.
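As a sketch of that mapping, the function below joins reranker output to per-chunk metadata kept in the same order as the passages sent to the reranker. `ChunkMeta` and its fields (`doc_id`, `byte_start`, `byte_end`) are hypothetical names for whatever your indexing step stores. The only thing taken from the NIM response is the `index` and `logit` pair shown above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical per-chunk metadata recorded at indexing time."""
    doc_id: str
    byte_start: int
    byte_end: int

def build_citations(rankings: list, chunk_meta: list, top_n: int = 3) -> list:
    """Turn reranker rankings into citation records, best passage first.

    chunk_meta[i] must describe the passage sent at position i of the request.
    """
    ordered = sorted(rankings, key=lambda r: r["logit"], reverse=True)
    citations = []
    for rank, r in enumerate(ordered[:top_n], start=1):
        meta = chunk_meta[r["index"]]
        citations.append({
            "rank": rank,
            "doc_id": meta.doc_id,
            "byte_range": (meta.byte_start, meta.byte_end),
            "logit": r["logit"],
        })
    return citations
```

Storing these records next to the LLM response gives an auditor the path from an answer back to the exact byte range in the source document.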

Vector Store Integration

NeMo Retriever is vector-store-agnostic. The embedding NIM exposes an OpenAI-compatible /v1/embeddings endpoint, so any client that supports OpenAI embeddings works with it.

| Vector store | Best for | NeMo Retriever integration |
| --- | --- | --- |
| Milvus 2.5 | Large scale (>100M vectors), GPU-accelerated CAGRA index | Native Python SDK, set embedding endpoint to NeMo NIM URL |
| Qdrant | Ease of deployment, payload filtering, sparse vectors | REST API or client-side embedding with NeMo endpoint |
| Weaviate | Hybrid search (dense + BM25), schema-based | text2vec-openai module configured to NeMo NIM endpoint |

For Milvus CAGRA configuration, Qdrant payload filtering, and Weaviate hybrid search setup, see the self-host vector database guide.

Throughput and Latency Benchmarks: NeMo Retriever vs TEI vs vLLM Embedding

These figures are based on publicly available NVIDIA NIM benchmark data and community measurements. Conditions: single GPU, 512-token passages, batch size 64 for embedding, batch size 16 for reranking. Your numbers will vary by model, chunk size, and batch configuration.

| Setup | GPU | Embedding throughput | Reranking throughput | Notes |
| --- | --- | --- | --- | --- |
| NeMo Retriever (TRT-LLM) | H100 SXM5 | ~180K tok/s | ~1,200 pairs/s | Pre-compiled TRT-LLM kernels |
| Hugging Face TEI (Flash Attn 2) | H100 SXM5 | ~100K tok/s | ~800 pairs/s | PyTorch backend |
| vLLM embedding mode | H100 SXM5 | ~60K tok/s | N/A | Not optimized for embedding; listed for reference |
| NeMo Retriever (TRT-LLM) | A100 80GB PCIe | ~80K tok/s | ~700 pairs/s | Good baseline for cost-sensitive deployments |

The TRT-LLM throughput advantage comes from kernels built for your specific GPU architecture when the container starts, so no compilation cost is paid at request time. The first launch does that work once; every subsequent request uses the pre-compiled path. For more on inference optimization approaches, see the vLLM production deployment guide and the LLM serving optimization guide.

Cost Comparison: Spheron vs NVIDIA NIM API vs Managed Embedding APIs

The table below uses live pricing from the Spheron API fetched on 03 Jun 2026. Monthly GPU cost is ($/hr) × 24 × 30. Per-1M-token cost is derived at 50% GPU utilization.

Baseline: 1B tokens/month, continuous serving

| Option | Provider | GPU | Monthly cost | Per 1M tokens |
| --- | --- | --- | --- | --- |
| Self-hosted NeMo (H100 SXM5) | Spheron | 1x H100 SXM5 | $3,650 | $0.016 at ~50% utilization |
| Self-hosted NeMo (A100 80GB PCIe) | Spheron | 1x A100 80GB PCIe | $1,066 | $0.010 at ~50% utilization |
| NVIDIA NIM API (cloud functions) | NVIDIA NGC | N/A | Varies, per-token | ~$0.016/1M (public catalog) |
| Cohere embed-v4 | Cohere | N/A | $120 | $0.12/1M |
| Voyage AI voyage-3 | Voyage AI | N/A | $60 | $0.060/1M |
| OpenAI text-embedding-3-large | OpenAI | N/A | $130 | $0.130/1M |

The per-token economics favor self-hosting at high utilization. A well-utilized A100 at $0.010/1M tokens undercuts every managed API in the table above. The catch is the fixed monthly base: the A100 costs $1,066/month whether you run 1M or 100B tokens through it. Dividing that fixed cost by the API's per-token price gives the break-even volume: about 8.9B tokens/month against Cohere ($0.12/1M) and about 8.2B tokens/month against OpenAI ($0.130/1M). Below those thresholds, managed APIs are cheaper. Above them, self-hosting on bare-metal GPU hardware wins on cost and keeps your embeddings off third-party APIs.
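The arithmetic behind these figures is short enough to sketch. The helpers below use the formula stated above ($/hr × 24 × 30) together with the peak throughput numbers from the benchmark table. The function names are illustrative, and the break-even treats the GPU as a flat monthly cost with no per-token charge.

```python
HOURS_PER_MONTH = 24 * 30
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600

def monthly_gpu_cost(hourly_usd: float) -> float:
    """Fixed monthly cost of one always-on GPU."""
    return hourly_usd * HOURS_PER_MONTH

def cost_per_million_tokens(hourly_usd: float, peak_tokens_per_s: float,
                            utilization: float = 0.5) -> float:
    """Effective $/1M tokens when the GPU runs at the given utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = peak_tokens_per_s * utilization * SECONDS_PER_MONTH
    return monthly_gpu_cost(hourly_usd) / (tokens_per_month / 1e6)

def break_even_tokens(hourly_usd: float, api_usd_per_million: float) -> float:
    """Monthly token volume at which the fixed GPU cost equals the API bill."""
    return monthly_gpu_cost(hourly_usd) / api_usd_per_million * 1e6

# A100 80GB PCIe at $1.48/hr and ~80K tok/s peak, from the tables above.
print(f"A100 monthly: ${monthly_gpu_cost(1.48):,.0f}")
print(f"A100 per 1M tokens: ${cost_per_million_tokens(1.48, 80_000):.4f}")
for name, price in (("Cohere", 0.12), ("OpenAI", 0.130), ("Voyage", 0.060)):
    print(f"Break-even vs {name}: {break_even_tokens(1.48, price) / 1e9:.1f}B tokens/month")
```

Rerunning this with your own hourly rate and measured throughput is more useful than the table itself, because both inputs move with GPU availability and chunk size.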

One factor that matters specifically for NeMo Retriever: embedding throughput is bandwidth-bound. Shared or virtualized GPU instances partition PCIe bandwidth across tenants, reducing throughput by 10-30% compared to dedicated hardware. Spheron's bare-metal instances give the embedding NIM dedicated HBM bandwidth, which is what the TRT-LLM kernels are compiled to exploit.

Production Checklist

  1. GPU memory tuning: set NIM_MAX_BATCH_SIZE based on your chunk size. For 512-token chunks, 64 is a safe starting point for H100. Increase until GPU memory hits 85% under peak load.
  2. Batch sizing for indexing vs serving: use async batch embedding with batch sizes of 128-256 for offline corpus indexing. Use smaller batches (16-32) for online query embedding to keep p99 latency under control.
  3. NGC cache volume: mount the container's NIM cache directory (/opt/nim/.cache by default) on a persistent volume or fast NVMe disk. Model weights are 2-7GB per NIM; cold starts without cache take 2-10 minutes.
  4. Prometheus observability: each NIM container exposes /metrics on the same port. Scrape it with Prometheus and alert on nim_request_duration_p99 and nim_batch_size_avg.
  5. Health checks: add a Docker healthcheck for the /v1/health/ready endpoint before routing traffic. Both NIMs expose this endpoint.
  6. Reranker pair limits: the reranker NIM accepts up to 256 passage pairs per request. For retrieval pipelines returning more than 256 candidates, paginate the reranking call across multiple requests.
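Item 6 above can be sketched as a helper that splits the candidates into several requests and merges the results. `rerank` is a hypothetical callable wrapping your `/v1/ranking` call and returning `index`/`logit` items. The 256 limit is the figure quoted in this checklist. Merging by logit assumes the cross-encoder scores each (query, passage) pair independently, so scores from different requests are comparable.

```python
from typing import Callable

def rerank_in_batches(
    query: str,
    passages: list,
    rerank: Callable[[str, list], list],
    max_pairs: int = 256,
) -> list:
    """Rerank more passages than one request allows and merge the results.

    Returns {"index", "logit"} items where index refers to the full
    passages list, sorted by logit from highest to lowest.
    """
    if max_pairs < 1:
        raise ValueError("max_pairs must be at least 1")
    merged = []
    for start in range(0, len(passages), max_pairs):
        batch = passages[start:start + max_pairs]
        for r in rerank(query, batch):
            # Shift the per-request index back into the global passage list.
            merged.append({"index": start + r["index"], "logit": r["logit"]})
    merged.sort(key=lambda r: r["logit"], reverse=True)
    return merged
```

In practice it is often cheaper to lower top-K at the ANN stage than to rerank several hundred candidates, so treat this as a fallback rather than the default path.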

For production LLM inference latency budgeting (TTFT and ITL targets for the generation layer), see the LLM inference SLO and latency guide. Embedding and reranking latency tuning is NIM-specific and covered in the production checklist above. To add safety rails to the LLM layer sitting downstream of the reranker, the NeMo Guardrails guide covers the guardrails NIM stack.


NeMo Retriever needs direct GPU memory bandwidth. Bare-metal instances prevent the PCIe sharing that cuts embedding throughput on virtualized GPUs. Spheron's H100 and A100 bare-metal instances let you run the full NIM stack at full hardware speed.

Quick Setup Guide

  1. Get NGC API key and pull NeMo Retriever NIM containers

    Log in to ngc.nvidia.com, generate an API key, and pull the embedding and reranker NIM containers: docker pull nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:latest and docker pull nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:latest. Set NGC_API_KEY as an environment variable before running containers.

  2. Provision a GPU instance on Spheron

    Log into app.spheron.ai and select a bare-metal GPU instance: H100 SXM5 80GB for large-corpus indexing (>50M chunks), A100 80GB for embedding serving, or L40S 48GB for dev/test. Bare metal is recommended for NIM containers because direct GPU access avoids virtualization overhead.

  3. Deploy the embedding NIM and reranker NIM via Docker Compose

    Create a docker-compose.yml with two services: the embedding NIM on port 8080 and the reranker NIM on port 8081, each pinned to its own GPU with device_ids and sharing a volume for the NGC cache. Pass NGC_API_KEY as an environment variable to each service. Run docker compose up -d and follow docker compose logs until each container logs 'NIM ready to serve'.

  4. Configure and connect a vector store

    Deploy Milvus, Qdrant, or Weaviate using their Docker images. Set the embedding endpoint in your RAG pipeline to http://localhost:8080/v1/embeddings. Index your document corpus by chunking documents, calling the embedding NIM in batches, and upserting vectors into the vector store. Use batch sizes of 64-128 chunks for optimal throughput.

  5. Wire citation-aware reranking into the retrieval pipeline

    After ANN retrieval from the vector store, call the reranker NIM's /v1/ranking endpoint with the query and top-K candidate passages. The NIM returns a ranked list with per-passage relevance scores. For citation-aware compliance, attach the original document ID and passage offset to each chunk before indexing so the reranker output maps back to a citable source.

  6. Tune batch size and GPU memory for production throughput

    Set NIM_MAX_BATCH_SIZE in each NIM container's environment variables. For the embedding NIM, start at 64 and increase until GPU memory reaches 85% under peak load. For the reranker NIM, a batch size of 16-32 passage pairs is typical. Monitor throughput with the /metrics Prometheus endpoint exposed by each NIM container.
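The batched indexing loop from steps 4 and 6 might look like the sketch below. `embed_corpus` takes any callable that embeds a list of texts, so it can be exercised without a running NIM. `nim_embed_batch` shows one possible callable built on `requests`, and its `input_type: "passage"` field is an assumption about the NIM request schema that you should confirm against your NIM version.

```python
from typing import Callable, Iterator, Sequence

def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_corpus(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    batch_size: int = 128,
) -> list:
    """Embed chunks batch by batch, preserving input order."""
    vectors = []
    for batch in batched(chunks, batch_size):
        out = embed_batch(batch)
        if len(out) != len(batch):
            raise RuntimeError(f"expected {len(batch)} vectors, got {len(out)}")
        vectors.extend(out)
    return vectors

def nim_embed_batch(
    texts: Sequence[str],
    url: str = "http://localhost:8080/v1/embeddings",
    model: str = "nvidia/llama-3.2-nv-embedqa-1b-v2",
) -> list:
    """One possible embed_batch: POST a list of passages to the embedding NIM."""
    import requests  # imported lazily so the helpers above have no dependencies

    resp = requests.post(
        url,
        # "input_type" is an assumption; indexing would use the passage side.
        json={"model": model, "input": list(texts), "input_type": "passage"},
        timeout=120,
    )
    resp.raise_for_status()
    # OpenAI-style responses carry an "index" per item; sort to match input order.
    data = sorted(resp.json()["data"], key=lambda d: d["index"])
    return [d["embedding"] for d in data]
```

A call such as `embed_corpus(chunks, nim_embed_batch, batch_size=128)` returns vectors in the same order as `chunks`, ready to upsert together with each chunk's document ID and offset.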

Frequently Asked Questions

What is NeMo Retriever?

NeMo Retriever is NVIDIA's family of enterprise-grade RAG NIMs (NVIDIA Inference Microservices) specifically built for embedding and reranking at production scale. Unlike generic embedding servers like TEI, NeMo Retriever containers ship with TensorRT-LLM-compiled kernels for NVIDIA GPUs, a citation-aware reranker, and multilingual embedding models supporting 26 languages out of the box.

Which GPUs does NeMo Retriever run on?

The NeMo Retriever embedding NIM (nv-embedqa-e5-v5 or llama-3.2-nv-embedqa-1b-v2) runs on H100, H200, A100 80GB, and L40S. The reranker NIM (nv-rerankqa-mistral-4b-v3) is a 4B parameter model that requires at least 24GB VRAM; an A100 80GB is the minimum recommended GPU for production use. H100 SXM5 is the best choice for high-throughput indexing pipelines.

How is NeMo Retriever different from TEI?

TEI is a general-purpose inference server for any Hugging Face embedding or reranker model. NeMo Retriever is NVIDIA's curated NIM stack for enterprise RAG: pre-compiled TRT-LLM kernels for NVIDIA GPUs, citation-level passage attribution (not just document-level), NVIDIA AI Enterprise support, and integration with NVIDIA Blueprints. TEI is more flexible; NeMo Retriever is more optimized and enterprise-supported.

Does NeMo Retriever support multilingual embeddings?

Yes. The llama-3.2-nv-embedqa-1b-v2 NIM supports 26 languages using a multilingual checkpoint (including French, Spanish, German, Japanese, Chinese, Arabic, Hindi, and 19 others). Unlike English-only models like nv-embedqa-e5-v5, the multilingual NIM handles cross-lingual retrieval, meaning you can index documents in one language and query in another without translation preprocessing.

Do I need an NVIDIA AI Enterprise license to run NeMo Retriever?

For production use, NVAIE licensing is required. NVIDIA Developer Program members can run NeMo Retriever containers for evaluation on up to 16 GPUs for free. The NGC API key grants container access; NVAIE licensing governs production SLA and support entitlements. Self-hosting on a GPU cloud provider like Spheron is fully supported: NVAIE licensing is per GPU regardless of where the GPU runs.
