Tutorial

Self-Host Embeddings and Rerankers: TEI on GPU Cloud (2026)

Written by Mitrasish, Co-founder · Apr 20, 2026
Tags: Self-Hosted Embeddings · Text Embeddings Inference · RAG · Reranker · BGE-M3 · Jina Embeddings · Qwen3 Embedding · Two-Stage Retrieval · GPU Cloud · LLM Infrastructure

If your RAG pipeline calls OpenAI's embedding API at 100M tokens a month, you're paying roughly $2 to $13 every month for that single component. At 1B tokens, that's $20 to $130. Self-hosting the same models using Hugging Face's Text Embeddings Inference (TEI) server on a GPU cloud instance can cut that cost by 5-20x depending on your utilization and which managed API you're replacing.

This guide covers the current embedding and reranker model landscape as of April 2026, two-stage retrieval architecture, Docker-based TEI deployment on Spheron GPUs, and a cost-per-million-token comparison against managed APIs.

Why Self-Host Embeddings in 2026

The math is straightforward. Here's what managed embedding APIs charge per 1M tokens as of April 2026:

| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Voyage AI | voyage-3-lite | $0.020 |

At different monthly volumes, managed API costs add up quickly:

| Monthly tokens | OpenAI 3-small | OpenAI 3-large | Cohere embed-v4 |
|---|---|---|---|
| 10M | $0.20 | $1.30 | $1.00 |
| 100M | $2.00 | $13.00 | $10.00 |
| 1B | $20.00 | $130.00 | $100.00 |

Pricing based on public documentation as of 20 Apr 2026 and may have changed. Managed API pricing does not include network egress or vector store costs.

A self-hosted BGE-M3 or Qwen3-Embedding on an A100 80GB PCIe at $1.04/hr can process roughly 216M tokens per hour at 60,000 tokens/sec sustained throughput (3,600 seconds times 60,000 tok/s). At 50% utilization, that's $0.0097 per 1M tokens. Even at 25% utilization, you're still at $0.019 per 1M tokens, roughly matching OpenAI's cheapest option while running a significantly better model.

Where the crossover sits depends on how you run the GPU. If you rent on demand for batch indexing jobs and shut the instance down afterward, self-hosting on the A100 starts paying off at around 50-100M tokens per month, and sooner against the more expensive managed options. For an always-on 24/7 deployment, the break-even volumes are much higher; see the break-even analysis later in this guide.

The Embedding Model Landscape: April 2026

Four model families are worth knowing about for production RAG use:

Qwen3-Embedding (0.6B, 4B, 8B variants): The current MTEB leader for size-to-performance ratio as of April 2026. The 0.6B variant runs on any GPU with 8GB+ VRAM and outperforms models twice its size on multilingual retrieval benchmarks. MTEB scores were evaluated at model release and may update as independent benchmarks appear.

Jina Embeddings v4: Notable for LoRA multi-adapter support, which lets you switch between retrieval, classification, and clustering task representations without loading separate model checkpoints. Supports a 128,000-token context window.

BGE-M3: The workhorse for multilingual RAG. Supports dense retrieval, sparse (BM25-style), and multi-vector ColBERT-style representations from a single 0.6B model. Apache 2.0 licensed with no gating. Strong MIRACL retrieval scores across 18 languages.

BGE-large-en-v1.5: A solid English-only baseline at 335M parameters. If you're English-only and not using Qwen3, this is a stable, well-documented choice.

| Model | Params | Max tokens | VRAM required | License |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 32,768 | ~2GB | Apache 2.0 |
| Qwen3-Embedding-4B | 4B | 32,768 | ~8GB | Apache 2.0 |
| Qwen3-Embedding-8B | 8B | 32,768 | ~16GB | Apache 2.0 |
| Jina Embeddings v4 | ~3.75B | 128,000 | ~8GB | Apache 2.0 |
| BGE-M3 | 0.6B | 8,192 | ~2.5GB | Apache 2.0 |
| BGE-large-en-v1.5 | 335M | 512 | ~1.5GB | MIT |

Note: VRAM figures are for model weights only. Add working memory and batch buffers in production.
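As a rule of thumb, the weight-only VRAM figures follow directly from float16 storage at 2 bytes per parameter; a minimal sketch of that arithmetic:

```python
def fp16_weight_vram_gb(params_billions: float) -> float:
    """Approximate VRAM for model weights alone at float16.

    float16 stores 2 bytes per parameter, so N billion parameters
    occupy roughly 2N GB. Activations, batch buffers, and framework
    overhead come on top, which is why the table rounds up.
    """
    return params_billions * 2.0

# Qwen3-Embedding-8B: ~16 GB of weights, matching the table above.
```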

Two-Stage Retrieval Architecture

Single-stage ANN search retrieves fast but imprecisely. A cross-encoder reranker is far more accurate but too slow to score thousands of candidates. The two-stage approach combines both:

  1. Recall stage: Encode the query with your embedding model, run ANN search against your vector index, and retrieve the top-100 candidates. Fast, uses approximate similarity, and optimizes for recall.
  2. Rerank stage: Pair the query with each of the top-100 candidates, score the (query, passage) pairs with a cross-encoder reranker, and return the top-5 or top-10. Slower per pair, but it only runs on 100 candidates and optimizes for precision.

The embedding recall stage is cheap and scales horizontally. The reranker is the bottleneck for latency but dramatically improves the quality of the final context passed to your LLM.

This two-stage pattern consistently outperforms single-stage retrieval on BEIR benchmarks by 5-15% in NDCG@10. For the full RAG stack context, including GPU memory planning and LLM co-location, see the agentic RAG GPU infrastructure guide.
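NDCG@10, the metric cited above, rewards ranking relevant passages near the top of the result list. A minimal implementation for one query, using graded relevance and the standard log2 position discount:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for a single query.

    relevances: graded relevance of each retrieved passage, in the
    order the system ranked them. Returns 1.0 for a perfect ranking.
    """
    def dcg(rels):
        # Discounted cumulative gain: later positions count for less.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A 5-15% NDCG@10 lift means relevant passages move meaningfully closer to the top of the context the LLM actually sees.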

The Reranker Landscape: April 2026

| Model | Params | Languages | VRAM required |
|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | Multilingual | ~2GB |
| Qwen3-Reranker-4B | 4B | Multilingual | ~8GB |
| Jina Reranker v2 | 278M | Multilingual | ~1.5GB |
| BGE Reranker v2-m3 | 568M | Multilingual | ~2.5GB |
| ms-marco-MiniLM-L-12-v2 | 33M | English only | ~0.5GB |

For most production RAG workloads, BGE Reranker v2-m3 is the safe default: solid BEIR scores, multilingual, and well-tested in TEI. Qwen3-Reranker-0.6B outperforms it on instruction-following scenarios where you want to give the reranker explicit guidance about what "relevance" means. ms-marco-MiniLM-L-12-v2 is the right call when latency is the top priority and English is sufficient.

Deploying TEI on Spheron: Step-by-Step

Instance Setup

For production embedding workloads, the A100 80GB PCIe is the right choice: ~1.94TB/s memory bandwidth, 80GB VRAM for large batch caching, and a low per-hour cost. For dev and testing, the RTX 4090 works well and costs less.

GPU pricing on Spheron as of 20 Apr 2026:

| GPU | On-demand price |
|---|---|
| A100 80GB PCIe | from $1.04/hr |
| H100 PCIe | from $2.01/hr |
| RTX 4090 | from $0.79/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Provision via app.spheron.ai, SSH into the instance, and verify the GPU. For a step-by-step walkthrough of provisioning your first instance, see the Spheron getting started guide.

```bash
nvidia-smi
```

Then install Docker if not already available:

```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```

Docker: Embedding Model

```bash
docker run --gpus all --name tei-embed \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-m3 \
  --max-batch-tokens 65536 \
  --max-concurrent-requests 512 \
  --dtype float16
```

Pin the image to a specific version tag (1.9) for reproducibility. The latest tag changes without notice and can break deployments.

Key flags:

  • --max-batch-tokens: Total tokens across all requests in one batch. Set to 65536 for online serving; go higher (131072+) for offline indexing.
  • --max-concurrent-requests: Queue depth before requests are rejected. 512 is a reasonable ceiling for a single A100.
  • --dtype float16: Full float16 precision on Ampere and Hopper. Use bfloat16 if you see NaN issues.
  • --tokenization-workers: Defaults to number of CPU cores. Reduce if CPU is the bottleneck.

Test the endpoint:

```bash
curl http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["What is GPU cloud?", "How does RAG work?"]}'
```
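The response is a JSON array of float vectors, one per input, in the same order. A quick client-side sanity check is to compare vectors with cosine similarity (many embedding models, BGE-M3's dense head among them, emit unit-length vectors, in which case the dot product alone suffices — verify for your model):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (range -1 to 1)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two related sentences should score noticeably higher than two unrelated
# ones. If everything clusters near 1.0 or 0.0, check that you embed
# queries and passages consistently.
```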

Docker: Cross-Encoder Reranker

```bash
docker run --gpus all --name tei-rerank \
  -p 8081:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-reranker-v2-m3 \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 128 \
  --dtype float16
```

The reranker accepts (query, passage) pairs and returns relevance scores. Call /rerank, not /embed:

```bash
curl http://localhost:8081/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does two-stage retrieval work?",
    "texts": [
      "Two-stage retrieval uses a fast embedding recall followed by a cross-encoder reranker.",
      "GPU cloud providers offer H100 and A100 instances.",
      "Vector databases store dense embeddings for ANN search."
    ]
  }'
```

Jina v4 Deployment with Task Prefix

Jina Embeddings v4 uses task prefixes to switch between representations. The task query parameter tells the model which head to use:

```bash
docker run --gpus all --name tei-jina \
  -p 8082:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id jinaai/jina-embeddings-v4 \
  --max-batch-tokens 32768 \
  --dtype float16
```

Call with task-specific parameters:

```bash
# For query embedding
curl "http://localhost:8082/embed?task=retrieval.query" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["my search query"]}'

# For passage/document embedding
curl "http://localhost:8082/embed?task=retrieval.passage" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["document text to index"]}'
```

Example Python Client

```python
import httpx
import numpy as np

def embed(texts: list[str], url: str = "http://localhost:8080") -> list[list[float]]:
    r = httpx.post(f"{url}/embed", json={"inputs": texts})
    r.raise_for_status()
    return r.json()

def rerank(query: str, passages: list[str], url: str = "http://localhost:8081") -> list[dict]:
    r = httpx.post(f"{url}/rerank", json={"query": query, "texts": passages})
    r.raise_for_status()
    return r.json()

# Two-stage retrieval
def retrieve_and_rerank(
    query: str,
    vector_index,  # your FAISS or other ANN index
    corpus: list[str],
    top_k_recall: int = 100,
    top_k_final: int = 10,
) -> list[str]:
    # Stage 1: embedding recall
    query_vec = embed([query])[0]
    query_arr = np.array([query_vec], dtype=np.float32)
    _, indices = vector_index.search(query_arr, top_k_recall)
    candidates = [corpus[i] for i in indices[0] if i >= 0]

    # Stage 2: reranker precision
    # TEI /rerank returns results sorted by score descending, each with an `index`
    # field pointing to the original position in the candidates list.
    scores = rerank(query, candidates)
    return [candidates[s['index']] for s in scores[:top_k_final]]
```
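To exercise retrieve_and_rerank without installing FAISS, a brute-force stand-in with the same search signature works for small corpora. This sketch assumes vectors are unit-normalized, so inner product equals cosine similarity; BruteForceIndex is a hypothetical helper, not part of any library:

```python
import numpy as np

class BruteForceIndex:
    """Drop-in stand-in for a FAISS inner-product index (small corpora only).

    search() mirrors faiss.IndexFlatIP.search: it returns (scores, indices),
    each of shape (num_queries, k).
    """

    def __init__(self, vectors: list[list[float]]):
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, query_arr: np.ndarray, k: int):
        scores = query_arr @ self.vectors.T           # (nq, n) inner products
        order = np.argsort(-scores, axis=1)[:, :k]    # top-k per query, best first
        return np.take_along_axis(scores, order, axis=1), order
```

Swap in a real ANN index (FAISS, HNSW, or your vector database) once the corpus grows past a few tens of thousands of vectors; brute force is O(n) per query.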

Throughput Benchmarks

These numbers are approximate, based on TEI 1.9 with Flash Attention 2 enabled at batch size 512. Actual throughput depends on sequence length, hardware memory bandwidth, and concurrent request patterns.

Embedding Throughput (tokens/sec, batch=512)

| Model | GPU | Approximate tokens/sec |
|---|---|---|
| BGE-M3 | A100 80GB PCIe | ~60,000 |
| BGE-M3 | H100 PCIe | ~90,000 |
| Qwen3-Embedding-0.6B | A100 80GB PCIe | ~55,000 |
| Qwen3-Embedding-0.6B | RTX 4090 | ~20,000 |

Reranker Throughput (pairs/sec, batch=64)

| Model | GPU | Approximate pairs/sec |
|---|---|---|
| BGE Reranker v2-m3 | A100 80GB PCIe | ~800 |
| Jina Reranker v2 | A100 80GB PCIe | ~600 |
| Qwen3-Reranker-0.6B | A100 80GB PCIe | ~700 |

Latency (p50, online serving at batch=1)

| Operation | Approximate latency |
|---|---|
| Single query embed | 5-15ms |
| Single (query, passage) rerank pair | 15-30ms |
| Full two-stage pipeline (embed + top-100 ANN + rerank-to-10) | 30-80ms |

For latency-sensitive production workloads, run the embedding model on the same GPU server as your vector index and LLM to eliminate network round trips. See the agentic RAG GPU infrastructure guide for colocation patterns.

Cost Per 1M Tokens: Self-Hosted vs Managed

Cost Model for Self-Hosted TEI

At 60,000 tok/s on an A100 PCIe at $1.04/hr:

  • 60,000 tok/s x 3,600 s = 216M tokens/hr
  • Cost at 100% utilization: $1.04 / 216 = $0.0048 per 1M tokens
  • Cost at 50% utilization (realistic for inference serving): ~$0.0097 per 1M tokens
  • Cost at 25% utilization: ~$0.019 per 1M tokens

On an RTX 4090 at $0.79/hr and 20,000 tok/s:

  • 20,000 tok/s x 3,600 s = 72M tokens/hr
  • Cost at 50% utilization: ~$0.022 per 1M tokens

These figures assume embedding-only GPU utilization. If you run an LLM on the same instance alongside the embedding server, the effective embedding cost per token is lower since the GPU cost is shared.
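The arithmetic above packaged as a reusable helper; the prices and throughputs are the assumptions stated in this guide, so substitute your own measurements:

```python
def cost_per_1m_tokens(price_per_hour: float, tokens_per_sec: float,
                       utilization: float) -> float:
    """Effective $ per 1M embedded tokens on a rented GPU.

    utilization: fraction of wall-clock time the server spends doing
    useful embedding work (1.0 = fully saturated, 0.5 = half idle).
    """
    useful_tokens_per_hour = tokens_per_sec * 3600 * utilization
    return price_per_hour / (useful_tokens_per_hour / 1_000_000)

# A100 PCIe at $1.04/hr, 60,000 tok/s, 50% utilization -> ~$0.0096 per 1M
```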

Comparison Table

| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Self-hosted (A100 PCIe, 50% util) | BGE-M3 or Qwen3-Embedding | ~$0.010 |
| Self-hosted (RTX 4090, 50% util) | BGE-M3 | ~$0.022 |

Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.

Note: reranking adds a separate GPU cost not included in the embedding figures above. At 800 pairs/sec on an A100 PCIe, the reranker cost is typically small relative to embedding costs unless your pipeline reranks very large candidate sets.

Break-Even Analysis

A 24/7 A100 PCIe rental at $1.04/hr costs $748.80/month. At what monthly volume does that fixed cost break even against managed APIs?

  • vs OpenAI text-embedding-3-small ($0.020/1M): break-even around 37B tokens/month
  • vs Cohere embed-v4 ($0.100/1M): break-even around 7.5B tokens/month
  • vs OpenAI text-embedding-3-large ($0.130/1M): break-even around 5.8B tokens/month

Below these thresholds, managed APIs are cheaper when you factor in operational overhead. Above them, self-hosting wins on cost and gives you model flexibility that managed APIs don't offer.
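The break-even volumes come from dividing the fixed monthly rental by the managed per-token price; a sketch, using 720 hours as the month length assumed by the $748.80 figure:

```python
def break_even_tokens(gpu_price_per_hour: float,
                      api_price_per_1m: float,
                      hours_per_month: int = 720) -> float:
    """Monthly token volume at which a 24/7 GPU rental costs the same
    as a managed embedding API. Above this, self-hosting is cheaper."""
    monthly_gpu_cost = gpu_price_per_hour * hours_per_month  # $748.80 for the A100
    return monthly_gpu_cost / api_price_per_1m * 1_000_000

# vs text-embedding-3-small at $0.020/1M: ~37.4B tokens/month
```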

Monitoring TEI in Production

TEI exposes a Prometheus /metrics endpoint out of the box. Key metrics to track:

  • te_request_duration_seconds: Per-request latency histogram. Alert if p99 exceeds 100ms for online serving.
  • te_batch_size_histogram: Distribution of batch sizes. If most batches are size 1, consider batching client requests to improve throughput.
  • te_queue_size: Number of requests waiting. Alert if queue depth consistently exceeds 20; add capacity or reduce --max-concurrent-requests to trigger earlier backpressure.
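TEI serves these metrics in the standard Prometheus text exposition format. A minimal scraper for label-free gauge and counter lines (histogram series such as request-duration buckets carry labels and are better handled by a real Prometheus client library):

```python
import re

# Matches "metric_name 12.5" but not "# HELP ..." comments or labeled
# series like metric_bucket{le="0.1"} 5.
METRIC_LINE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)\s+([-+0-9.eE]+)\s*$")

def parse_simple_metrics(exposition_text: str) -> dict[str, float]:
    """Extract {metric_name: value} from Prometheus text output."""
    metrics = {}
    for line in exposition_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics

sample = """\
# HELP te_queue_size Number of requests waiting in the queue
# TYPE te_queue_size gauge
te_queue_size 12
"""
# parse_simple_metrics(sample)["te_queue_size"] is 12.0 — below the
# alert threshold of 20 suggested above, so no action needed yet.
```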

Also watch GPU-level metrics:

  • GPU utilization below 20% with queue depth above 5 signals underprovisioned batch sizes.
  • GPU memory above 90% risks OOM on large batches. Reduce --max-batch-tokens if you see OOM errors in logs.

For a full GPU monitoring stack with Prometheus and Grafana dashboards, see the GPU monitoring for ML guide.

Serving Qwen3-Embedding-8B and Larger Models

The 8B variant of Qwen3-Embedding delivers the best MTEB scores in the family and handles very long documents well (up to 32,768 tokens). It requires 16GB+ VRAM for model weights plus working memory, so an A100 80GB or H100 is the right GPU.

```bash
docker run --gpus all --name tei-qwen3-8b \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id Qwen/Qwen3-Embedding-8B \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 256 \
  --dtype float16
```

Throughput will be lower than the 0.6B variant (roughly 10,000-15,000 tok/s on an A100 80GB) but quality is substantially higher for complex retrieval tasks. For LoRA multi-adapter serving on top of Jina v4, the same GPU sizing rules apply.

Related Guides

For inference routing with embedding-based query classifiers, see the LLM inference router guide. For alternative serving engines, the Triton Inference Server deployment guide covers embedding model serving with TensorRT optimization. For full RAG pipeline context including vector search and LLM colocation, the agentic RAG GPU infrastructure guide covers the complete stack.


Embedding and reranker workloads run cheaply on mid-tier GPUs. A100 and even RTX 4090 cover most production RAG volumes without paying for H100-level compute. Run your own TEI stack on Spheron and cut managed embedding costs at 100M+ tokens/month.

Rent A100 → | Rent H100 → | View all pricing →
