If your RAG pipeline calls OpenAI's embedding API at 100M tokens a month, you're paying roughly $2 to $13 every month for that single component. At 1B tokens, that's $20 to $130. Self-hosting the same models using Hugging Face's Text Embeddings Inference (TEI) server on a GPU cloud instance can cut that cost by 5-20x depending on your utilization and which managed API you're replacing.
This guide covers the current embedding and reranker model landscape as of April 2026, two-stage retrieval architecture, Docker-based TEI deployment on Spheron GPUs, and a cost-per-million-token comparison against managed APIs.
Why Self-Host Embeddings in 2026
The math is straightforward. Here's what managed embedding APIs charge per 1M tokens as of April 2026:
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Voyage AI | voyage-3-lite | $0.020 |
At different monthly volumes, managed API costs add up quickly:
| Monthly tokens | OpenAI 3-small | OpenAI 3-large | Cohere embed-v4 |
|---|---|---|---|
| 10M | $0.20 | $1.30 | $1.00 |
| 100M | $2.00 | $13.00 | $10.00 |
| 1B | $20.00 | $130.00 | $100.00 |
Pricing based on public documentation as of 20 Apr 2026 and may have changed. Managed API pricing does not include network egress or vector store costs.
A self-hosted BGE-M3 or Qwen3-Embedding on an A100 80GB PCIe at $1.04/hr can process roughly 216M tokens per hour at a sustained 60,000 tokens/sec (60,000 tok/s x 3,600 s). At 50% utilization, that works out to about $0.0096 per 1M tokens. Even at 25% utilization, you're still at roughly $0.019 per 1M tokens, matching OpenAI's cheapest option while running a significantly better model.
Where the crossover sits depends on how you run the GPU. For batch indexing jobs, where you rent the instance only for the hours you use, self-hosting on an A100 starts winning at roughly 50-100M tokens per month, and sooner against the more expensive managed options. For an always-on server, see the break-even analysis later in this guide.
The Embedding Model Landscape: April 2026
Four model families are worth knowing about for production RAG use:
Qwen3-Embedding (0.6B, 4B, 8B variants): The current MTEB leader for size-to-performance ratio as of April 2026. The 0.6B variant runs on any GPU with 8GB+ VRAM and outperforms models twice its size on multilingual retrieval benchmarks. MTEB scores were evaluated at model release and may update as independent benchmarks appear.
Jina Embeddings v4: Notable for LoRA multi-adapter support, which lets you switch between retrieval, classification, and clustering task representations without loading separate model checkpoints. Supports 128k-token context (128,000 tokens).
BGE-M3: The workhorse for multilingual RAG. Supports dense retrieval, sparse (BM25-style), and multi-vector ColBERT-style representations from a single 0.6B model. Apache 2.0 licensed with no gating. Strong BEIR scores across 18 languages.
BGE-large-en-v1.5: A solid English-only baseline at 335M parameters. If you're English-only and not using Qwen3, this is a stable, well-documented choice.
| Model | Params | Max tokens | VRAM required | License |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 32,768 | ~2GB | Apache 2.0 |
| Qwen3-Embedding-4B | 4B | 32,768 | ~8GB | Apache 2.0 |
| Qwen3-Embedding-8B | 8B | 32,768 | ~16GB | Apache 2.0 |
| Jina Embeddings v4 | ~3.75B | 128,000 | ~8GB | Apache 2.0 |
| BGE-M3 | 0.6B | 8,192 | ~2.5GB | Apache 2.0 |
| BGE-large-en-v1.5 | 335M | 512 | ~1.5GB | MIT |
Note: VRAM figures are for model weights only. Add working memory and batch buffers in production.
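The weight-only VRAM figures in the table follow directly from parameter count times bytes per parameter. As a rough sketch (the overhead beyond weights varies by framework and batch size, so treat results as a floor, not a budget):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights alone: params x bytes (fp16/bf16 = 2 bytes)."""
    return params_billions * bytes_per_param


# BGE-M3 at 0.6B params in fp16: ~1.2GB of weights; the ~2.5GB table figure
# also covers framework overhead and activation buffers.
print(weight_vram_gb(0.6))  # 1.2
print(weight_vram_gb(8))    # 16.0 -- matches the ~16GB Qwen3-Embedding-8B row
```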
Two-Stage Retrieval Architecture
Single-stage ANN search retrieves fast but imprecisely. A cross-encoder reranker is far more accurate but too slow to score thousands of candidates. The two-stage approach combines both:
- Recall stage: Encode the query with your embedding model, run ANN search against your vector index, retrieve top-100 candidates. Fast, uses approximate similarity. Optimizes for recall.
- Rerank stage: Pass the query and top-100 (query, passage) pairs to a cross-encoder reranker, get relevance scores, return the top-5 or top-10. Slower per pair, but only runs on 100 candidates. Optimizes for precision.
The embedding recall stage is cheap and scales horizontally. The reranker is the bottleneck for latency but dramatically improves the quality of the final context passed to your LLM.
This two-stage pattern consistently outperforms single-stage retrieval on BEIR benchmarks by 5-15% NDCG@10. For the full RAG stack context, including GPU memory planning and LLM co-location, see the agentic RAG GPU infrastructure guide.
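To make the recall/rerank split concrete, here is a self-contained toy sketch: brute-force dot-product search stands in for the ANN recall stage, and an exact (more expensive) distance computation stands in for the cross-encoder. The vectors are random, not real model outputs, and the scorers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(1000, 64)).astype(np.float32)  # 1,000 "passages"
# A query that is a slightly perturbed copy of passage 42.
query_vec = corpus_vecs[42] + 0.01 * rng.normal(size=64).astype(np.float32)

# Stage 1 (recall): cheap approximate similarity over the full corpus, top-100.
recall_scores = corpus_vecs @ query_vec
top100 = np.argsort(-recall_scores)[:100]

# Stage 2 (precision): a more expensive exact scorer over only 100 candidates.
# A real pipeline would call a cross-encoder here; negative L2 distance is the stand-in.
rerank_scores = -np.linalg.norm(corpus_vecs[top100] - query_vec, axis=1)
top10 = top100[np.argsort(-rerank_scores)[:10]]

print(int(top10[0]))  # passage 42 should rank first
```

The shape of the computation is the point: the full corpus is only ever touched by the cheap scorer, and the expensive scorer sees a fixed small candidate set.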
The Reranker Landscape: April 2026
| Model | Params | Languages | VRAM required |
|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | Multilingual | ~2GB |
| Qwen3-Reranker-4B | 4B | Multilingual | ~8GB |
| Jina Reranker v2 | 278M | Multilingual | ~1.5GB |
| BGE Reranker v2-m3 | 568M | Multilingual | ~2.5GB |
| ms-marco-MiniLM-L-12-v2 | 33M | English only | ~0.5GB |
For most production RAG workloads, BGE Reranker v2-m3 is the safe default: solid BEIR scores, multilingual, and well-tested in TEI. Qwen3-Reranker-0.6B outperforms it on instruction-following scenarios where you want to give the reranker explicit guidance about what "relevance" means. ms-marco-MiniLM-L-12-v2 is the right call when latency is the top priority and English is sufficient.
Deploying TEI on Spheron: Step-by-Step
Instance Setup
For production embedding workloads, the A100 80GB PCIe is the right choice: ~1.94TB/s memory bandwidth, 80GB VRAM for large batch caching, and a low per-hour cost. For dev and testing, the RTX 4090 works well and costs less.
GPU pricing on Spheron as of 20 Apr 2026:
| GPU | On-demand price |
|---|---|
| A100 80GB PCIe | from $1.04/hr |
| H100 PCIe | from $2.01/hr |
| RTX 4090 | from $0.79/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Provision via app.spheron.ai, SSH into the instance, and verify the GPU. For a step-by-step walkthrough of provisioning your first instance, see the Spheron getting started guide.
```bash
nvidia-smi
```

Then install Docker if not already available:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```

Docker: Embedding Model
```bash
docker run --gpus all --name tei-embed \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-m3 \
  --max-batch-tokens 65536 \
  --max-concurrent-requests 512 \
  --dtype float16
```

Pin the image to a specific version tag (1.9) for reproducibility. The `latest` tag changes without notice and can break deployments.
Key flags:
- `--max-batch-tokens`: Total tokens across all requests in one batch. Set to 65536 for online serving; go higher (131072+) for offline indexing.
- `--max-concurrent-requests`: Queue depth before requests are rejected. 512 is a reasonable ceiling for a single A100.
- `--dtype float16`: Full float16 precision on Ampere and Hopper. Use `bfloat16` if you see NaN issues.
- `--tokenization-workers`: Defaults to the number of CPU cores. Reduce if CPU is the bottleneck.
Test the endpoint:
```bash
curl http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["What is GPU cloud?", "How does RAG work?"]}'
```

Docker: Cross-Encoder Reranker
```bash
docker run --gpus all --name tei-rerank \
  -p 8081:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-reranker-v2-m3 \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 128 \
  --dtype float16
```

The reranker accepts (query, passage) pairs and returns relevance scores. Call `/rerank`, not `/embed`:
```bash
curl http://localhost:8081/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does two-stage retrieval work?",
    "texts": [
      "Two-stage retrieval uses a fast embedding recall followed by a cross-encoder reranker.",
      "GPU cloud providers offer H100 and A100 instances.",
      "Vector databases store dense embeddings for ANN search."
    ]
  }'
```

Jina v4 Deployment with Task Prefix
Jina Embeddings v4 uses task prefixes to switch between representations. The `task` query parameter tells the model which head to use:
```bash
docker run --gpus all --name tei-jina \
  -p 8082:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id jinaai/jina-embeddings-v4 \
  --max-batch-tokens 32768 \
  --dtype float16
```

Call with task-specific parameters:
```bash
# For query embedding
curl "http://localhost:8082/embed?task=retrieval.query" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["my search query"]}'

# For passage/document embedding
curl "http://localhost:8082/embed?task=retrieval.passage" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["document text to index"]}'
```

Example Python Client
```python
import httpx
import numpy as np


def embed(texts: list[str], url: str = "http://localhost:8080") -> list[list[float]]:
    r = httpx.post(f"{url}/embed", json={"inputs": texts})
    r.raise_for_status()
    return r.json()


def rerank(query: str, passages: list[str], url: str = "http://localhost:8081") -> list[dict]:
    r = httpx.post(f"{url}/rerank", json={"query": query, "texts": passages})
    r.raise_for_status()
    return r.json()


# Two-stage retrieval
def retrieve_and_rerank(
    query: str,
    vector_index,  # your FAISS or other ANN index
    corpus: list[str],
    top_k_recall: int = 100,
    top_k_final: int = 10,
) -> list[str]:
    # Stage 1: embedding recall
    query_vec = embed([query])[0]
    query_arr = np.array([query_vec], dtype=np.float32)
    _, indices = vector_index.search(query_arr, top_k_recall)
    candidates = [corpus[i] for i in indices[0] if i >= 0]
    # Stage 2: reranker precision
    # TEI /rerank returns results sorted by score descending, each with an `index`
    # field pointing to the original position in the candidates list.
    scores = rerank(query, candidates)
    return [candidates[s["index"]] for s in scores[:top_k_final]]
```

Throughput Benchmarks
These numbers are approximate, based on TEI 1.9 with Flash Attention 2 enabled at batch size 512. Actual throughput depends on sequence length, hardware memory bandwidth, and concurrent request patterns.
Embedding Throughput (tokens/sec, batch=512)
| Model | GPU | Approximate tokens/sec |
|---|---|---|
| BGE-M3 | A100 80GB PCIe | ~60,000 |
| BGE-M3 | H100 PCIe | ~90,000 |
| Qwen3-Embedding-0.6B | A100 80GB PCIe | ~55,000 |
| Qwen3-Embedding-0.6B | RTX 4090 | ~20,000 |
Reranker Throughput (pairs/sec, batch=64)
| Model | GPU | Approximate pairs/sec |
|---|---|---|
| BGE Reranker v2-m3 | A100 80GB PCIe | ~800 |
| Jina Reranker v2 | A100 80GB PCIe | ~600 |
| Qwen3-Reranker-0.6B | A100 80GB PCIe | ~700 |
Latency (p50, online serving at batch=1)
| Operation | Approximate latency |
|---|---|
| Single query embed | 5-15ms |
| Single (query, passage) rerank pair | 15-30ms |
| Full two-stage pipeline (embed + top-100 ANN + rerank-to-10) | 30-80ms |
For latency-sensitive production workloads, run the embedding model on the same GPU server as your vector index and LLM to eliminate network round trips. See the agentic RAG GPU infrastructure guide for colocation patterns.
Cost Per 1M Tokens: Self-Hosted vs Managed
Cost Model for Self-Hosted TEI
At 60,000 tok/s on an A100 PCIe at $1.04/hr:
- 60,000 tok/s x 3,600 s = 216M tokens/hr
- Cost at 100% utilization: $1.04 / 216 = $0.0048 per 1M tokens
- Cost at 50% utilization (realistic for inference serving): ~$0.0096 per 1M tokens
- Cost at 25% utilization: ~$0.019 per 1M tokens
On an RTX 4090 at $0.79/hr and 20,000 tok/s:
- 20,000 tok/s x 3,600 s = 72M tokens/hr
- Cost at 50% utilization: ~$0.022 per 1M tokens
These figures assume embedding-only GPU utilization. If you run an LLM on the same instance alongside the embedding server, the effective embedding cost per token is lower since the GPU cost is shared.
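The arithmetic above generalizes to any GPU, throughput, and utilization level. A small helper (a sketch; throughput figures remain the estimates from the benchmark tables) reproduces the per-million-token numbers:

```python
def cost_per_million_tokens(
    hourly_rate_usd: float,
    tokens_per_sec: float,
    utilization: float,
) -> float:
    """USD per 1M tokens for a self-hosted embedding server."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_rate_usd / (tokens_per_hour / 1_000_000)


# A100 80GB PCIe at $1.04/hr, 60k tok/s, 50% utilization: ~$0.0096 per 1M tokens
print(round(cost_per_million_tokens(1.04, 60_000, 0.5), 4))
# RTX 4090 at $0.79/hr, 20k tok/s, 50% utilization: ~$0.0219 per 1M tokens
print(round(cost_per_million_tokens(0.79, 20_000, 0.5), 4))
```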
Comparison Table
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Self-hosted A100 PCIe, 50% util | BGE-M3 or Qwen3-Embedding | ~$0.010 |
| Self-hosted RTX 4090, 50% util | BGE-M3 | ~$0.022 |
Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Note: reranking adds a separate GPU cost not included in the embedding figures above. At 800 pairs/sec on an A100 PCIe, the reranker cost is typically small relative to embedding costs unless your pipeline reranks very large candidate sets.
Break-Even Analysis
A 24/7 A100 PCIe rental at $1.04/hr costs $748.80/month. At what monthly volume does that fixed cost break even against managed APIs?
- vs OpenAI text-embedding-3-small ($0.020/1M): break-even around 37B tokens/month
- vs Cohere embed-v4 ($0.100/1M): break-even around 7.5B tokens/month
- vs OpenAI text-embedding-3-large ($0.130/1M): break-even around 5.8B tokens/month
Below these thresholds, managed APIs are cheaper when you factor in operational overhead. Above them, self-hosting wins on cost and gives you model flexibility that managed APIs don't offer.
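The break-even volumes above fall out of dividing the fixed monthly rental by the managed per-million-token price; as a quick sketch:

```python
def break_even_tokens_billions(monthly_gpu_usd: float, api_price_per_1m_usd: float) -> float:
    """Monthly token volume (in billions) at which a 24/7 rental matches a managed API."""
    return monthly_gpu_usd / api_price_per_1m_usd / 1000  # 1M-token units -> billions


A100_MONTHLY = 1.04 * 24 * 30  # $748.80 for a 24/7 rental
print(round(break_even_tokens_billions(A100_MONTHLY, 0.020), 1))  # ~37.4B vs text-embedding-3-small
print(round(break_even_tokens_billions(A100_MONTHLY, 0.100), 1))  # ~7.5B vs embed-v4
print(round(break_even_tokens_billions(A100_MONTHLY, 0.130), 1))  # ~5.8B vs text-embedding-3-large
```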
Monitoring TEI in Production
TEI exposes a Prometheus /metrics endpoint out of the box. Key metrics to track:
- `te_request_duration_seconds`: Per-request latency histogram. Alert if p99 exceeds 100ms for online serving.
- `te_batch_size_histogram`: Distribution of batch sizes. If most batches are size 1, consider batching client requests to improve throughput.
- `te_queue_size`: Number of requests waiting. Alert if queue depth consistently exceeds 20; add capacity or reduce `--max-concurrent-requests` to trigger earlier backpressure.
Also watch GPU-level metrics:
- GPU utilization below 20% with queue depth above 5 signals underprovisioned batch sizes.
- GPU memory above 90% risks OOM on large batches. Reduce `--max-batch-tokens` if you see OOM errors in logs.
For a full GPU monitoring stack with Prometheus and Grafana dashboards, see the GPU monitoring for ML guide.
Serving Qwen3-Embedding-8B and Larger Models
The 8B variant of Qwen3-Embedding delivers the best MTEB scores in the family and handles very long documents well (up to 32,768 tokens). It requires 16GB+ VRAM for model weights plus working memory, so an A100 80GB or H100 is the right GPU.
```bash
docker run --gpus all --name tei-qwen3-8b \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id Qwen/Qwen3-Embedding-8B \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 256 \
  --dtype float16
```

Throughput will be lower than the 0.6B variant (roughly 10,000-15,000 tok/s on an A100 80GB), but quality is substantially higher for complex retrieval tasks. For LoRA multi-adapter serving on top of Jina v4, the same GPU sizing rules apply.
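Even with a 32,768-token window, real corpora contain longer documents, so a pre-chunking pass keeps individual requests under the model's limit. The 4-characters-per-token heuristic below is a rough assumption for English text; use the model's actual tokenizer for exact counts.

```python
def chunk_document(text: str, max_tokens: int = 32_768, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks that stay under an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        chunk = text[:max_chars]
        # Prefer to break on whitespace so a word isn't split mid-token.
        if len(text) > max_chars:
            cut = chunk.rfind(" ")
            if cut > 0:
                chunk = chunk[:cut]
        chunks.append(chunk)
        text = text[len(chunk):].lstrip()
    return chunks


doc = "word " * 100_000  # ~500k characters, well past one 32k-token window
chunks = chunk_document(doc)
print(len(chunks), all(len(c) <= 32_768 * 4 for c in chunks))
```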
Related Guides
For inference routing with embedding-based query classifiers, see the LLM inference router guide. For alternative serving engines, the Triton Inference Server deployment guide covers embedding model serving with TensorRT optimization. For full RAG pipeline context including vector search and LLM colocation, the agentic RAG GPU infrastructure guide covers the complete stack.
Embedding and reranker workloads run cheaply on mid-tier GPUs: an A100, or even an RTX 4090, covers most production RAG volumes without paying for H100-level compute. Run your own TEI stack on Spheron and cut managed embedding costs once you pass 100M+ tokens/month.
