If your RAG pipeline calls OpenAI's embedding API at 100M tokens a month, you're paying roughly $2 to $13 every month for that single component. At 1B tokens, that's $20 to $130. Self-hosting the same models using Hugging Face's Text Embeddings Inference (TEI) server on a GPU cloud instance can cut that cost by 5-20x depending on your utilization and which managed API you're replacing.
This guide covers the current embedding and reranker model landscape as of April 2026, two-stage retrieval architecture, Docker-based TEI deployment on Spheron GPUs, and a cost-per-million-token comparison against managed APIs.
Why Self-Host Embeddings in 2026
The math is straightforward. Here's what managed embedding APIs charge per 1M tokens as of April 2026:
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Voyage AI | voyage-3-lite | $0.020 |
At different monthly volumes, managed API costs add up quickly:
| Monthly tokens | OpenAI 3-small | OpenAI 3-large | Cohere embed-v4 |
|---|---|---|---|
| 10M | $0.20 | $1.30 | $1.00 |
| 100M | $2.00 | $13.00 | $10.00 |
| 1B | $20.00 | $130.00 | $100.00 |
Pricing based on public documentation as of 20 Apr 2026 and may have changed. Managed API pricing does not include network egress or vector store costs.
A self-hosted BGE-M3 or Qwen3-Embedding on an A100 80GB PCIe at $1.04/hr can process roughly 216M tokens per hour at a sustained 60,000 tokens/sec (60,000 tok/s x 3,600 s). At 50% utilization, that works out to about $0.0096 per 1M tokens. Even at 25% utilization, you're still at roughly $0.019 per 1M tokens, matching OpenAI's cheapest option while running a significantly better model.
Where the crossover sits depends on how you run the GPU. For batch indexing jobs, where you rent the instance only for the hours you use, self-hosting on an A100 starts winning at roughly 50-100M tokens per month, and sooner against the more expensive managed options. For an always-on server, see the break-even analysis later in this guide.
The Embedding Model Landscape: April 2026
Four model families are worth knowing about for production RAG use:
Qwen3-Embedding (0.6B, 4B, 8B variants): The current MTEB leader for size-to-performance ratio as of April 2026. The 0.6B variant runs on any GPU with 8GB+ VRAM and outperforms models twice its size on multilingual retrieval benchmarks. MTEB scores were evaluated at model release and may update as independent benchmarks appear.
Jina Embeddings v4: Notable for LoRA multi-adapter support, which lets you switch between retrieval, classification, and clustering task representations without loading separate model checkpoints. Supports 128k-token context (128,000 tokens).
BGE-M3: The workhorse for multilingual RAG. Supports dense retrieval, sparse (BM25-style), and multi-vector ColBERT-style representations from a single 0.6B model. Apache 2.0 licensed with no gating. Strong BEIR scores across 18 languages.
BGE-large-en-v1.5: A solid English-only baseline at 335M parameters. If you're English-only and not using Qwen3, this is a stable, well-documented choice.
| Model | Params | Max tokens | VRAM required | License |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 32,768 | ~2GB | Apache 2.0 |
| Qwen3-Embedding-4B | 4B | 32,768 | ~8GB | Apache 2.0 |
| Qwen3-Embedding-8B | 8B | 32,768 | ~16GB | Apache 2.0 |
| Jina Embeddings v4 | ~3.75B | 128,000 | ~8GB | Apache 2.0 |
| BGE-M3 | 0.6B | 8,192 | ~2.5GB | Apache 2.0 |
| BGE-large-en-v1.5 | 335M | 512 | ~1.5GB | MIT |
Note: VRAM figures are for model weights only. Add working memory and batch buffers in production.
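The weight-only VRAM figures in the table follow directly from parameter count times bytes per parameter. As a rough sketch (the overhead beyond weights varies by framework and batch size, so treat results as a floor, not a budget):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights alone: params x bytes (fp16/bf16 = 2 bytes)."""
    return params_billions * bytes_per_param


# BGE-M3 at 0.6B params in fp16: ~1.2GB of weights; the ~2.5GB table figure
# also covers framework overhead and activation buffers.
print(weight_vram_gb(0.6))  # 1.2
print(weight_vram_gb(8))    # 16.0 -- matches the ~16GB Qwen3-Embedding-8B row
```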
Two-Stage Retrieval Architecture
Single-stage ANN search retrieves fast but imprecisely. A cross-encoder reranker is far more accurate but too slow to score thousands of candidates. The two-stage approach combines both:
- Recall stage: Encode the query with your embedding model, run ANN search against your vector index, retrieve top-100 candidates. Fast, uses approximate similarity. Optimizes for recall.
- Rerank stage: Pass the query and top-100 (query, passage) pairs to a cross-encoder reranker, get relevance scores, return the top-5 or top-10. Slower per pair, but only runs on 100 candidates. Optimizes for precision.
The embedding recall stage is cheap and scales horizontally. The reranker is the bottleneck for latency but dramatically improves the quality of the final context passed to your LLM.
This two-stage pattern consistently outperforms single-stage retrieval on BEIR benchmarks by 5-15% NDCG@10. For the full RAG stack context, including GPU memory planning and LLM co-location, see the agentic RAG GPU infrastructure guide.
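To make the recall/rerank split concrete, here is a self-contained toy sketch: brute-force dot-product search stands in for the ANN recall stage, and an exact (more expensive) distance computation stands in for the cross-encoder. The vectors are random, not real model outputs, and the scorers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(1000, 64)).astype(np.float32)  # 1,000 "passages"
# A query that is a slightly perturbed copy of passage 42.
query_vec = corpus_vecs[42] + 0.01 * rng.normal(size=64).astype(np.float32)

# Stage 1 (recall): cheap approximate similarity over the full corpus, top-100.
recall_scores = corpus_vecs @ query_vec
top100 = np.argsort(-recall_scores)[:100]

# Stage 2 (precision): a more expensive exact scorer over only 100 candidates.
# A real pipeline would call a cross-encoder here; negative L2 distance is the stand-in.
rerank_scores = -np.linalg.norm(corpus_vecs[top100] - query_vec, axis=1)
top10 = top100[np.argsort(-rerank_scores)[:10]]

print(int(top10[0]))  # passage 42 should rank first
```

The shape of the computation is the point: the full corpus is only ever touched by the cheap scorer, and the expensive scorer sees a fixed small candidate set.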
The Reranker Landscape: April 2026
| Model | Params | Languages | VRAM required |
|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | Multilingual | ~2GB |
| Qwen3-Reranker-4B | 4B | Multilingual | ~8GB |
| Jina Reranker v2 | 278M | Multilingual | ~1.5GB |
| BGE Reranker v2-m3 | 568M | Multilingual | ~2.5GB |
| ms-marco-MiniLM-L-12-v2 | 33M | English only | ~0.5GB |
For most production RAG workloads, BGE Reranker v2-m3 is the safe default: solid BEIR scores, multilingual, and well-tested in TEI. Qwen3-Reranker-0.6B outperforms it on instruction-following scenarios where you want to give the reranker explicit guidance about what "relevance" means. ms-marco-MiniLM-L-12-v2 is the right call when latency is the top priority and English is sufficient.
Deploying TEI on Spheron: Step-by-Step
Instance Setup
For production embedding workloads, the A100 80GB PCIe is the right choice: ~1.94TB/s memory bandwidth, 80GB VRAM for large batch caching, and a low per-hour cost. For dev and testing, the RTX 4090 works well and costs less.
GPU pricing on Spheron as of 20 Apr 2026:
| GPU | On-demand price |
|---|---|
| A100 80GB PCIe | from $1.04/hr |
| H100 PCIe | from $2.01/hr |
| RTX 4090 | from $0.79/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Provision via app.spheron.ai, SSH into the instance, and verify the GPU. For a step-by-step walkthrough of provisioning your first instance, see the Spheron getting started guide.
```bash
nvidia-smi
```

Then install Docker if not already available:
```bash
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
```

Docker: Embedding Model
```bash
docker run --gpus all --name tei-embed \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-m3 \
  --max-batch-tokens 65536 \
  --max-concurrent-requests 512 \
  --dtype float16
```

Pin the image to a specific version tag (1.9) for reproducibility. The `latest` tag changes without notice and can break deployments.
Key flags:
- `--max-batch-tokens`: Total tokens across all requests in one batch. Set to 65536 for online serving; go higher (131072+) for offline indexing.
- `--max-concurrent-requests`: Queue depth before requests are rejected. 512 is a reasonable ceiling for a single A100.
- `--dtype float16`: Full float16 precision on Ampere and Hopper. Use `bfloat16` if you see NaN issues.
- `--tokenization-workers`: Defaults to the number of CPU cores. Reduce if CPU is the bottleneck.
Test the endpoint:
```bash
curl http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["What is GPU cloud?", "How does RAG work?"]}'
```

Docker: Cross-Encoder Reranker
```bash
docker run --gpus all --name tei-rerank \
  -p 8081:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id BAAI/bge-reranker-v2-m3 \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 128 \
  --dtype float16
```

The reranker accepts (query, passage) pairs and returns relevance scores. Call `/rerank`, not `/embed`:
```bash
curl http://localhost:8081/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does two-stage retrieval work?",
    "texts": [
      "Two-stage retrieval uses a fast embedding recall followed by a cross-encoder reranker.",
      "GPU cloud providers offer H100 and A100 instances.",
      "Vector databases store dense embeddings for ANN search."
    ]
  }'
```

Jina v4 Deployment with Task Prefix
Jina Embeddings v4 uses task prefixes to switch between representations. The `task` query parameter tells the model which head to use:
```bash
docker run --gpus all --name tei-jina \
  -p 8082:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id jinaai/jina-embeddings-v4 \
  --max-batch-tokens 32768 \
  --dtype float16
```

Call with task-specific parameters:
```bash
# For query embedding
curl "http://localhost:8082/embed?task=retrieval.query" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["my search query"]}'

# For passage/document embedding
curl "http://localhost:8082/embed?task=retrieval.passage" \
  -H "Content-Type: application/json" \
  -d '{"inputs": ["document text to index"]}'
```

Example Python Client
```python
import httpx
import numpy as np


def embed(texts: list[str], url: str = "http://localhost:8080") -> list[list[float]]:
    r = httpx.post(f"{url}/embed", json={"inputs": texts})
    r.raise_for_status()
    return r.json()


def rerank(query: str, passages: list[str], url: str = "http://localhost:8081") -> list[dict]:
    r = httpx.post(f"{url}/rerank", json={"query": query, "texts": passages})
    r.raise_for_status()
    return r.json()


# Two-stage retrieval
def retrieve_and_rerank(
    query: str,
    vector_index,  # your FAISS or other ANN index
    corpus: list[str],
    top_k_recall: int = 100,
    top_k_final: int = 10,
) -> list[str]:
    # Stage 1: embedding recall
    query_vec = embed([query])[0]
    query_arr = np.array([query_vec], dtype=np.float32)
    _, indices = vector_index.search(query_arr, top_k_recall)
    candidates = [corpus[i] for i in indices[0] if i >= 0]
    # Stage 2: reranker precision
    # TEI /rerank returns results sorted by score descending, each with an `index`
    # field pointing to the original position in the candidates list.
    scores = rerank(query, candidates)
    return [candidates[s["index"]] for s in scores[:top_k_final]]
```

Throughput Benchmarks
These numbers are approximate, based on TEI 1.9 with Flash Attention 2 enabled at batch size 512. Actual throughput depends on sequence length, hardware memory bandwidth, and concurrent request patterns.
Embedding Throughput (tokens/sec, batch=512)
| Model | GPU | Approximate tokens/sec |
|---|---|---|
| BGE-M3 | A100 80GB PCIe | ~60,000 |
| BGE-M3 | H100 PCIe | ~90,000 |
| Qwen3-Embedding-0.6B | A100 80GB PCIe | ~55,000 |
| Qwen3-Embedding-0.6B | RTX 4090 | ~20,000 |
Reranker Throughput (pairs/sec, batch=64)
| Model | GPU | Approximate pairs/sec |
|---|---|---|
| BGE Reranker v2-m3 | A100 80GB PCIe | ~800 |
| Jina Reranker v2 | A100 80GB PCIe | ~600 |
| Qwen3-Reranker-0.6B | A100 80GB PCIe | ~700 |
Latency (p50, online serving at batch=1)
| Operation | Approximate latency |
|---|---|
| Single query embed | 5-15ms |
| Single (query, passage) rerank pair | 15-30ms |
| Full two-stage pipeline (embed + top-100 ANN + rerank-to-10) | 30-80ms |
For latency-sensitive production workloads, run the embedding model on the same GPU server as your vector index and LLM to eliminate network round trips. See the agentic RAG GPU infrastructure guide for colocation patterns.
Cost Per 1M Tokens: Self-Hosted vs Managed
Cost Model for Self-Hosted TEI
At 60,000 tok/s on an A100 PCIe at $1.04/hr:
- 60,000 tok/s x 3,600 s = 216M tokens/hr
- Cost at 100% utilization: $1.04 / 216 = $0.0048 per 1M tokens
- Cost at 50% utilization (realistic for inference serving): ~$0.0096 per 1M tokens
- Cost at 25% utilization: ~$0.019 per 1M tokens
On an RTX 4090 at $0.79/hr and 20,000 tok/s:
- 20,000 tok/s x 3,600 s = 72M tokens/hr
- Cost at 50% utilization: ~$0.022 per 1M tokens
These figures assume embedding-only GPU utilization. If you run an LLM on the same instance alongside the embedding server, the effective embedding cost per token is lower since the GPU cost is shared.
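The arithmetic above generalizes to any GPU, throughput, and utilization level. A small helper (a sketch; throughput figures remain the estimates from the benchmark tables) reproduces the per-million-token numbers:

```python
def cost_per_million_tokens(
    hourly_rate_usd: float,
    tokens_per_sec: float,
    utilization: float,
) -> float:
    """USD per 1M tokens for a self-hosted embedding server."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_rate_usd / (tokens_per_hour / 1_000_000)


# A100 80GB PCIe at $1.04/hr, 60k tok/s, 50% utilization: ~$0.0096 per 1M tokens
print(round(cost_per_million_tokens(1.04, 60_000, 0.5), 4))
# RTX 4090 at $0.79/hr, 20k tok/s, 50% utilization: ~$0.0219 per 1M tokens
print(round(cost_per_million_tokens(0.79, 20_000, 0.5), 4))
```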
Comparison Table
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.020 |
| OpenAI | text-embedding-3-large | $0.130 |
| Cohere | embed-v4 | $0.100 |
| Voyage AI | voyage-3 | $0.060 |
| Self-hosted A100 PCIe, 50% util | BGE-M3 or Qwen3-Embedding | ~$0.010 |
| Self-hosted RTX 4090, 50% util | BGE-M3 | ~$0.022 |
Pricing fluctuates based on GPU availability. The prices above are based on 20 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Note: reranking adds a separate GPU cost not included in the embedding figures above. At 800 pairs/sec on an A100 PCIe, the reranker cost is typically small relative to embedding costs unless your pipeline reranks very large candidate sets.
Break-Even Analysis
A 24/7 A100 PCIe rental at $1.04/hr costs $748.80/month. At what monthly volume does that fixed cost break even against managed APIs?
- vs OpenAI text-embedding-3-small ($0.020/1M): break-even around 37B tokens/month
- vs Cohere embed-v4 ($0.100/1M): break-even around 7.5B tokens/month
- vs OpenAI text-embedding-3-large ($0.130/1M): break-even around 5.8B tokens/month
Below these thresholds, managed APIs are cheaper when you factor in operational overhead. Above them, self-hosting wins on cost and gives you model flexibility that managed APIs don't offer.
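The break-even volumes above fall out of dividing the fixed monthly rental by the managed per-million-token price; as a quick sketch:

```python
def break_even_tokens_billions(monthly_gpu_usd: float, api_price_per_1m_usd: float) -> float:
    """Monthly token volume (in billions) at which a 24/7 rental matches a managed API."""
    return monthly_gpu_usd / api_price_per_1m_usd / 1000  # 1M-token units -> billions


A100_MONTHLY = 1.04 * 24 * 30  # $748.80 for a 24/7 rental
print(round(break_even_tokens_billions(A100_MONTHLY, 0.020), 1))  # ~37.4B vs text-embedding-3-small
print(round(break_even_tokens_billions(A100_MONTHLY, 0.100), 1))  # ~7.5B vs embed-v4
print(round(break_even_tokens_billions(A100_MONTHLY, 0.130), 1))  # ~5.8B vs text-embedding-3-large
```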
Monitoring TEI in Production
TEI exposes a Prometheus /metrics endpoint out of the box. Key metrics to track:
- `te_request_duration_seconds`: Per-request latency histogram. Alert if p99 exceeds 100ms for online serving.
- `te_batch_size_histogram`: Distribution of batch sizes. If most batches are size 1, consider batching client requests to improve throughput.
- `te_queue_size`: Number of requests waiting. Alert if queue depth consistently exceeds 20; add capacity or reduce `--max-concurrent-requests` to trigger earlier backpressure.
Also watch GPU-level metrics:
- GPU utilization below 20% with queue depth above 5 signals underprovisioned batch sizes.
- GPU memory above 90% risks OOM on large batches. Reduce `--max-batch-tokens` if you see OOM errors in logs.
For a full GPU monitoring stack with Prometheus and Grafana dashboards, see the GPU monitoring for ML guide.
Serving Qwen3-Embedding-8B and Larger Models
The 8B variant of Qwen3-Embedding delivers the best MTEB scores in the family and handles very long documents well (up to 32,768 tokens). It requires 16GB+ VRAM for model weights plus working memory, so an A100 80GB or H100 is the right GPU.
```bash
docker run --gpus all --name tei-qwen3-8b \
  -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:1.9 \
  --model-id Qwen/Qwen3-Embedding-8B \
  --max-batch-tokens 32768 \
  --max-concurrent-requests 256 \
  --dtype float16
```

Throughput will be lower than the 0.6B variant (roughly 10,000-15,000 tok/s on an A100 80GB), but quality is substantially higher for complex retrieval tasks. For LoRA multi-adapter serving on top of Jina v4, the same GPU sizing rules apply.
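Even with a 32,768-token window, real corpora contain longer documents, so a pre-chunking pass keeps individual requests under the model's limit. The 4-characters-per-token heuristic below is a rough assumption for English text; use the model's actual tokenizer for exact counts.

```python
def chunk_document(text: str, max_tokens: int = 32_768, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks that stay under an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        chunk = text[:max_chars]
        # Prefer to break on whitespace so a word isn't split mid-token.
        if len(text) > max_chars:
            cut = chunk.rfind(" ")
            if cut > 0:
                chunk = chunk[:cut]
        chunks.append(chunk)
        text = text[len(chunk):].lstrip()
    return chunks


doc = "word " * 100_000  # ~500k characters, well past one 32k-token window
chunks = chunk_document(doc)
print(len(chunks), all(len(c) <= 32_768 * 4 for c in chunks))
```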
Related Guides
For inference routing with embedding-based query classifiers, see the LLM inference router guide. For alternative serving engines, the Triton Inference Server deployment guide covers embedding model serving with TensorRT optimization. For full RAG pipeline context including vector search and LLM colocation, the agentic RAG GPU infrastructure guide covers the complete stack.
Embedding and reranker workloads run cheaply on mid-tier GPUs: an A100, or even an RTX 4090, covers most production RAG volumes without paying for H100-level compute. Run your own TEI stack on Spheron and cut managed embedding costs once you pass 100M+ tokens/month.
