Multimodal Embeddings on GPU Cloud: SigLIP-2, JinaCLIP-v2, and Cohere Embed-v4 (2026)

At 500M image-text pairs per month, Cohere Embed-v4's managed API costs roughly $50,000. A100 80GB instances on Spheron at $1.19/hr spot pricing process the same volume for roughly $9,000-10,000 (about 7,700 GPU-hours, or roughly 11 GPUs running in parallel to finish within the month). That gap is why teams building product search, visual knowledge bases, and mixed-media RAG pipelines are moving multimodal embedding inference in-house.

This guide compares five multimodal embedding models (SigLIP-2, JinaCLIP-v2, Cohere Embed-v4, Voyage-multimodal-3, and NV-Embed-Multimodal) and covers production deployment for the ones you can self-host. You'll get VRAM sizing tables with live-fetched Spheron pricing, a complete Infinity-Embedding setup, vector database integration patterns, and a cost-per-million benchmark you can use to find your own crossover point.

For text-only embedding deployment, the TEI self-hosting guide covers the text-side stack with BGE-M3, Qwen3-Embedding, and cross-encoder rerankers.

Why Text-Only Embeddings Break for Visual Knowledge Bases

A text embedding model encodes the words you feed it. That sounds obvious, but it becomes a real problem once your document corpus includes anything that does not translate cleanly to text.

Three failure modes come up repeatedly in production:

Charts and graphs. A revenue chart has axes, legend labels, and data points. When you flatten it to text with OCR or a caption model, you get something like "Q1 Q2 Q3 Q4 Revenue Growth YoY". The structural relationships that make the chart meaningful (which bar is tallest, what the trend is) are gone. A multimodal embedding model sees the image directly and preserves that structure in the vector.

Screenshot-heavy UIs. Product documentation that ships as screenshots has no parsable text at all for most of its information surface. Error dialogs, UI state captures, and configuration screenshots are all noise if you rely on text extraction. A CLIP-family model can retrieve the screenshot of the error dialog when the user asks about that error.

Product catalog images. A red mesh running shoe has shape, material texture, color, and silhouette. The text description might say "athletic shoe, red" and miss the visual details that distinguish it from ten other red shoes. Multimodal embeddings capture those visual features directly.

Content type	Text embedding score	Multimodal embedding score	Gap
Charts and data visualizations	Low (axis text only)	High (structure preserved)	Large
UI screenshots and error dialogs	Very low (no OCR text)	Medium-high	Very large
Product images with visual detail	Low-medium (metadata only)	High	Large
Scanned document text pages	Medium (OCR dependent)	Medium	Small
Text-heavy slides with bullet points	High	High	Minimal

The gap matters most when your corpus is more than 20-30% non-text content. Below that threshold, text-only embeddings with good OCR preprocessing are usually adequate.

The 2026 Multimodal Embedding Landscape

Five models are worth evaluating for production vision-text RAG as of June 2026:

Model	Architecture	Output dim	Max image resolution	Open weights	Serving recommendation
SigLIP-2 ViT-SO400M	SigLIP dual-encoder (ViT-SO400M + text)	1152	512x512	Yes (Apache 2.0)	Infinity-Embedding
JinaCLIP-v2	CLIP dual-encoder (ViT-L/14 + Jina text)	1024 (Matryoshka: 64-1024)	224x224	Yes (Apache 2.0)	Infinity-Embedding
Cohere Embed-v4	Proprietary	1024	Variable	Commercial license tier	vLLM (self-hosted) or API
Voyage-multimodal-3	Proprietary dual-encoder	1024	Variable	No (API only)	Managed API only
NV-Embed-Multimodal	NVIDIA ViT + text encoder	4096	336x336	Yes (research license)	Triton or TGI

Leaderboard positions referenced in this section (MTEB-Multimodal and Vidore-v2) are as of June 2026 and shift as new model versions are submitted.

SigLIP-2 ViT-SO400M is currently the strongest open model for pure image-text similarity tasks. Google trained it with a sigmoid loss function rather than the standard InfoNCE contrastive loss, which produces better-calibrated similarity scores and stronger performance on fine-grained image retrieval benchmarks. The SO400M variant (400M parameter ViT-SO backbone) requires more VRAM than smaller SigLIP-2 variants but delivers meaningfully better retrieval accuracy on visual content.

JinaCLIP-v2 adds two things SigLIP-2 lacks: multilingual text support across 89 languages and Matryoshka Representation Learning. The Matryoshka option lets you shrink embeddings from 1024 to 64 dimensions at serving time with modest accuracy loss, which matters when you are storing millions of embeddings and every byte counts. Use JinaCLIP-v2 when your corpus has non-English text alongside images.
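To make the storage trade-off concrete, here is a minimal sketch of client-side Matryoshka truncation: keep the first k dimensions, then rescale to unit length so cosine similarity still behaves. The function names and the float32 storage assumption (4 bytes per value) are illustrative choices, not part of JinaCLIP-v2's API, and the input vector is a stand-in for a real embedding.

```python
import math

def truncate_matryoshka(vec, dim):
    """Keep the first `dim` values and rescale the result to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in GB (10^9 bytes), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

# Stand-in for a real 1024-dim embedding (hypothetical values)
full = [0.1 * ((i % 7) - 3) for i in range(1024)]
small = truncate_matryoshka(full, 64)

print(len(small))
print(storage_gb(10_000_000, 1024), storage_gb(10_000_000, 64))
```

Under these assumptions, ten million vectors shrink by a factor of 16 in raw storage when cut from 1024 to 64 dimensions; the accuracy cost should be measured on your own corpus.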

Cohere Embed-v4 is the strongest performer on MTEB-Multimodal benchmarks as of Q2 2026, but the self-hostable weights tier requires a commercial license agreement with Cohere separate from the API terms. Do not assume the weights are freely redistributable. The API mode is the path of least resistance for low-volume workloads; self-hosting makes sense above 200-300M tokens per month where the cost math clearly favors it.

Voyage-multimodal-3 from Voyage AI is API-only. Voyage does not publish self-hostable weights for this model. It appears in this comparison for teams evaluating managed API options but the deployment walkthrough sections below do not cover it.

NV-Embed-Multimodal from NVIDIA Research delivers the highest output dimensionality (4096-dim) and is designed for retrieval tasks where maximum discriminability matters. The research license restricts commercial use; confirm terms before deploying in production.

For ColPali-style late-interaction document retrieval (PDFs, slides) where you need patch-level visual matching rather than single-vector dense search, see ColPali on GPU Cloud as an alternative to the dense multimodal embeddings covered here.

Hardware Sizing

The following table uses live-fetched pricing from the Spheron API as of June 2, 2026.

GPU	VRAM	SigLIP-2 batch size	JinaCLIP-v2 batch size	Throughput (img-text pairs/hr)	On-demand $/hr	Spot $/hr
L40S	48GB	128	256	~40,000-60,000	$0.96	N/A
A100 80GB PCIe	80GB	256	512	~55,000-75,000	$1.48	$1.19
H100 SXM5	80GB	256	512	~75,000-95,000	$3.92	$2.91
B200 SXM6	192GB	512	1024	~120,000-160,000	$7.35	$2.68

Throughput figures are at FP16 precision, batch size as listed, with Infinity-Embedding and CUDA graph optimization enabled. Image preprocessing (resize, normalize) runs on-GPU via the model's AutoProcessor. Exact throughput varies based on image resolution, JPEG decode overhead, and whether text and image batches run concurrently; treat these as estimates for planning purposes.

Pricing fluctuates based on GPU availability. The prices above are based on 02 Jun 2026 and may have changed. Check current GPU pricing for live rates.

L40S 48GB is the production sweet spot for embedding-only deployments. 48GB GDDR6 provides enough VRAM to run multiple model workers or a model plus a co-located reranker. At $0.96/hr, it offers a strong cost-per-pair ratio for mid-volume workloads.

A100 80GB PCIe is the best pick for pure embedding throughput per dollar. The A100's HBM2e memory bandwidth (1.94 TB/s) handles large batches efficiently, and at the $1.19/hr spot price its higher throughput makes it slightly cheaper per pair than the L40S for high-utilization workloads, even though the hourly rate is higher.

H100 SXM5 makes sense when you are co-locating the embedding server with a 7B-14B LLM (see LLM serving optimization guide) in the same RAG stack, where the combined VRAM budget requires 80GB. The H100's 3.35 TB/s HBM3 bandwidth delivers noticeably higher throughput than the A100 on large batch sizes but is overkill for embedding-only workloads.

B200 SXM6 is for teams running very high-volume indexing jobs (10M+ pairs per day) or co-locating embedding plus a 70B+ LLM. 192GB HBM3e handles multiple concurrent model replicas without VRAM pressure. Spot pricing at $2.68/hr makes it cost-competitive with H100 on-demand for sustained workloads.
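One way to compare the rows above is cost per 1M pairs rather than hourly price. The sketch below does that arithmetic using the table's prices and the midpoint of each throughput range; taking the midpoint is an assumption made here for illustration, and the helper names are hypothetical.

```python
# (GPU, low pairs/hr, high pairs/hr, on-demand $/hr, spot $/hr or None)
GPUS = [
    ("L40S", 40_000, 60_000, 0.96, None),
    ("A100 80GB PCIe", 55_000, 75_000, 1.48, 1.19),
    ("H100 SXM5", 75_000, 95_000, 3.92, 2.91),
    ("B200 SXM6", 120_000, 160_000, 7.35, 2.68),
]

def cost_per_million(price_per_hr, pairs_per_hr):
    """Dollars to embed 1M pairs at the given hourly price and throughput."""
    return price_per_hr / pairs_per_hr * 1_000_000

def rank_by_cost(gpus, use_spot=False):
    """Return (name, $/1M pairs) sorted cheapest first, using midpoint throughput."""
    rows = []
    for name, low, high, on_demand, spot in gpus:
        # Fall back to on-demand pricing when no spot price is listed
        price = spot if (use_spot and spot is not None) else on_demand
        midpoint = (low + high) / 2
        rows.append((name, round(cost_per_million(price, midpoint), 2)))
    return sorted(rows, key=lambda row: row[1])

for name, cost in rank_by_cost(GPUS, use_spot=True):
    print(f"{name}: ${cost:.2f} per 1M pairs")
```

Swap in your own measured throughput before drawing conclusions; the ranking is sensitive to where in each range your workload actually lands.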

Production Deployment

Infinity-Embedding Setup

Infinity-Embedding (github.com/michaelfeil/infinity) is the recommended serving layer for CLIP-family models. It ships with OpenAI-compatible /embeddings, /v1/embeddings, and model-specific /encode_image and /encode_text endpoints, and handles image preprocessing (resize, normalize) automatically using each model's registered AutoProcessor.

Deploy JinaCLIP-v2:

bash

docker run --gpus all \
  -p 7997:7997 \
  -e INFINITY_LOG_LEVEL=INFO \
  -e INFINITY_BATCH_SIZE=128 \
  -e INFINITY_DEVICE=cuda \
  michaelfeil/infinity:latest v2 \
  --model-id jinaai/jina-clip-v2 \
  --served-model-name jina-clip-v2 \
  --batch-size 128 \
  --device cuda

Deploy SigLIP-2 ViT-SO400M:

bash

docker run --gpus all \
  -p 7997:7997 \
  -e INFINITY_LOG_LEVEL=INFO \
  -e INFINITY_BATCH_SIZE=128 \
  -e INFINITY_DEVICE=cuda \
  michaelfeil/infinity:latest v2 \
  --model-id google/siglip2-so400m-patch16-512 \
  --served-model-name siglip2 \
  --batch-size 128 \
  --device cuda

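The server downloads model weights on first run. On an A100 80GB, model load takes 30-60 seconds for SigLIP-2 SO400M. To keep the download out of the startup path, run the container once with --preload-only, which fetches the weights, verifies the setup, and then exits; doing this at image build time means production containers only pay the load time. First-request latency is handled separately by the warm-up request described under Latency Tuning.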

Embed an image via base64:

bash

IMAGE_B64=$(base64 -w 0 /path/to/your/image.jpg)

curl -s http://localhost:7997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"jina-clip-v2\",
    \"input\": [\"data:image/jpeg;base64,${IMAGE_B64}\"]
  }" | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['data'][0]['embedding']), 'dims')"

Always use base64-encoded images in production API calls rather than image URLs. URL-based inputs add latency proportional to the image download time and introduce a network dependency that can cause timeout failures under load.
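For callers that are not shell scripts, the same request can be built in Python. This is a minimal standard-library sketch that mirrors the curl call above; the function names are hypothetical, the URL and model name are the ones used in this guide, and the optional modality field is an assumption to verify against your server's API docs, since some Infinity-Embedding versions use it to distinguish image inputs from text.

```python
import base64
import json
import urllib.request

def build_image_request(image_bytes, model, mime="image/jpeg", modality=None):
    """Return the JSON body for one base64 data-URI image input."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    body = {"model": model, "input": [f"data:{mime};base64,{b64}"]}
    if modality is not None:
        # Assumption: only send this if your server version documents the field
        body["modality"] = modality
    return body

def embed_image(path, url="http://localhost:7997/v1/embeddings", model="jina-clip-v2"):
    """POST one image file and return its embedding as a list of floats."""
    with open(path, "rb") as f:
        body = build_image_request(f.read(), model)
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"][0]["embedding"]

# Building the payload needs no running server
print(build_image_request(b"abc", "jina-clip-v2")["input"][0])
```

Keeping payload construction separate from the HTTP call makes the encoding step easy to unit test without a GPU server running.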

Embed text:

bash

curl -s http://localhost:7997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jina-clip-v2",
    "input": ["a photo of a red running shoe"]
  }'

Cohere Embed-v4 Deployment

Cohere Embed-v4 does not publish weights on Hugging Face for general use. The self-hosted weights tier requires a direct commercial agreement with Cohere, and there is no publicly documented vLLM model registration for cohere/embed-v4 accessible without Cohere customer credentials.

For most users, Cohere Embed-v4 is API-only. Use the Cohere Python client (cohere.Client) to access it via the managed API. Contact Cohere directly if you need self-hosted weights for a production deployment. The API is the correct path for low-to-medium volume workloads; at high volumes (200M+ tokens/month), evaluate whether Cohere's enterprise self-hosting terms make economic sense for your use case.

For the generation step in a full Cohere-native RAG pipeline, Command A on GPU Cloud is the natural pairing: it offers 111B parameters, 256K context, and native compatibility with Cohere's embed and rerank APIs, all running on your own GPU infrastructure.

Image Preprocessing Pipeline

Each model has specific preprocessing requirements. Using the wrong normalization is one of the most common causes of low retrieval accuracy when switching between models.

Model	Normalization mean	Normalization std	Input resolution
SigLIP-2 (all variants)	[0.5, 0.5, 0.5]	[0.5, 0.5, 0.5]	512x512
JinaCLIP-v2	[0.485, 0.456, 0.406]	[0.229, 0.224, 0.225]	224x224
NV-Embed-Multimodal	[0.48145466, 0.4578275, 0.40821073]	[0.26862954, 0.26130258, 0.27577711]	336x336

When using Infinity-Embedding via the API, the server handles preprocessing automatically. When embedding directly via Hugging Face Transformers, always load the processor from the model's repo:

python

from transformers import AutoProcessor, AutoModel
import torch
from PIL import Image

model_id = "google/siglip2-so400m-patch16-512"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

image = Image.open("product.jpg").convert("RGB")
text = "a red running shoe"

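inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding="max_length",
    max_length=64,  # SigLIP-2 expects text padded to 64 tokens
)
# Move tensors to the GPU. Cast only floating-point tensors (pixel_values) to FP16
# so they match the model weights; integer token IDs keep their dtype.
inputs = {
    k: v.to("cuda", dtype=torch.float16) if v.is_floating_point() else v.to("cuda")
    for k, v in inputs.items()
}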

with torch.no_grad():
    outputs = model(**inputs)

image_emb = outputs.image_embeds  # shape: [1, 1152]
text_emb = outputs.text_embeds    # shape: [1, 1152]

Never call transforms.ToTensor() and manual normalization on CLIP-family models unless you have read the model card and confirmed the exact values. The normalization statistics differ between SigLIP-2, standard CLIP (OpenAI), and OpenCLIP variants.
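To see why a normalization mismatch matters, the sketch below applies the two sets of statistics from the table above to the same pixel. The helper names are hypothetical and the formula is the standard (x - mean) / std applied after scaling 8-bit values to [0, 1]; the point is only that identical pixels land at very different input values.

```python
def normalize_pixel(value_0_255, mean, std):
    """Scale an 8-bit channel value to [0, 1], then apply (x - mean) / std."""
    return (value_0_255 / 255.0 - mean) / std

# Statistics copied from the preprocessing table
SIGLIP2 = {"mean": [0.5, 0.5, 0.5], "std": [0.5, 0.5, 0.5]}
IMAGENET = {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]}

def normalize_rgb(rgb, stats):
    """Normalize one (R, G, B) pixel with per-channel mean and std."""
    return [
        normalize_pixel(value, mean, std)
        for value, mean, std in zip(rgb, stats["mean"], stats["std"])
    ]

white = (255, 255, 255)
print(normalize_rgb(white, SIGLIP2))
print(normalize_rgb(white, IMAGENET))
```

With the SigLIP-2 statistics every channel stays within [-1, 1], while the ImageNet statistics push a white pixel well above 2, which is outside the range a SigLIP-2 model saw during training.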

Integrating with Vector Databases

Qdrant Collection Setup

Store image and text embeddings in one collection and tag each point with a modality payload field so you can filter at query time:

python

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="product_catalog",
    vectors_config=VectorParams(
        size=1024,       # 1024 for JinaCLIP-v2, 1152 for SigLIP-2 SO400M
        distance=Distance.COSINE,
    ),
)

# Index an image embedding
# Qdrant IDs must be unsigned 64-bit integers or UUID-format strings
client.upsert(
    collection_name="product_catalog",
    points=[
        PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_DNS, "product_001_image")),
            vector=image_embedding,
            payload={"modality": "image", "product_id": "001", "category": "footwear"},
        ),
    ],
)

# Index a text embedding from the product description
client.upsert(
    collection_name="product_catalog",
    points=[
        PointStruct(
            id=str(uuid.uuid5(uuid.NAMESPACE_DNS, "product_001_text")),
            vector=text_embedding,
            payload={"modality": "text", "product_id": "001", "category": "footwear"},
        ),
    ],
)

For cross-modal retrieval (text query against image embeddings), use a payload filter:

python

from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="product_catalog",
    query_vector=text_query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="modality", match=MatchValue(value="image"))]
    ),
    limit=10,
)

Milvus Mixed-Modality Schema

python

from pymilvus import CollectionSchema, FieldSchema, DataType, Collection, connections

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="modality", dtype=DataType.VARCHAR, max_length=16),
    FieldSchema(name="source_id", dtype=DataType.VARCHAR, max_length=256),
]

schema = CollectionSchema(fields=fields, description="multimodal product embeddings")
collection = Collection("product_catalog", schema)

index_params = {
    "metric_type": "COSINE",
    "index_type": "GPU_CAGRA",
    "params": {"intermediate_graph_degree": 64, "graph_degree": 32},
}
collection.create_index("vector", index_params)

Note on the modality gap: CLIP-family image and text embeddings are not in perfectly aligned distributions. Image embeddings and text embeddings tend to cluster in different regions of the unit sphere. This means raw cosine similarity between an image embedding and a text embedding is numerically lower than text-text similarity for equivalent semantic content. Normalize your embeddings to unit norm before storing, and benchmark retrieval accuracy with your specific corpus rather than assuming textbook cosine similarity thresholds apply directly.
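A quick way to check whether the gap is large on your own data is to compare the centroids of the two modalities after unit normalization. The sketch below uses the distance between centroids as the measure, which is one common choice rather than the only one; the function names and the toy 3-dimensional vectors are illustrative.

```python
import math

def l2_normalize(vec):
    """Rescale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def modality_gap(image_vecs, text_vecs):
    """Euclidean distance between centroids of unit-normalized image and text embeddings."""
    image_centroid = centroid([l2_normalize(v) for v in image_vecs])
    text_centroid = centroid([l2_normalize(v) for v in text_vecs])
    return math.dist(image_centroid, text_centroid)

# Toy vectors standing in for real embeddings
images = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
texts = [[0.0, 1.0, 0.0], [0.0, 5.0, 0.0]]
print(modality_gap(images, texts))
```

Running this on a few hundred matched image-text pairs from your corpus gives a baseline you can track when changing models or fine-tuning.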

Hybrid Retrieval: BM25 Sparse + Dense Multimodal

For corpora that mix rich text documents with images, Qdrant's sparse-dense hybrid search combines BM25 keyword scoring with dense multimodal similarity in one query. This catches cases where exact keyword matches in text fields should override visual similarity scores.

python

from qdrant_client.models import (
    Distance, Fusion, FusionQuery, Prefetch, SparseVector, SparseVectorParams, VectorParams,
)

# Hybrid search requires a separately configured collection with named dense + sparse vectors
client.create_collection(
    collection_name="product_catalog_hybrid",
    vectors_config={"multimodal-dense": VectorParams(size=1024, distance=Distance.COSINE)},
    sparse_vectors_config={"text-sparse": SparseVectorParams()},
)

# Assume dense_query_vector is a list of floats and bm25_query_vector is a
# SparseVector(indices=[...], values=[...]) produced by your BM25 encoder
results = client.query_points(
    collection_name="product_catalog_hybrid",
    prefetch=[
        # Each prefetch names the vector it searches via `using`
        Prefetch(query=bm25_query_vector, using="text-sparse", limit=50),
        Prefetch(query=dense_query_vector, using="multimodal-dense", limit=50),
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
    limit=10,
)
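For intuition about what Reciprocal Rank Fusion does with the two candidate lists, here is a small pure-Python sketch. It scores each ID as the sum of 1 / (k + rank) across lists; k = 60 is the constant commonly used in the literature and is an assumption here, so Qdrant's internal behavior may differ in detail. The document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(id) = sum over lists of 1 / (k + rank), ranks start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

sparse_hits = ["shoe_17", "shoe_03", "shoe_42"]  # hypothetical BM25 results
dense_hits = ["shoe_03", "shoe_99", "shoe_17"]   # hypothetical dense results
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
```

Because RRF only looks at ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.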

For full Qdrant, Milvus, and Weaviate deployment on GPU cloud, see the vector database self-hosting guide.

For an agentic RAG architecture where multimodal embeddings feed a tool-calling agent, see the agentic RAG infrastructure guide.

Cost Per Million Embeddings

The following comparison uses live Spheron pricing fetched on June 2, 2026 for GPU rows, and public API documentation pricing for managed API rows.

Approach	Throughput (pairs/hr)	GPU cost/hr	Cost per 1M pairs	Break-even (pairs/month)
Cohere Embed-v4 API	N/A (API-bound)	N/A	$100 (at ~300 tokens/pair avg)	N/A
OpenAI text-embedding-3-large + CLIP API	N/A	N/A	$130 (text) + ~$15 (CLIP)	N/A
SigLIP-2 on L40S	~50,000	$0.96	~$19.20 at 100% utilization	~7M pairs/month
JinaCLIP-v2 on A100 80GB PCIe	~65,000	$1.48	~$22.80 at 100% utilization	~11M pairs/month
SigLIP-2 on H100 SXM5	~85,000	$3.92	~$46.10 at 100% utilization	~28M pairs/month

To get a realistic cost per pair at partial utilization, multiply the 100% throughput by the utilization rate (or, equivalently, divide the 100%-utilization cost per 1M pairs by it). At 60% utilization, the L40S + SigLIP-2 row comes out to $19.20 / 0.6, or roughly $32 per 1M pairs. That is still 3x cheaper than the Cohere API.

The break-even column shows the monthly pair volume at which running a dedicated GPU instance 24/7 costs the same as paying the Cohere API for the same volume. Below that threshold, on-demand API is cheaper because the GPU's fixed monthly cost exceeds the API variable cost. Above it, self-hosting wins.
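The utilization adjustment and the break-even column can be reproduced with a few lines of arithmetic. This sketch assumes a 730-hour month for an instance running 24/7 and uses the table's $100 per 1M pairs figure for the API; the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month with the instance running 24/7

def cost_per_million_pairs(gpu_price_per_hr, pairs_per_hr, utilization=1.0):
    """Dollars per 1M pairs when the GPU is busy only `utilization` of the time."""
    effective_pairs_per_hr = pairs_per_hr * utilization
    return gpu_price_per_hr / effective_pairs_per_hr * 1_000_000

def break_even_pairs_per_month(gpu_price_per_hr, api_cost_per_million):
    """Monthly volume at which a dedicated GPU costs the same as the API."""
    monthly_gpu_cost = gpu_price_per_hr * HOURS_PER_MONTH
    return monthly_gpu_cost / api_cost_per_million * 1_000_000

# L40S + SigLIP-2 at 60% utilization
print(round(cost_per_million_pairs(0.96, 50_000, utilization=0.6), 2))
# Break-even against a $100 per 1M pairs API, in millions of pairs per month
print(round(break_even_pairs_per_month(0.96, 100) / 1e6, 1))
print(round(break_even_pairs_per_month(1.48, 100) / 1e6, 1))
```

Plugging in your own API quote and measured throughput gives a crossover point specific to your workload.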

Pricing fluctuates based on GPU availability. The prices above are based on 02 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Latency Tuning

ONNX Export

Exporting SigLIP-2 or JinaCLIP-v2 to ONNX with Optimum reduces inference latency by 20-40% on Ampere and Hopper compared to PyTorch eager mode:

bash

pip install "optimum[onnxruntime-gpu]"

optimum-cli export onnx \
  --model google/siglip2-so400m-patch16-512 \
  --task feature-extraction \
  siglip2_onnx/

python

# Load for inference
from optimum.onnxruntime import ORTModelForFeatureExtraction
model = ORTModelForFeatureExtraction.from_pretrained("siglip2_onnx/", provider="CUDAExecutionProvider")

On an A100 80GB with FP16 ONNX export, SigLIP-2 SO400M at batch size 128 drops from ~85ms to ~55ms per batch, roughly 35% faster.

FP8 Quantization on Hopper and Blackwell

Infinity-Embedding supports FP8 inference on H100/H200/B200 via the --dtype fp8 flag. On H100 SXM5, FP8 roughly doubles throughput for CLIP-family models compared to FP16 at comparable accuracy:

bash

docker run --gpus all -p 7997:7997 \
  michaelfeil/infinity:latest v2 \
  --model-id google/siglip2-so400m-patch16-512 \
  --dtype fp8 \
  --batch-size 256 \
  --device cuda

Note: FP8 support requires the NVIDIA FP8 kernel library and is only available on sm90+ (Hopper) and sm100+ (Blackwell) GPUs. On A100 (sm80) and below, the flag silently falls back to FP16.

Dynamic Batching

Infinity-Embedding's --batch-size flag sets the maximum number of inputs processed in one forward pass, not a fixed request batch size. The server accumulates queued inputs from incoming requests up to that limit and processes them together. For online serving (low latency, unpredictable request rate), set batch-size to 64-128. For offline indexing (high throughput, predictable large batches), set it to 256-512.
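The accumulate-then-process behavior can be pictured with a toy batcher. This is an illustration of the general dynamic batching idea under simplified assumptions (a single queue, no timeouts), not Infinity-Embedding's actual implementation; the class and method names are hypothetical.

```python
from collections import deque

class DynamicBatcher:
    """Group queued items into batches of at most `max_batch_size`."""

    def __init__(self, max_batch_size):
        if max_batch_size < 1:
            raise ValueError("max_batch_size must be at least 1")
        self.max_batch_size = max_batch_size
        self.queue = deque()

    def submit(self, request_items):
        """Enqueue every item of one request (a request may carry several)."""
        self.queue.extend(request_items)

    def next_batch(self):
        """Pop up to max_batch_size items; returns [] when the queue is empty."""
        size = min(self.max_batch_size, len(self.queue))
        return [self.queue.popleft() for _ in range(size)]

batcher = DynamicBatcher(max_batch_size=4)
batcher.submit(["a", "b", "c"])  # first request
batcher.submit(["d", "e"])       # second request arrives before the GPU is free
print(batcher.next_batch())
print(batcher.next_batch())
```

Items from separate requests share a batch when they arrive close together, which is why a larger limit raises throughput for bulk indexing but can add queueing delay for interactive traffic.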

To limit queue depth and add back-pressure when traffic spikes, use the --max-concurrent-requests flag (available in recent Infinity-Embedding releases; verify the flag name against your specific container tag as it may differ across versions):

bash

docker run --gpus all -p 7997:7997 \
  michaelfeil/infinity:latest v2 \
  --model-id jinaai/jina-clip-v2 \
  --batch-size 128 \
  --max-concurrent-requests 256 \
  --device cuda
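If your server version lacks a built-in limit, the same back-pressure can be applied in a thin layer in front of it. The sketch below caps in-flight requests with a semaphore and returns 503 when saturated; the class and function names are hypothetical, and wiring it into a real proxy or web framework is left out.

```python
import threading

class AdmissionGate:
    """Reject new work once `max_in_flight` requests are already being processed."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_enter(self):
        """Take a slot and return True, or return False if none are free."""
        return self._slots.acquire(blocking=False)

    def leave(self):
        """Release a slot once the request has finished."""
        self._slots.release()

def handle(gate, work):
    """Return (status_code, result); 503 tells the caller to retry later."""
    if not gate.try_enter():
        return 503, None
    try:
        return 200, work()
    finally:
        gate.leave()  # always free the slot, even if work() raises

gate = AdmissionGate(max_in_flight=2)
print(handle(gate, lambda: "embedded"))
```

Rejecting early keeps queued image tensors from piling up in memory, at the cost of pushing retries onto the client.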

Warm-Up

CLIP models with CUDA graph optimization can have cold-start latency of 2-5 seconds on the first request. Pre-warm the model by sending a dummy request immediately after startup, before opening the container port for production traffic:

bash

# In your container entrypoint, after starting Infinity-Embedding
# The model takes 30-60 seconds to load; poll /health until the server is ready
until curl -sf http://localhost:7997/health > /dev/null 2>&1; do sleep 5; done
# Replace <served-model-name> with the value you passed to --served-model-name
# (e.g. "siglip2" for SigLIP-2, "jina-clip-v2" for JinaCLIP-v2)
curl -s -o /dev/null http://localhost:7997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "input": ["warmup"]}'

Common Failure Modes

Image normalization mismatch. The symptom is low cosine similarity between clearly similar image-text pairs, typically below 0.3 when you would expect 0.6+. The cause is using the wrong mean and std for normalization, often happening when you copy preprocessing code from a different CLIP variant (OpenAI CLIP uses different stats than SigLIP-2). Always use AutoProcessor.from_pretrained(model_id) and never hard-code normalization unless you have verified the values against the model card.

OOM under burst load. The symptom is CUDA out-of-memory errors appearing under traffic spikes that do not occur under steady-state load. The cause is unbounded request queue depth: requests pile up, each allocating image tensors, until VRAM fills. Fix: set a concurrent-request limit in Infinity-Embedding (e.g., --max-concurrent-requests; verify the exact flag name against your container version) to a value where peak memory stays below 90% of VRAM capacity. Add a load balancer health check that returns 503 when the queue depth exceeds your threshold.

Modality-gap degradation. Image-to-text retrieval accuracy drops 10-20% below benchmark numbers on your domain data. The cause is the well-documented CLIP modality gap: the image and text embedding distributions sit in different regions of the unit sphere even after training. The gap is particularly wide when your domain differs from the model's training distribution (e.g., technical diagrams versus natural photos). The mitigations are: fine-tune with contrastive loss on domain pairs, or switch to a late-interaction model like ColPali for document-heavy corpora.

Resolution mismatch. Embedding a 4K image directly at a model expecting 224x224 or 512x512 input causes the processor to resize and compress fine detail. For product catalog images where visual detail distinguishes variants (color, texture, logo), resize to a square at 2x the model's native resolution before passing to the processor. The processor will resize down, but starting from a higher resolution reduces aliasing artifacts. For screen captures and UI screenshots, preserve aspect ratio by padding rather than stretching before resize.

Voyage-multimodal-3 deployment attempts. If you try to deploy Voyage-multimodal-3 with Infinity-Embedding or vLLM, neither will find the weights because Voyage does not publish self-hostable weights for this model. Use the Voyage API directly via their Python client.
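For the resolution-mismatch case above, padding instead of stretching comes down to a small geometry calculation: scale the longer side to the target, then center the result on a square canvas. The helper below is a hypothetical sketch of that calculation only; it does not touch pixel data.

```python
def letterbox_geometry(width, height, target):
    """Return (new_w, new_h, offset_x, offset_y) for fitting an image inside a
    target x target canvas while preserving its aspect ratio."""
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    # Scale so the longer side equals the target; never round down to zero
    new_w = max(1, round(width * target / longest))
    new_h = max(1, round(height * target / longest))
    # Center the resized image; the remaining area is padding
    offset_x = (target - new_w) // 2
    offset_y = (target - new_h) // 2
    return new_w, new_h, offset_x, offset_y

# A 1920x1080 screenshot prepared at 2x the native resolution of a 512x512 model
print(letterbox_geometry(1920, 1080, 1024))
```

With an imaging library such as Pillow, you would resize to (new_w, new_h) and paste the result at (offset_x, offset_y) onto a blank square canvas before handing it to the processor.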

Multimodal RAG pipelines combining image and text embeddings run well on Spheron's L40S instances and A100 80GB GPUs, with per-minute billing and no reserved capacity minimums.

Quick Setup Guide

Choose a multimodal embedding model
Select based on modality requirements and VRAM budget. SigLIP-2 ViT-SO400M is the strongest open model for image-text similarity as of June 2026. JinaCLIP-v2 adds multilingual text support and a Matryoshka embedding option (shrink to 64-dim for storage savings). Cohere Embed-v4 is a managed API with a self-hostable weights tier; use it when you need zero infrastructure ownership. For document-heavy corpora (PDFs, slides) where you need late-interaction retrieval rather than single-vector dense search, see the ColPali approach instead at /blog/colpali-multimodal-document-rag-gpu-cloud/.
Provision a GPU on Spheron
Log in to app.spheron.ai and select a GPU instance. For dev and testing, an L40S 48GB is a practical starting point that also works for small-scale production. For high-throughput production embedding serving, an A100 80GB gives the best throughput-per-dollar. For co-located embedding plus LLM (full multimodal RAG in one node), use an H100 80GB or H200 141GB. All on-demand instances include SSH root access and support Docker with GPU pass-through.
Deploy Infinity-Embedding server
Infinity-Embedding (michaelfeil/infinity) is the recommended server for CLIP-family models. Start it with: docker run --gpus all -p 7997:7997 michaelfeil/infinity:latest v2 --model-id jinaai/jina-clip-v2 --served-model-name jina-clip-v2 --batch-size 128 --device cuda. For SigLIP-2, use the same command with --model-id google/siglip2-so400m-patch16-512. The server exposes an OpenAI-compatible /embeddings endpoint and supports separate image and text encoding endpoints.
Configure the image preprocessing pipeline
Each model has specific normalization requirements. SigLIP-2 uses mean=[0.5, 0.5, 0.5] and std=[0.5, 0.5, 0.5] with a target resolution of 512x512. JinaCLIP-v2 uses ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) at 224x224. Infinity-Embedding handles this automatically when using its API, but if embedding directly via the Transformers library, ensure you use the model's associated AutoProcessor rather than manual preprocessing.
Store embeddings in a vector database
For Qdrant, create a collection with vector_size matching the model output dim (1024 for JinaCLIP-v2, 1152 for SigLIP-2 SO400M) and distance=Cosine. Store image embeddings and text embeddings in the same collection using a payload field (modality: 'image' or modality: 'text') to filter at query time. For hybrid text + image retrieval, Qdrant's sparse-dense hybrid search lets you combine BM25 text scoring with dense multimodal similarity in one query. See the vector DB deployment guide at /blog/self-host-vector-database-gpu-cloud-qdrant-milvus-weaviate/ for full cluster setup.
Run a cost and throughput benchmark
Before going to production, measure pairs per second and cost per 1M pairs for your specific batch size and image resolution. Use Infinity-Embedding's /metrics endpoint (Prometheus format) to read embeddings/sec in real time. Compare against the equivalent Cohere Embed-v4 or OpenAI text-embedding-3 + CLIP API cost at your monthly volume. A dedicated L40S running 24/7 breaks even against the Cohere API at around 7M pairs per month; a dedicated A100 breaks even at around 11M pairs per month.

Frequently Asked Questions

What is a multimodal embedding model?

A multimodal embedding model maps images and text into a shared vector space so you can retrieve images with text queries, find text passages with image queries, or rank mixed content by relevance. Models like SigLIP-2, JinaCLIP-v2, and Cohere Embed-v4 produce a single dense vector per input regardless of modality, making them drop-in replacements for text-only embedding models in RAG pipelines that also ingest screenshots, product photos, charts, or slide decks.

How much VRAM does SigLIP-2 need for production inference?

SigLIP-2 ViT-SO400M (the most capable publicly available variant) requires roughly 3-4GB of VRAM at FP16 for the model weights. Practical batch serving needs an additional 2-6GB for activation buffers and input image tensors, so plan for at least 8GB per worker. An L40S 48GB or A100 80GB can run 4-8 concurrent SigLIP-2 workers with headroom for the image preprocessing pipeline.

Which GPU gives the best cost per million multimodal embeddings?

The L40S 48GB offers a good balance of throughput and cost for most production workloads. At $0.96/hr on-demand, a single L40S can process roughly 40,000-60,000 image-text pairs per hour at batch size 128 with JinaCLIP-v2 or SigLIP-2. The A100 80GB at $1.48/hr on-demand (or $1.19/hr spot) costs more per hour but delivers higher throughput due to its HBM2e memory bandwidth, making it the better pick for sustained high-throughput embedding workloads where GPU utilization will be consistently high.

Can I use the same vector database for multimodal and text-only embeddings?

Yes, provided every vector in a collection comes from the same model and has the same dimensionality. Cohere Embed-v4 and JinaCLIP-v2 both produce 1024-dim vectors for text and images, so their image and text embeddings can coexist in one Qdrant or Milvus collection created with matching vector dimensions and distance metric (cosine). Mixing embeddings from different model families in the same collection is not recommended, as the vector spaces are not aligned.

How much cheaper is self-hosting multimodal embeddings compared to Cohere Embed-v4's API?

Using the figures from the cost table, the Cohere Embed-v4 managed API works out to roughly $100 per 1M image-text pairs at an average of about 300 tokens per pair. An A100 80GB on Spheron at spot pricing ($1.19/hr) processes roughly 50,000-80,000 image-text pairs per hour, which is about $15-24 per 1M pairs at full utilization and up to roughly $30 per 1M pairs at 80% utilization, so self-hosting is roughly 3-6x cheaper for sustained batch indexing. The crossover point for a dedicated on-demand A100 ($1.48/hr, running 24/7) vs. the Cohere API is approximately 11M pairs per month; below that volume, the API costs less once you account for the always-on GPU cost.