SAM 3's video mode keeps a memory bank resident across the full clip duration. That's the core production constraint: the memory bank grows with sequence length and cannot be torn down between calls without losing tracking state. Per-second serverless billing is the wrong model for this workload. This guide covers GPU sizing across image and video modes, installation with official checkpoints, TensorRT export, FastAPI and Triton serving, memory bank tuning for long clips, and a cost comparison against managed CV APIs.
What Is New in SAM 3 vs SAM 2.1
SAM 2 (2024) introduced the video segmentation memory bank, allowing a mask prompt set on frame 0 to propagate through an entire clip without re-prompting. SAM 2.1 tightened tracking accuracy but kept the same core architecture.
SAM 3 makes three targeted changes:
Memory attention gating. The memory bank in SAM 2 treats all stored frames equally during attention. SAM 3 adds learned gating on the memory attention layer, so recent frames and frames with high-confidence masks get higher weight during propagation. The practical result is less identity drift on occluded objects in long clips.
Native 1024px encoder resolution. SAM 2's Hiera encoder handled high-resolution input by interpolating positional embeddings, which introduced edge artifacts on fine-grained masks. SAM 3 trains the Perception Encoder natively at 1024px, eliminating that interpolation step.
FP8 kernel support. Hopper (H100) and Blackwell (B200/B300) GPUs expose hardware FP8 tensor cores. SAM 3's image encoder ships with FP8-compatible attention and feed-forward layers, giving 1.3-1.6x throughput on those architectures without meaningful mask quality regression.
Benchmark Comparison (SAM 2.1 vs SAM 3)
The figures below are from the published SAM 3 paper (arXiv 2511.16719, released November 2025). Verify against the paper for exact numbers.
| Benchmark | SAM 2.1 | SAM 3 | Delta |
|---|---|---|---|
| J-HMDB (J&F) | 82.3 | 85.7 | +3.4 |
| MOSE (J&F) | 73.1 | 77.6 | +4.5 |
| SA-V (J&F) | 78.8 | 82.4 | +3.6 |
| DAVIS 2017 (J&F) | 82.5 | 84.2 | +1.7 |
Gains are largest on MOSE, which has heavy occlusion and crowded scenes. That's where the improved memory gating matters most.
For context on ViT encoder sizing across other vision-language tasks, see Deploy Vision-Language Models on GPU Cloud.
Hardware Sizing: VRAM Requirements Across Image, Video, and Batch Modes
The right GPU depends entirely on whether you are doing image or video inference and at what batch size.
VRAM Table
| Mode | Resolution | Prompts | Min VRAM | Recommended GPU | Notes |
|---|---|---|---|---|---|
| Image (single) | 1024px | 1-5 points | 12 GB | A100 80GB, H100 | FP16 |
| Image (batch 8) | 1024px | per image | 22 GB | A100 80GB | Encoder runs once per image |
| Image (batch 32) | 1024px | per image | 40+ GB | H100 80GB | Requires dynamic batching |
| Video (5 min, 24fps) | 1080p | 1 mask | 28 GB | H100 80GB | Memory bank 15 frames |
| Video (30 min, 24fps) | 1080p | 3 masks | 55 GB | H100 80GB | Memory bank 30 frames |
| Video (60 min, 4K) | 4K | 1 mask | 80+ GB | H200 141GB or 2x H100 | Single-GPU boundary |
GPU Tier Guide
A100 80GB (rent A100 on Spheron): Covers image batch inference up to batch 16 and video clips under 10 minutes at 1080p with a single mask track. PCIe variant works for workloads where memory bandwidth is not the bottleneck. SXM4 variant is better for training or high-throughput annotation pipelines.
H100 80GB (H100 80GB instances on Spheron): The production-grade choice for video segmentation. Handles 30-minute clips with 3 concurrent mask tracks and batch-32 image inference. HBM3 bandwidth keeps the encoder and memory bank fed. FP8 support means you can run SAM 3's optimized kernels natively without quantization overhead.
H200 141GB (rent H200 on Spheron): Removes the single-GPU VRAM ceiling for the largest jobs: 4K video, multi-track masks at 60-minute duration, or co-hosting SAM 3 alongside a large detection model. If your pipeline needs both a 7B Grounding DINO and SAM 3 resident simultaneously, this is the practical minimum for a single-node setup.
B200 192GB (rent B200 on Spheron): For 4K video with multiple simultaneous mask tracks, or multi-GPU-equivalent workloads on a single card. FP8 throughput on Blackwell is the highest available, and the memory headroom eliminates all VRAM planning for SAM 3.
Live GPU Pricing
Pricing below is fetched from the Spheron live API on 13 May 2026.
| GPU | On-Demand (from) | Spot (from) |
|---|---|---|
| A100 80GB PCIe | $1.04/hr | $1.14/hr |
| A100 80GB SXM4 | $1.64/hr | N/A |
| H100 SXM5 | $4.00/hr | $1.69/hr |
| H200 SXM5 | $4.72/hr | $1.89/hr |
| B200 SXM6 | $7.32/hr | $3.78/hr |
Pricing fluctuates based on GPU availability. The prices above are based on 13 May 2026 and may have changed. Check current GPU pricing → for live rates.
For a GPU-to-workload mapping beyond SAM 3, see GPU Requirements Cheat Sheet 2026.
Installing SAM 3: Checkpoints, Dependencies, and ONNX Export
The commands below follow SAM 2's established repository conventions. Verify all paths against the official SAM 3 release.
Python 3.12+ is required by the SAM 3 repository.
# Clone and install
git clone https://github.com/facebookresearch/sam3
cd sam3
pip install -e ".[dev]"
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128
# Download checkpoint from Hugging Face (requires access request at facebook/sam3)
pip install huggingface_hub
huggingface-cli login # authenticate with your HF token
huggingface-cli download facebook/sam3 sam3.pt --local-dir .
# Verify integrity
sha256sum sam3.pt  # match against published hash in README
For CUDA 12.6, use cu126; cu128 is the default for CUDA 12.8+.
ONNX Export
The image encoder is the compute-intensive part. Export it to ONNX first:
python scripts/export_onnx_model.py \
--checkpoint sam3.pt \
--output sam3_encoder.onnx \
  --opset 17
The prompt encoder and mask decoder stay in PyTorch. They are lightweight and benefit from dynamic shapes at runtime.
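Before building the TensorRT engine, it is worth a quick sanity check that the exported graph loads and produces the expected embedding shape. A minimal sketch with onnxruntime; the input name is read from the graph rather than assumed:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("sam3_encoder.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name  # read the name from the graph, don't hardcode it
dummy = np.random.randn(1, 3, 1024, 1024).astype(np.float32)
outputs = sess.run(None, {input_name: dummy})
print(outputs[0].shape)  # expect (1, 256, 64, 64), matching the Triton config later in this post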
TensorRT Conversion
Convert the ONNX encoder to a TensorRT engine for production throughput:
trtexec \
--onnx=sam3_encoder.onnx \
--saveEngine=sam3_encoder.trt \
--fp16 \
--minShapes=input:1x3x1024x1024 \
--optShapes=input:4x3x1024x1024 \
--maxShapes=input:16x3x1024x1024 \
  --memPoolSize=workspace:4096MiB
Optimization shape at batch 4 covers most annotation workflows. Extend to batch 16 if you are running a high-throughput pipeline with concurrent users. The engine build takes 3-8 minutes on H100; the result is serialized and reused across restarts.
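At serve time, the engine is deserialized once at process start and reused. A minimal sketch with the standard TensorRT Python runtime API (the FastAPI example later wraps this step behind build_sam3_with_trt):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("sam3_encoder.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()  # create once and reuse; context creation is not free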
For a full TensorRT engine build reference including INT4 and FP8 quantization paths, see TensorRT-LLM Production Deployment Guide.
Production Inference: FastAPI Server, Batched Prompts, and Mask Caching
FastAPI Endpoint Structure
import base64
import hashlib
from contextlib import asynccontextmanager
from typing import List, Optional
import cv2
import torch
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sam3 import SamPredictor, build_sam3_with_trt
# --- module-level state ---
predictor: Optional[SamPredictor] = None
embedding_cache: dict = {} # keyed by image SHA-256; use an LRU wrapper for bounded memory
def encode_masks_rle(masks: np.ndarray) -> list:
"""Run-length encode boolean masks for compact JSON transport."""
result = []
for mask in masks:
flat = mask.flatten(order="F").astype(np.uint8)
changes = np.diff(np.concatenate([[0], flat, [0]]))
starts = np.where(changes > 0)[0]
ends = np.where(changes < 0)[0]
counts: list = []
prev = 0
for s, e in zip(starts, ends):
counts.extend([int(s - prev), int(e - s)])
prev = e
        if prev < len(flat):  # avoid a spurious trailing zero-length run
            counts.append(int(len(flat) - prev))
result.append({"size": list(mask.shape), "counts": counts})
return result
@asynccontextmanager
async def lifespan(app: FastAPI):
global predictor
sam = build_sam3_with_trt(
encoder_engine="sam3_encoder.trt",
checkpoint="sam3.pt",
)
sam.eval().cuda()
predictor = SamPredictor(sam)
# pre-warm with a dummy forward pass to JIT any lazy CUDA kernels
dummy = torch.zeros(1, 3, 1024, 1024, device="cuda")
with torch.inference_mode():
sam.image_encoder(dummy)
yield
app = FastAPI(lifespan=lifespan)
# --- request / response ---
class SegmentRequest(BaseModel):
image_b64: str
points: Optional[List[List[float]]] = None
labels: Optional[List[int]] = None # 1=foreground, 0=background; defaults to all-foreground
boxes: Optional[List[List[float]]] = None
multimask: bool = False
@app.post("/segment")
async def segment(req: SegmentRequest):
img_bytes = base64.b64decode(req.image_b64)
img_array = np.frombuffer(img_bytes, np.uint8)
image = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
if image is None:
raise HTTPException(status_code=422, detail="Invalid image data")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# cache image embeddings by hash to skip encoder on repeated queries
img_hash = hashlib.sha256(img_bytes).hexdigest()
if img_hash not in embedding_cache:
predictor.set_image(image)
embedding_cache[img_hash] = {
"features": predictor.features,
"original_size": predictor.original_size,
"input_size": predictor.input_size,
}
else:
cached = embedding_cache[img_hash]
predictor.features = cached["features"]
predictor.original_size = cached["original_size"]
predictor.input_size = cached["input_size"]
predictor.is_image_set = True
point_coords = np.array(req.points) if req.points else None
if req.points:
labels = req.labels if req.labels else [1] * len(req.points)
point_labels = np.array(labels, dtype=np.int32)
else:
point_labels = None
box = np.array(req.boxes[0]) if req.boxes else None
masks, scores, _ = predictor.predict(
point_coords=point_coords,
point_labels=point_labels,
box=box,
multimask_output=req.multimask,
)
# return RLE-encoded masks
return {"masks": encode_masks_rle(masks), "scores": scores.tolist()}Run with a single uvicorn worker (GPU is shared state) and load-balance at the NGINX or Traefik layer:
uvicorn app:app --workers 1 --host 0.0.0.0 --port 8000
Mask Embedding Cache
The key optimization above is the embedding_cache block. In annotation workflows, the same image gets queried 10-50 times with different point prompts. The image encoder is the expensive step; the prompt encoder and mask decoder are cheap. Caching the image embedding (a 256x64x64 feature map) by image hash means all subsequent queries on the same image skip the encoder entirely.
Use a Python dict with LRU eviction for in-process caching, or Redis if you need the cache shared across multiple server replicas.
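A minimal sketch of the LRU variant using OrderedDict — this wrapper is an illustration, not part of the SAM 3 API. Size the capacity to your memory budget (each entry holds a 256x64x64 feature tensor), and swap the endpoint's raw dict operations for get/put:

from collections import OrderedDict

class LRUEmbeddingCache:
    """Bounded in-process cache; evicts the least-recently-used embedding."""

    def __init__(self, capacity: int = 256):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def put(self, key: str, value: dict) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the oldest entry

embedding_cache = LRUEmbeddingCache(capacity=256)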
Video Segmentation Pipeline: Streaming Frames, Memory Bank Tuning, Prompt Propagation
SAM3VideoPredictor Setup
from sam3 import build_sam3_video_predictor
import torch
predictor = build_sam3_video_predictor(
"sam3.pt",
memory_bank_size=15, # SAM 3 uses memory_bank_size; increase to 30 for clips over 10 minutes
)
predictor.eval().cuda()
with torch.inference_mode():
state = predictor.init_state(video_path="input.mp4")
# set prompt on frame 0
predictor.add_new_points_or_box(
state,
frame_idx=0,
obj_id=1,
points=[[500, 300]],
labels=[1], # 1 = foreground
)
# propagate and stream masks to disk
for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
    save_mask(frame_idx, masks)  # write to disk immediately, do not accumulate
Streaming masks to disk inside the loop is important. Accumulating all masks in a list before writing will exhaust RAM on long clips.
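save_mask above is left to the reader; a minimal sketch that writes one PNG per object per frame, assuming masks arrive as logit tensors (threshold at 0) as in SAM 2's video predictor:

import os
import cv2
import numpy as np

def save_mask(frame_idx: int, masks, out_dir: str = "masks") -> None:
    """Write each object's mask as an 8-bit PNG and release the tensor immediately."""
    os.makedirs(out_dir, exist_ok=True)
    for obj_idx, mask in enumerate(masks):
        binary = (mask.squeeze().float().cpu().numpy() > 0).astype(np.uint8) * 255
        cv2.imwrite(os.path.join(out_dir, f"{frame_idx:06d}_obj{obj_idx}.png"), binary)

PNG is lossless and compresses binary masks well; for downstream COCO tooling, the RLE encoder from the FastAPI section is a drop-in alternative.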
Memory Bank Sizing Guide
Each memory bank slot stores a frame's key/value tensors. VRAM cost per slot at 1080p is roughly 200-400 MB depending on the number of tracked objects.
| memory_bank_size | Best for | VRAM overhead |
|---|---|---|
| 7 (default) | Clips under 2 min, stable background | +1.4-2.8 GB |
| 15 | Clips 2-10 min, moderate camera motion | +3-6 GB |
| 30 | Clips 10-60 min, occlusions, drift mitigation | +6-12 GB |
| 60 | Hour-long clips, multi-track | +12-24 GB |
Higher values improve tracking consistency at the cost of VRAM. For a 30-minute clip at 1080p with 3 mask tracks and memory_bank_size=30, budget 55+ GB VRAM.
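A back-of-envelope check of that budget — a sketch using the 200-400 MB/slot figure from the table as its only assumption; the spread within the range reflects how many objects are being tracked:

def estimate_bank_overhead_gb(memory_bank_size: int) -> tuple:
    """Memory bank VRAM overhead at 1080p: 200-400 MB per slot (see table above).
    Returns a (low, high) planning range in GB, on top of base model VRAM."""
    return memory_bank_size * 0.2, memory_bank_size * 0.4

print(estimate_bank_overhead_gb(30))  # (6.0, 12.0) GB, matching the table row above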
Prompt Propagation Strategies
One-shot: Set the prompt on frame 0 and propagate forward. Works well for clips where the target object is clearly visible at the start and the camera motion is moderate.
Multi-shot: Re-prompt at keyframes to correct drift. After propagation, identify frames where mask confidence drops (usually around occlusions or rapid motion) and add correction points at those frames. Then run a second propagation pass.
Bidirectional: Propagate forward from the prompt frame, then propagate backward from the same frame. Merge the two mask sequences at the seam. This is the most accurate strategy for clips where the target enters mid-clip.
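A sketch of the bidirectional strategy, reusing the predictor and save_mask from above, and assuming propagate_in_video accepts start_frame_idx and reverse arguments as in SAM 2's video predictor (confirm against the SAM 3 release):

import torch

PROMPT_FRAME = 120  # frame where the target first becomes clearly visible

with torch.inference_mode():
    state = predictor.init_state(video_path="input.mp4")
    predictor.add_new_points_or_box(
        state, frame_idx=PROMPT_FRAME, obj_id=1,
        points=[[500, 300]], labels=[1],
    )
    # forward pass: prompt frame to end of clip
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(
        state, start_frame_idx=PROMPT_FRAME
    ):
        save_mask(frame_idx, masks)
    # backward pass: prompt frame back to frame 0; the seam is the prompt frame
    # itself, so frames before it come exclusively from this pass
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(
        state, start_frame_idx=PROMPT_FRAME, reverse=True
    ):
        if frame_idx < PROMPT_FRAME:
            save_mask(frame_idx, masks)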
Integration with Detection Models: Grounding DINO and YOLO-World
The combination of an open-vocabulary detector and SAM 3 gives you zero-shot segmentation from a text query. The detector finds objects matching a description; SAM 3 draws the mask.
from groundingdino.util.inference import load_image, load_model, predict
from sam3 import SamPredictor, build_sam3
import torch
import numpy as np
# --- load models ---
grounding_dino = load_model(
"groundingdino/config/GroundingDINO_SwinT_OGC.py",
"weights/groundingdino_swint_ogc.pth",
)
sam3 = build_sam3("sam3.pt")
predictor = SamPredictor(sam3.eval().cuda())
# --- run pipeline ---
def segment_by_text(image_path: str, query: str):
# 1. load and preprocess image — load_image returns the source numpy array and
# a normalized/resized torch.Tensor ready for Grounding DINO
image_source, image_tensor = load_image(image_path)
# 2. get bounding boxes from Grounding DINO
boxes, logits, phrases = predict(
model=grounding_dino,
image=image_tensor,
caption=query,
box_threshold=0.35,
text_threshold=0.25,
)
if len(boxes) == 0:
return np.array([]), np.array([])
# 3. convert boxes from normalized cxcywh to absolute xyxy using source dims,
# then pass source image to SAM 3 (predictor expects an RGB numpy array)
predictor.set_image(image_source)
H, W = image_source.shape[:2]
cx, cy, w, h = boxes[0].numpy() # groundingdino returns normalized cxcywh
box_xyxy = np.array([(cx - w/2)*W, (cy - h/2)*H, (cx + w/2)*W, (cy + h/2)*H])
masks, scores, _ = predictor.predict(
box=box_xyxy,
multimask_output=False,
)
    return masks, scores
YOLO-World is the faster alternative: it runs at 30+ FPS for detection on a single GPU, so the combined pipeline (YOLO-World + SAM 3 mask decoder) can stay below 100ms latency for 1080p images when the TensorRT encoder is pre-warmed.
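A sketch of the YOLO-World variant using the ultralytics package — the yolov8s-worldv2.pt weight name and API come from ultralytics, not the SAM 3 repo, so verify against the version you install:

from ultralytics import YOLOWorld

yolo = YOLOWorld("yolov8s-worldv2.pt")

def segment_by_text_fast(image_rgb, query: str):
    yolo.set_classes([query])  # open-vocabulary: any text label works
    results = yolo.predict(image_rgb, conf=0.35, verbose=False)
    boxes = results[0].boxes.xyxy.cpu().numpy()  # already absolute xyxy, no conversion step
    if len(boxes) == 0:
        return None, None
    predictor.set_image(image_rgb)  # reuses the SAM 3 predictor from above
    masks, scores, _ = predictor.predict(box=boxes[0], multimask_output=False)
    return masks, scores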
For co-hosting both models in Triton with an ensemble scheduler, see the Triton deployment section later in this post.
Real-World Workloads
Medical imaging. SAM 3 is used for surgical tool tracking in laparoscopic video, polyp segmentation in endoscopy, and tissue classification mask generation for pathology slides. The prompt-based interface is well-suited for radiologist-in-the-loop workflows where a single click generates the initial mask and propagation handles the rest of the clip. Running on a dedicated H200 instance on Spheron keeps patient imaging data within a controlled perimeter and provides the VRAM headroom radiology workloads need. For workloads requiring hardware-level data isolation, see Confidential GPU Computing with NVIDIA TEE.
Robotics perception. SAM 3 generates instance masks that feed directly into manipulation policies as object-centric observations. Combined with Isaac Lab simulation, you can test segmentation pipelines against synthetic data before deploying on real hardware. For Isaac Lab and GR00T setup, see Deploy NVIDIA Isaac GR00T N1 on GPU Cloud.
Content moderation. Logo detection and brand safety enforcement at video scale require segmenting and redacting specific objects frame-by-frame. At 24fps with one GPU tracking a single brand element, SAM 3 processes a 1-hour video in roughly 2-3 hours of wall time (8-12fps throughput on H100 at 1080p). Multi-GPU pipelines cut this proportionally.
Video editing and VFX. Background removal and rotoscoping are the most GPU-intensive editorial tasks in modern NLE workflows. SAM 3's propagation eliminates per-frame manual masking. For hair and fine-detail edges, running SAM 3 in multi-mask mode and compositing the results outperforms single-mask inference on complex subjects.
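A minimal sketch of that multi-mask composite, assuming a simple union of the candidate masks (the compositing operator is a choice, not prescribed by SAM 3):

import numpy as np

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,  # returns several candidate masks per prompt
)
# union the candidates; for hair and fine edges, score-weighted blending is another option
composite = np.any(masks.astype(bool), axis=0)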
Cost Comparison: Spheron vs Replicate, Roboflow, and Managed CV APIs
Scenario: segment a 1-hour video at 24fps (86,400 frames), single object tracked, 1080p resolution.
| Provider | Approach | Estimated cost for 1-hr video job | Notes |
|---|---|---|---|
| Spheron H100 SXM5 (spot) | SAM 3 self-hosted | ~$3.40-5.10 | $1.69/hr spot; 2-3 hr wall time at 8-12fps |
| Spheron A100 80GB PCIe | SAM 3 self-hosted | ~$1.04/hr of wall time | Adequate for 1080p single-track; slower than H100 |
| Spheron H200 SXM5 (spot) | SAM 3 self-hosted | ~$3.80-5.70 | $1.89/hr spot; best for multi-track or 4K |
| Replicate (H100, per-second) | SAM 2 / SAM 3 | $4-8 est. | Cold starts + per-second billing |
| Roboflow Hosted | SAM 2 | $43-108 est. | Per-image pricing at 86k frames |
| AWS Rekognition Video | Managed | $6-15 est. | Per-minute + storage; no custom prompts |
| Google Video Intelligence | Managed | $15-25 est. | Per-minute; no instance segmentation |
Competitor prices are estimates based on published rate cards as of May 2026. Verify current pricing directly with each provider before making infrastructure decisions.
The core economics: per-second serverless platforms charge for cold starts plus idle time between frame batches. SAM 3's memory bank must stay resident across the full clip, which means the GPU cannot be de-allocated between API calls without losing tracking state entirely. With Spheron's hourly billing, the GPU stays dedicated and the memory bank stays warm for the full job (2-3 hours of wall time at H100 throughput). The same dollar buys 86,400 frames or 1 frame; the per-frame cost approaches zero at scale.
For annotation pipelines processing hundreds of short clips, A100 spot instances bring the per-video cost below $1. For a full analysis of serverless vs on-demand vs reserved billing across different workload patterns, see Serverless GPU vs On-Demand vs Reserved.
Triton Deployment: Multi-Model Serving with SAM 3, Detector, and Classifier
Model Repository Structure
model_repository/
sam3_encoder/
config.pbtxt # TensorRT backend, fp16, dynamic batching
1/
model.plan # TRT engine
sam3_decoder/
config.pbtxt # ONNX backend
1/
model.onnx
grounding_dino/
config.pbtxt # TensorRT backend
1/
model.plan
cv_ensemble/
    config.pbtxt # ensemble scheduler
Dynamic Batching Config for the Encoder
# sam3_encoder/config.pbtxt
name: "sam3_encoder"
backend: "tensorrt"
max_batch_size: 16
dynamic_batching {
preferred_batch_size: [1, 4, 8]
max_queue_delay_microseconds: 5000
}
input [
{
name: "input"
data_type: TYPE_FP16
dims: [3, 1024, 1024]
}
]
output [
{
name: "image_embeddings"
data_type: TYPE_FP16
dims: [256, 64, 64]
}
]
max_queue_delay_microseconds: 5000 holds requests for 5ms to form a larger batch before dispatching. This keeps GPU utilization high on annotation pipelines with many short bursts of concurrent requests.
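Client-side, requests reach the encoder through tritonclient. A minimal sketch assuming Triton's HTTP endpoint on the default port; note that with max_batch_size set, the client still sends the full shape including the batch dimension:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

image = np.random.randn(1, 3, 1024, 1024).astype(np.float16)  # stand-in for a preprocessed frame
inp = httpclient.InferInput("input", image.shape, "FP16")
inp.set_data_from_numpy(image)

result = client.infer("sam3_encoder", inputs=[inp])
embeddings = result.as_numpy("image_embeddings")  # (1, 256, 64, 64) per the config above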
For a complete Triton setup walkthrough on Spheron, including ensemble model configuration and perf_analyzer benchmarking, see Deploy NVIDIA Triton Inference Server on GPU Cloud.
Latency Optimization: torch.compile, FP16/BF16, and FlashAttention-3
torch.compile on the Image Encoder
import torch
sam3 = build_sam3("sam3.pt").eval().cuda()
# compile only the encoder; decoder stays eager for dynamic prompt shapes
sam3.image_encoder = torch.compile(
sam3.image_encoder,
mode="max-autotune",
fullgraph=True,
)
# warm-up pass (first inference is slow; subsequent calls hit the cache)
with torch.inference_mode():
dummy = torch.randn(1, 3, 1024, 1024, device="cuda", dtype=torch.float16)
    sam3.image_encoder(dummy)
Expect 15-25% throughput gain on H100 with max-autotune. The warm-up pass is mandatory; without it, the first production request pays the compilation cost. The compilation patterns generalize across workload types; for a reference on torch.compile internals and CUDA graph capture (LLM-focused but applicable to vision encoders), see torch.compile and CUDA Graphs for LLM Inference.
BF16 vs FP16
Use BF16 on H100 and B200. The numeric range of BF16 (same exponent bits as FP32) is better suited to the Perception Encoder's attention layers than FP16's narrower range. A100's Ampere tensor cores also support BF16 natively, so either precision works there; FP16 remains the more widely exercised default on that generation.
sam3 = sam3.to(torch.bfloat16) # H100, B200
# sam3 = sam3.half()  # A100 fallback
FlashAttention-3 for Hopper and Blackwell
SAM 3's image encoder uses multi-head self-attention across the 64x64 patch grid. Swapping standard attention for FlashAttention-3 cuts HBM reads and writes in the attention layers (the FLOP count is essentially unchanged; the win is avoiding memory round-trips), with throughput gains of 20-40% on H100 and B200.
# if the SAM 3 checkpoint exposes attn_impl configuration:
sam3 = build_sam3(
"sam3.pt",
attn_impl="flash3", # requires flash-attn >= 3.0
)
Check the SAM 3 release notes to confirm attn_impl is exposed as a build parameter in your checkpoint version. For FlashAttention-4 benchmarks on Blackwell hardware, see FlashAttention-4 on Blackwell GPU Cloud Guide.
FP8 Quantization
SAM 3's encoder is compatible with FP8 on Hopper (H100) and Blackwell (B200/B300). Use transformer_engine to quantize the attention and feed-forward layers:
import transformer_engine.pytorch as te
# SAM 3's encoder ships FP8-compatible attention/FFN layers (see above);
# fp8_autocast enables FP8 execution for transformer_engine-backed modules
with te.fp8_autocast(enabled=True):
    embeddings = sam3.image_encoder(image_tensor)
On H100, expect 1.3-1.6x throughput gain with minimal mask quality degradation on standard benchmarks. Run a mask IoU comparison on your specific dataset before deploying FP8 in production, since quality sensitivity varies by scene type.
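A minimal harness for that comparison — run identical prompts through the FP16 and FP8 builds and compare masks pairwise (a sketch; how you stage the two builds is up to your pipeline):

import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of identical shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def fp8_quality_report(masks_fp16: list, masks_fp8: list) -> None:
    """masks_fp16/masks_fp8: per-prompt masks from the two builds, in the same order."""
    ious = [mask_iou(a, b) for a, b in zip(masks_fp16, masks_fp8)]
    print(f"mean IoU {np.mean(ious):.4f}, min IoU {min(ious):.4f}")
    # gate FP8 rollout on a threshold suited to your scenes, e.g. mean IoU > 0.98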
SAM 3 video segmentation keeps the memory bank alive for the full clip. That's the opposite of what per-second serverless pricing is built for. On Spheron, rent a dedicated H100 or H200 by the hour with NVMe-backed storage for mask outputs and bare-metal CUDA access for custom memory bank extensions. Spot pricing on A100 and H100 SXM5 brings the cost down further for batch annotation jobs.
