Tutorial

Deploy SAM 3 on GPU Cloud: Production Image and Video Segmentation Setup Guide for Meta's Segment Anything Model 3 (2026)

Written by Mitrasish, Co-founder · May 13, 2026

SAM 3's video mode keeps a memory bank resident across the full clip duration. That's the core production constraint: the memory bank grows with sequence length and cannot be torn down between calls without losing tracking state. Per-second serverless billing is the wrong model for this workload. This guide covers GPU sizing across image and video modes, installation with official checkpoints, TensorRT export, FastAPI and Triton serving, memory bank tuning for long clips, and a cost comparison against managed CV APIs.

What Is New in SAM 3 vs SAM 2.1

SAM 2 (2024) introduced the video segmentation memory bank, allowing a mask prompt set on frame 0 to propagate through an entire clip without re-prompting. SAM 2.1 tightened tracking accuracy but kept the same core architecture.

SAM 3 makes three targeted changes:

Memory attention gating. The memory bank in SAM 2 treats all stored frames equally during attention. SAM 3 adds learned gating on the memory attention layer, so recent frames and frames with high-confidence masks get higher weight during propagation. The practical result is less identity drift on occluded objects in long clips.

Native 1024px encoder resolution. SAM 2's Hiera encoder handled high-resolution input by interpolating positional embeddings, which introduced edge artifacts on fine-grained masks. SAM 3 trains the Perception Encoder natively at 1024px, eliminating that interpolation step.

FP8 kernel support. Hopper (H100) and Blackwell (B200/B300) GPUs expose hardware FP8 tensor cores. SAM 3's image encoder ships with FP8-compatible attention and feed-forward layers, giving 1.3-1.6x throughput on those architectures without meaningful mask quality regression.

Benchmark Comparison (SAM 2.1 vs SAM 3)

The figures below are from the published SAM 3 paper (arXiv 2511.16719, released November 2025). Verify against the paper for exact numbers.

| Benchmark | SAM 2.1 | SAM 3 | Delta |
|---|---|---|---|
| J-HMDB (J&F) | 82.3 | 85.7 | +3.4 |
| MOSE (J&F) | 73.1 | 77.6 | +4.5 |
| SA-V (J&F) | 78.8 | 82.4 | +3.6 |
| DAVIS 2017 (J&F) | 82.5 | 84.2 | +1.7 |

Gains are largest on MOSE, which has heavy occlusion and crowded scenes. That's where the improved memory gating matters most.

For context on ViT encoder sizing across other vision-language tasks, see Deploy Vision-Language Models on GPU Cloud.

Hardware Sizing: VRAM Requirements Across Image, Video, and Batch Modes

The right GPU depends entirely on whether you are doing image or video inference and at what batch size.

VRAM Table

| Mode | Resolution | Prompts | Min VRAM | Recommended GPU | Notes |
|---|---|---|---|---|---|
| Image (single) | 1024px | 1-5 points | 12 GB | A100 80GB, H100 | FP16 |
| Image (batch 8) | 1024px | per image | 22 GB | A100 80GB | Encoder runs once per image |
| Image (batch 32) | 1024px | per image | 40+ GB | H100 80GB | Requires dynamic batching |
| Video (5 min, 24fps) | 1080p | 1 mask | 28 GB | H100 80GB | Memory bank 15 frames |
| Video (30 min, 24fps) | 1080p | 3 masks | 55 GB | H100 80GB | Memory bank 30 frames |
| Video (60 min, 4K) | 4K | 1 mask | 80+ GB | H200 141GB or 2x H100 | Single-GPU boundary |

GPU Tier Guide

A100 80GB (rent A100 on Spheron): Covers image batch inference up to batch 16 and video clips under 10 minutes at 1080p with a single mask track. PCIe variant works for workloads where memory bandwidth is not the bottleneck. SXM4 variant is better for training or high-throughput annotation pipelines.

H100 80GB (H100 80GB instances on Spheron): The production-grade choice for video segmentation. Handles 30-minute clips with 3 concurrent mask tracks and batch-32 image inference. HBM3 bandwidth keeps the encoder and memory bank fed. FP8 support means you can run SAM 3's optimized kernels natively without quantization overhead.

H200 141GB (rent H200 on Spheron): Removes the single-GPU VRAM ceiling for the largest jobs: 4K video, multi-track masks at 60-minute duration, or co-hosting SAM 3 alongside a large detection model. If your pipeline needs both a 7B Grounding DINO and SAM 3 resident simultaneously, this is the practical minimum for a single-node setup.

B200 192GB (rent B200 on Spheron): For 4K video with multiple simultaneous mask tracks, or multi-GPU-equivalent workloads on a single card. FP8 throughput on Blackwell is the highest available, and the memory headroom eliminates all VRAM planning for SAM 3.

Live GPU Pricing

Pricing below is fetched from the Spheron live API on 13 May 2026.

| GPU | On-Demand (from) | Spot (from) |
|---|---|---|
| A100 80GB PCIe | $1.04/hr | $1.14/hr |
| A100 80GB SXM4 | $1.64/hr | N/A |
| H100 SXM5 | $4.00/hr | $1.69/hr |
| H200 SXM5 | $4.72/hr | $1.89/hr |
| B200 SXM6 | $7.32/hr | $3.78/hr |

Pricing fluctuates based on GPU availability. The prices above are based on 13 May 2026 and may have changed. Check current GPU pricing → for live rates.

For a GPU-to-workload mapping beyond SAM 3, see GPU Requirements Cheat Sheet 2026.

Installing SAM 3: Checkpoints, Dependencies, and ONNX Export

The commands below follow SAM 2's established repository conventions. Verify all paths against the official SAM 3 release.

Python 3.12+ is required by the SAM 3 repository.

bash
# Clone and install
git clone https://github.com/facebookresearch/sam3
cd sam3
pip install -e ".[dev]"
pip install torch==2.10.0 torchvision --index-url https://download.pytorch.org/whl/cu128

# Download checkpoint from Hugging Face (requires access request at facebook/sam3)
pip install huggingface_hub
huggingface-cli login  # authenticate with your HF token
huggingface-cli download facebook/sam3 sam3.pt --local-dir .

# Verify integrity
sha256sum sam3.pt  # match against published hash in README

For CUDA 12.6, use cu126; cu128 is the default for CUDA 12.8+.

ONNX Export

The image encoder is the compute-intensive part. Export it to ONNX first:

bash
python scripts/export_onnx_model.py \
  --checkpoint sam3.pt \
  --output sam3_encoder.onnx \
  --opset 17

The prompt encoder and mask decoder stay in PyTorch. They are lightweight and benefit from dynamic shapes at runtime.

TensorRT Conversion

Convert the ONNX encoder to a TensorRT engine for production throughput:

bash
trtexec \
  --onnx=sam3_encoder.onnx \
  --saveEngine=sam3_encoder.trt \
  --fp16 \
  --minShapes=input:1x3x1024x1024 \
  --optShapes=input:4x3x1024x1024 \
  --maxShapes=input:16x3x1024x1024 \
  --memPoolSize=workspace:4096MiB

Optimization shape at batch 4 covers most annotation workflows. Extend to batch 16 if you are running a high-throughput pipeline with concurrent users. The engine build takes 3-8 minutes on H100; the result is serialized and reused across restarts.

For a full TensorRT engine build reference including INT4 and FP8 quantization paths, see TensorRT-LLM Production Deployment Guide.

Production Inference: FastAPI Server, Batched Prompts, and Mask Caching

FastAPI Endpoint Structure

python
import base64
import hashlib
from contextlib import asynccontextmanager
from typing import List, Optional

import cv2
import torch
import numpy as np
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sam3 import SamPredictor, build_sam3_with_trt

# --- module-level state ---
predictor: Optional[SamPredictor] = None
embedding_cache: dict = {}  # keyed by image SHA-256; use an LRU wrapper for bounded memory

def encode_masks_rle(masks: np.ndarray) -> list:
    """Run-length encode boolean masks for compact JSON transport."""
    result = []
    for mask in masks:
        flat = mask.flatten(order="F").astype(np.uint8)
        changes = np.diff(np.concatenate([[0], flat, [0]]))
        starts = np.where(changes > 0)[0]
        ends = np.where(changes < 0)[0]
        counts: list = []
        prev = 0
        for s, e in zip(starts, ends):
            counts.extend([int(s - prev), int(e - s)])
            prev = e
        counts.append(int(len(flat) - prev))
        result.append({"size": list(mask.shape), "counts": counts})
    return result

@asynccontextmanager
async def lifespan(app: FastAPI):
    global predictor
    sam = build_sam3_with_trt(
        encoder_engine="sam3_encoder.trt",
        checkpoint="sam3.pt",
    )
    sam.eval().cuda()
    predictor = SamPredictor(sam)
    # pre-warm with a dummy forward pass to JIT any lazy CUDA kernels
    dummy = torch.zeros(1, 3, 1024, 1024, device="cuda")
    with torch.inference_mode():
        sam.image_encoder(dummy)
    yield

app = FastAPI(lifespan=lifespan)

# --- request / response ---
class SegmentRequest(BaseModel):
    image_b64: str
    points: Optional[List[List[float]]] = None
    labels: Optional[List[int]] = None  # 1=foreground, 0=background; defaults to all-foreground
    boxes: Optional[List[List[float]]] = None
    multimask: bool = False

@app.post("/segment")
async def segment(req: SegmentRequest):
    img_bytes = base64.b64decode(req.image_b64)
    img_array = np.frombuffer(img_bytes, np.uint8)
    image = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
    if image is None:
        raise HTTPException(status_code=422, detail="Invalid image data")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # cache image embeddings by hash to skip encoder on repeated queries
    img_hash = hashlib.sha256(img_bytes).hexdigest()
    if img_hash not in embedding_cache:
        predictor.set_image(image)
        embedding_cache[img_hash] = {
            "features": predictor.features,
            "original_size": predictor.original_size,
            "input_size": predictor.input_size,
        }
    else:
        cached = embedding_cache[img_hash]
        predictor.features = cached["features"]
        predictor.original_size = cached["original_size"]
        predictor.input_size = cached["input_size"]
        predictor.is_image_set = True

    point_coords = np.array(req.points) if req.points else None
    if req.points:
        labels = req.labels if req.labels else [1] * len(req.points)
        point_labels = np.array(labels, dtype=np.int32)
    else:
        point_labels = None
    box = np.array(req.boxes[0]) if req.boxes else None

    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        box=box,
        multimask_output=req.multimask,
    )

    # return RLE-encoded masks
    return {"masks": encode_masks_rle(masks), "scores": scores.tolist()}

Run with a single uvicorn worker (GPU is shared state) and load-balance at the NGINX or Traefik layer:

bash
uvicorn app:app --workers 1 --host 0.0.0.0 --port 8000
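Clients consuming /segment need to invert encode_masks_rle on their side. A minimal decoder matching that format (column-major runs, counts alternating zero-run and one-run lengths, starting with zeros):

```python
import numpy as np

def decode_masks_rle(rle: dict) -> np.ndarray:
    """Invert the run-length encoding produced by encode_masks_rle.

    counts alternates [zeros, ones, zeros, ...] over the column-major
    (Fortran-order) flattened mask.
    """
    flat = np.zeros(int(np.prod(rle["size"])), dtype=bool)
    pos, value = 0, False
    for count in rle["counts"]:
        if value:
            flat[pos:pos + count] = True
        pos += count
        value = not value
    return flat.reshape(rle["size"], order="F")

# tiny 2x2 example: counts [0, 1, 2, 1, 0] decodes to [[T, F], [F, T]]
mask = decode_masks_rle({"size": [2, 2], "counts": [0, 1, 2, 1, 0]})
assert mask[0, 0] and mask[1, 1]
```

Round-tripping a few masks through encode and decode in your test suite is a cheap guard against the two sides drifting out of sync.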

Mask Embedding Cache

The key optimization above is the embedding_cache block. In annotation workflows, the same image gets queried 10-50 times with different point prompts. The image encoder is the expensive step; the prompt encoder and mask decoder are cheap. Caching the 256-channel image embedding by content hash means all subsequent queries on the same image skip the encoder entirely.

Use a Python dict with LRU eviction for in-process caching, or Redis if you need the cache shared across multiple server replicas.
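A minimal sketch of that bounded LRU wrapper, assuming single-process use (for multi-replica setups the same get/put interface maps onto Redis with TTL eviction):

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache: evicts the least-recently-used entry past capacity."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # drop the oldest entry

# replaces the module-level dict, switching `in` / [] access to get()/put():
# embedding_cache = LRUCache(capacity=256)
```

Size the capacity so that cached embeddings (a few MB each at 256x64x64 FP16) stay well under host RAM.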

Video Segmentation Pipeline: Streaming Frames, Memory Bank Tuning, Prompt Propagation

SAM3VideoPredictor Setup

python
from sam3 import build_sam3_video_predictor
import torch

predictor = build_sam3_video_predictor(
    "sam3.pt",
    memory_bank_size=15,  # SAM 3 uses memory_bank_size; increase to 30 for clips over 10 minutes
)
predictor.eval().cuda()

with torch.inference_mode():
    state = predictor.init_state(video_path="input.mp4")

    # set prompt on frame 0
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=[[500, 300]],
        labels=[1],  # 1 = foreground
    )

    # propagate and stream masks to disk
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        save_mask(frame_idx, masks)  # write to disk immediately, do not accumulate

Streaming masks to disk inside the loop is important. Accumulating all masks in a list before writing will exhaust RAM on long clips.

Memory Bank Sizing Guide

Each memory bank slot stores a frame's key/value tensors. VRAM cost per slot at 1080p is roughly 200-400 MB depending on the number of tracked objects.

| memory_bank_size | Best for | VRAM overhead |
|---|---|---|
| 7 (default) | Clips under 2 min, stable background | +1.4-2.8 GB |
| 15 | Clips 2-10 min, moderate camera motion | +3-6 GB |
| 30 | Clips 10-60 min, occlusions, drift mitigation | +6-12 GB |
| 60 | Hour-long clips, multi-track | +12-24 GB |

Higher values improve tracking consistency at the cost of VRAM. For a 30-minute clip at 1080p with 3 mask tracks and memory_bank_size=30, budget 55+ GB VRAM.
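The overhead column above is just slot count times the 200-400 MB per-slot range; a quick budgeting helper (the per-slot figures are the rough estimates from this section, not measured constants):

```python
def memory_bank_overhead_gb(slots: int, mb_per_slot=(200, 400)) -> tuple:
    """Estimated VRAM overhead range in GB for a given memory bank size."""
    low, high = mb_per_slot
    return (slots * low / 1000, slots * high / 1000)

for slots in (7, 15, 30, 60):
    lo, hi = memory_bank_overhead_gb(slots)
    print(f"memory_bank_size={slots}: +{lo:.1f}-{hi:.1f} GB")
```

Add this overhead to the base model plus frame-buffer footprint when picking a GPU tier.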

Prompt Propagation Strategies

One-shot: Set the prompt on frame 0 and propagate forward. Works well for clips where the target object is clearly visible at the start and the camera motion is moderate.

Multi-shot: Re-prompt at keyframes to correct drift. After propagation, identify frames where mask confidence drops (usually around occlusions or rapid motion) and add correction points at those frames. Then run a second propagation pass.

Bidirectional: Propagate forward from the prompt frame, then propagate backward from the same frame. Merge the two mask sequences at the seam. This is the most accurate strategy for clips where the target enters mid-clip.
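The keyframe-selection step in the multi-shot strategy can be sketched as a scan over per-frame mask confidence, flagging dips below a threshold with a minimum spacing so one occlusion event does not produce a cluster of corrections. The scores list here is a hypothetical output of a first propagation pass; SAM's per-frame predicted IoU scores fill this role in practice:

```python
def reprompt_keyframes(scores, threshold=0.7, min_gap=24):
    """Frame indices where mask confidence drops below threshold,
    spaced at least min_gap frames apart."""
    keyframes = []
    for idx, score in enumerate(scores):
        if score < threshold:
            if not keyframes or idx - keyframes[-1] >= min_gap:
                keyframes.append(idx)
    return keyframes

# e.g. confidence collapse around an occlusion spanning frames 100-129
scores = [0.9] * 100 + [0.5] * 30 + [0.9] * 100
print(reprompt_keyframes(scores))  # [100, 124]
```

Add correction points at the returned frames, then run the second propagation pass.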

Integration with Detection Models: Grounding DINO and YOLO-World

The combination of an open-vocabulary detector and SAM 3 gives you zero-shot segmentation from a text query. The detector finds objects matching a description; SAM 3 draws the mask.

python
from groundingdino.util.inference import load_image, load_model, predict
from sam3 import SamPredictor, build_sam3
import torch
import numpy as np

# --- load models ---
grounding_dino = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
sam3 = build_sam3("sam3.pt")
predictor = SamPredictor(sam3.eval().cuda())

# --- run pipeline ---
def segment_by_text(image_path: str, query: str):
    # 1. load and preprocess image — load_image returns the source numpy array and
    #    a normalized/resized torch.Tensor ready for Grounding DINO
    image_source, image_tensor = load_image(image_path)

    # 2. get bounding boxes from Grounding DINO
    boxes, logits, phrases = predict(
        model=grounding_dino,
        image=image_tensor,
        caption=query,
        box_threshold=0.35,
        text_threshold=0.25,
    )
    if len(boxes) == 0:
        return np.array([]), np.array([])

    # 3. convert boxes from normalized cxcywh to absolute xyxy using source dims,
    #    then pass source image to SAM 3 (predictor expects an RGB numpy array)
    predictor.set_image(image_source)
    H, W = image_source.shape[:2]
    cx, cy, w, h = boxes[0].numpy()  # groundingdino returns normalized cxcywh
    box_xyxy = np.array([(cx - w/2)*W, (cy - h/2)*H, (cx + w/2)*W, (cy + h/2)*H])
    masks, scores, _ = predictor.predict(
        box=box_xyxy,
        multimask_output=False,
    )
    return masks, scores

YOLO-World is the faster alternative: it runs at 30+ FPS for detection on a single GPU, so the combined pipeline (YOLO-World + SAM 3 mask decoder) can stay below 100ms latency for 1080p images when the TensorRT encoder is pre-warmed.

For co-hosting both models in Triton with an ensemble scheduler, see the Triton deployment section later in this post.

Real-World Workloads

Medical imaging. SAM 3 is used for surgical tool tracking in laparoscopic video, polyp segmentation in endoscopy, and tissue classification mask generation for pathology slides. The prompt-based interface is well-suited for radiologist-in-the-loop workflows where a single click generates the initial mask and propagation handles the rest of the clip. Running on a dedicated H200 instance on Spheron keeps patient imaging data within a controlled perimeter and provides the VRAM headroom radiology workloads need. For workloads requiring hardware-level data isolation, see Confidential GPU Computing with NVIDIA TEE.

Robotics perception. SAM 3 generates instance masks that feed directly into manipulation policies as object-centric observations. Combined with Isaac Lab simulation, you can test segmentation pipelines against synthetic data before deploying on real hardware. For Isaac Lab and GR00T setup, see Deploy NVIDIA Isaac GR00T N1 on GPU Cloud.

Content moderation. Logo detection and brand safety enforcement at video scale require segmenting and redacting specific objects frame-by-frame. At 24fps with one GPU tracking a single brand element, SAM 3 processes a 1-hour video in roughly 2-3 hours of wall time (8-12fps throughput on H100 at 1080p). Multi-GPU pipelines cut this proportionally.
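The 2-3 hour figure follows directly from the frame count and throughput range; a sketch of the arithmetic, using the 8-12 fps H100 estimate above:

```python
def wall_time_hours(duration_min: float, source_fps: float,
                    throughput_fps: float) -> float:
    """Wall-clock hours to segment a clip at a given inference throughput."""
    frames = duration_min * 60 * source_fps  # total frames in the clip
    return frames / throughput_fps / 3600

# 1-hour clip at 24fps (86,400 frames), single-track 1080p on H100
print(wall_time_hours(60, 24, 12))  # 2.0 hours at the optimistic end
print(wall_time_hours(60, 24, 8))   # 3.0 hours at the conservative end
```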

Video editing and VFX. Background removal and rotoscoping are the most GPU-intensive editorial tasks in modern NLE workflows. SAM 3's propagation eliminates per-frame manual masking. For hair and fine-detail edges, running SAM 3 in multi-mask mode and compositing the results outperforms single-mask inference on complex subjects.

Cost Comparison: Spheron vs Replicate, Roboflow, and Managed CV APIs

Scenario: segment a 1-hour video at 24fps (86,400 frames), single object tracked, 1080p resolution.

| Provider | Approach | Estimated cost for 1-hr video job | Notes |
|---|---|---|---|
| Spheron H100 SXM5 (spot) | SAM 3 self-hosted | ~$1.69 flat | Spot pricing; check availability |
| Spheron A100 80GB PCIe | SAM 3 self-hosted | ~$1.04 flat | Adequate for 1080p single-track |
| Spheron H200 SXM5 (spot) | SAM 3 self-hosted | ~$1.89 flat | Best for multi-track or 4K |
| Replicate (H100, per-second) | SAM 2 / SAM 3 | $4-8 est. | Cold starts + per-second billing |
| Roboflow Hosted | SAM 2 | $43-108 est. | Per-image pricing at 86k frames |
| AWS Rekognition Video | Managed | $6-15 est. | Per-minute + storage; no custom prompts |
| Google Video Intelligence | Managed | $15-25 est. | Per-minute; no instance segmentation |

Competitor prices are estimates based on published rate cards as of May 2026. The self-hosted figures reflect roughly one GPU-hour of compute; scale them by actual wall-clock time for your throughput. Verify current pricing directly with each provider before making infrastructure decisions.

The core economics: per-second serverless platforms charge for cold starts plus idle time between frame batches. SAM 3's memory bank must stay resident across the full clip, which means the GPU cannot be de-allocated between API calls without losing tracking state entirely. With Spheron's hourly billing, the GPU stays dedicated and the memory bank stays warm for the entire job. An hour of GPU time costs the same whether you process one frame or 86,400, so the per-frame cost approaches zero at scale.

For annotation pipelines processing hundreds of videos, A100 spot instances bring the per-video cost below $1. For a full analysis of serverless vs on-demand vs reserved billing across different workload patterns, see Serverless GPU vs On-Demand vs Reserved.
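On hourly billing, per-video cost is just the rate times wall-clock hours. A quick sketch, using the rates from the pricing table above and assuming ~2.5 hours of wall time for a 1-hour 1080p single-track job (the throughput estimate from the workloads section):

```python
def job_cost(hourly_rate: float, wall_hours: float) -> float:
    """Total cost of a dedicated instance held for the whole job."""
    return hourly_rate * wall_hours

wall = 2.5  # assumed wall-clock hours for a 1-hr 1080p single-track video
print(f"H100 SXM5 spot:       ${job_cost(1.69, wall):.2f}")
print(f"A100 80GB PCIe:       ${job_cost(1.04, wall):.2f}")
print(f"H200 SXM5 spot:       ${job_cost(1.89, wall):.2f}")
```

Amortized across a queue of videos on one long-lived instance, the effective per-video cost drops further since there is no per-job cold start.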

Triton Deployment: Multi-Model Serving with SAM 3, Detector, and Classifier

Model Repository Structure

model_repository/
  sam3_encoder/
    config.pbtxt        # TensorRT backend, fp16, dynamic batching
    1/
      model.plan        # TRT engine
  sam3_decoder/
    config.pbtxt        # ONNX backend
    1/
      model.onnx
  grounding_dino/
    config.pbtxt        # TensorRT backend
    1/
      model.plan
  cv_ensemble/
    config.pbtxt        # ensemble scheduler

Dynamic Batching Config for the Encoder

protobuf
# sam3_encoder/config.pbtxt
name: "sam3_encoder"
backend: "tensorrt"
max_batch_size: 16

dynamic_batching {
  preferred_batch_size: [1, 4, 8]
  max_queue_delay_microseconds: 5000
}

input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [3, 1024, 1024]
  }
]
output [
  {
    name: "image_embeddings"
    data_type: TYPE_FP16
    dims: [256, 64, 64]
  }
]

max_queue_delay_microseconds: 5000 holds requests for 5ms to form a larger batch before dispatching. This keeps GPU utilization high on annotation pipelines with many short bursts of concurrent requests.

For a complete Triton setup walkthrough on Spheron, including ensemble model configuration and perf_analyzer benchmarking, see Deploy NVIDIA Triton Inference Server on GPU Cloud.

Latency Optimization: torch.compile, FP16/BF16, and FlashAttention-3

torch.compile on the Image Encoder

python
import torch

sam3 = build_sam3("sam3.pt").eval().cuda()
# compile only the encoder; decoder stays eager for dynamic prompt shapes
sam3.image_encoder = torch.compile(
    sam3.image_encoder,
    mode="max-autotune",
    fullgraph=True,
)

# warm-up pass (first inference is slow; subsequent calls hit the cache)
with torch.inference_mode():
    dummy = torch.randn(1, 3, 1024, 1024, device="cuda", dtype=torch.float16)
    sam3.image_encoder(dummy)

Expect 15-25% throughput gain on H100 with max-autotune. The warm-up pass is mandatory; without it, the first production request pays the compilation cost. The compilation patterns generalize across workload types; for a reference on torch.compile internals and CUDA graph capture (LLM-focused but applicable to vision encoders), see torch.compile and CUDA Graphs for LLM Inference.

BF16 vs FP16

Use BF16 on H100 and B200. The numeric range of BF16 (same exponent bits as FP32) is better suited to the Perception Encoder's attention layers than FP16's narrower range. A100 also has BF16 tensor cores, but FP16 remains the safer default there since more inference kernels are tuned for it.

python
sam3 = sam3.to(torch.bfloat16)  # H100, B200
# sam3 = sam3.half()            # A100 fallback

FlashAttention-3 for Hopper and Blackwell

SAM 3's image encoder uses multi-head self-attention across the 64x64 patch grid. Swapping standard attention for FlashAttention-3 reduces attention FLOPS and memory reads, with throughput gains of 20-40% on H100 and B200.

python
# if the SAM 3 checkpoint exposes attn_impl configuration:
sam3 = build_sam3(
    "sam3.pt",
    attn_impl="flash3",  # requires flash-attn >= 3.0
)

Check the SAM 3 release notes to confirm attn_impl is exposed as a build parameter in your checkpoint version. For FlashAttention-4 benchmarks on Blackwell hardware, see FlashAttention-4 on Blackwell GPU Cloud Guide.

FP8 Quantization

SAM 3's encoder is compatible with FP8 on Hopper (H100) and Blackwell (B200/B300). Use transformer_engine to quantize the attention and feed-forward layers:

python
import transformer_engine.pytorch as te

# replace encoder attention layers with FP8-compatible variants
with te.fp8_autocast(enabled=True):
    embeddings = sam3.image_encoder(image_tensor)

On H100, expect 1.3-1.6x throughput gain with minimal mask quality degradation on standard benchmarks. Run a mask IoU comparison on your specific dataset before deploying FP8 in production, since quality sensitivity varies by scene type.
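A minimal IoU harness for that FP16-vs-FP8 comparison; the two mask stacks stand in for the outputs of the two precision paths on identical prompts:

```python
import numpy as np

def mean_mask_iou(masks_a: np.ndarray, masks_b: np.ndarray) -> float:
    """Mean IoU between two aligned stacks of boolean masks, shape (N, H, W)."""
    ious = []
    for a, b in zip(masks_a.astype(bool), masks_b.astype(bool)):
        union = np.logical_or(a, b).sum()
        if union == 0:
            ious.append(1.0)  # both masks empty: count as perfect agreement
            continue
        ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

# gate the FP8 rollout on a dataset-level threshold, e.g. mean IoU >= 0.99
```

Run both precision paths over a held-out slice of your own data and compare before switching production traffic to FP8.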


SAM 3 video segmentation keeps the memory bank alive for the full clip. That's the opposite of what per-second serverless pricing is built for. On Spheron, rent a dedicated H100 or H200 by the hour with NVMe-backed storage for mask outputs and bare-metal CUDA access for custom memory bank extensions. Spot pricing on A100 and H100 SXM5 brings the cost down further for batch annotation jobs.

Rent H100 → | Rent A100 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.