Deploy NVIDIA Holoscan on GPU Cloud: Real-Time Sensor AI for Medical Imaging and Industrial Inspection (2026)

Written by Mitrasish, Co-founder | May 18, 2026

Holoscan runs endoscopy AI at sub-5ms frame latency on device, but the same SDK runs on GPU cloud for three practical reasons: model development and training on recorded sensor data, batch reprocessing of historical archives, and fleet inference where edge hardware is cost-prohibitive. For the broader trade-off between running AI on cloud versus edge hardware, the hybrid cloud edge AI inference guide covers the decision framework in detail.

This guide focuses specifically on Holoscan: what it is, how its GXF architecture works, which GPU to pick on cloud, and how to deploy production pipelines for medical imaging and industrial inspection workloads.

What Holoscan Is

Holoscan is a streaming graph executor, not a model server and not a training framework. It processes continuous sensor streams frame-by-frame using a directed graph of operators connected by zero-copy GPU buffers.

The core concepts:

  • GXF (Graph Execution Framework): The underlying C++ runtime that Holoscan 4.x builds on. Operators are nodes; connections are edges; message passing between operators happens on GPU buffers.
  • Zero-copy buffer handoff: Holoscan tensors wrap a CUDA device pointer with shape and dtype metadata. When two operators on the same GPU exchange tensors, no memory copy occurs. Frames never leave the GPU between operators unless you explicitly request it.
  • HoloHub: A community operator library spanning sensors, codecs, and visualization. Most real pipelines use at least a few HoloHub operators.
  • Primary deployment targets: Medical devices (endoscopes, ultrasound probes, surgical robots), industrial systems (quality inspection cameras, radar), and research labs building new sensor AI models.

The key distinction from other inference runtimes: Holoscan processes continuous sensor streams where every frame has a deadline, not discrete requests that can be batched and queued. That design choice drives everything else, including why zero-copy matters and why the multi-thread scheduler exists.

Holoscan vs Triton vs Dynamo: Which Runtime to Use

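Runtime  | Primary Use                    | Data Model         | Latency Target
---------|--------------------------------|--------------------|-------------------
Holoscan | Sensor/imaging pipelines       | Continuous streams | Sub-10ms per frame
Triton   | Request/response model serving | Discrete requests  | 10-100ms p99
Dynamo   | Disaggregated LLM inference    | Token streams      | TTFT + TPOT SLAs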

Holoscan wins when the input is a continuous sensor stream and per-frame latency matters. Endoscopy video at 30fps gives you a budget of roughly 33ms per frame. Ultrasound at 60fps gives you roughly 16.7ms. Holoscan's event-driven scheduler and zero-copy buffer handoff keep inter-operator latency under 1ms on a single GPU, which is what makes those budgets achievable in practice.
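The frame budgets above follow directly from the frame rate. A minimal sketch of that arithmetic, where the per-stage latencies are hypothetical placeholders rather than measurements:

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame before the next one arrives."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def fits_budget(stage_latencies_ms, fps: float) -> bool:
    """True if the summed per-stage latency fits within one frame interval."""
    return sum(stage_latencies_ms) <= frame_budget_ms(fps)


# Hypothetical latencies for decode, preprocess, inference, postprocess (ms).
stages = [1.5, 1.0, 4.0, 0.5]

for fps in (30, 60):
    budget = frame_budget_ms(fps)
    print(f"{fps} fps: budget {budget:.1f} ms, fits={fits_budget(stages, fps)}")
```

Summing stage latencies assumes the stages run back to back for a single frame; a pipelined graph can overlap stages across frames, so this is a conservative check.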

Triton wins when you have discrete inference requests and need multi-model batching, dynamic batching across concurrent clients, or a multi-framework model repository. A medical imaging API that accepts DICOM uploads via REST and runs a series of segmentation models is a Triton workload, not a Holoscan workload. The Triton inference server deployment guide covers that pattern in detail.

Dynamo handles disaggregated LLM inference where prefill and decode run on separate pools of GPUs. That problem space doesn't overlap with sensor pipelines. See the NVIDIA Dynamo disaggregated inference guide if you're working on LLM throughput optimization.

The decision rule is simple: if the input is a continuous sensor stream and you care about per-frame latency, use Holoscan. If you need to serve a REST endpoint handling thousands of discrete model requests, use Triton.

Holoscan Architecture: GXF, Operators, and Zero-Copy Buffer Handoff

The GXF Runtime

GXF is the underlying C++ graph execution engine. Holoscan 4.x provides Python and C++ bindings on top of it.

Operators are the units of computation. Each operator has a compute() method that fires on each tick of the scheduler. Operators declare input and output ports; the runtime wires them together.

Schedulers control execution order:

  • Greedy (single-threaded): default for development. Runs operators in dependency order, blocking until each completes.
  • Multi-thread: parallel operator execution for production pipelines where independent branches can process simultaneously.

Conditions control when an operator's compute() fires. DownstreamMessageAffordableCondition prevents a producer from pushing frames faster than the downstream operator can consume them. MessageAvailableCondition blocks a consumer until input is ready. These two conditions together implement the backpressure and synchronization that make real-time streaming reliable.
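To make the backpressure idea concrete, here is a plain-Python simulation. It does not use the Holoscan API; `BoundedPort`, `has_space`, and `has_message` are illustrative names standing in for a connection and for the two conditions described above.

```python
from collections import deque


class BoundedPort:
    """Fixed-capacity queue standing in for a connection between two operators."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()

    def has_space(self) -> bool:
        # Analogue of a "downstream can afford a message" check on the producer.
        return len(self.items) < self.capacity

    def has_message(self) -> bool:
        # Analogue of a "message is available" check on the consumer.
        return len(self.items) > 0


def run(ticks: int, capacity: int = 2, consume_every: int = 3):
    """Producer is ready every tick; consumer is only ready every few ticks."""
    port = BoundedPort(capacity)
    produced = consumed = held_back = 0
    for tick in range(ticks):
        if port.has_space():
            port.items.append(produced)
            produced += 1
        else:
            held_back += 1  # producer does not fire: backpressure
        if tick % consume_every == 0 and port.has_message():
            port.items.popleft()
            consumed += 1
    return produced, consumed, held_back, len(port.items)


produced, consumed, held_back, queued = run(ticks=9)
print(produced, consumed, held_back, queued)
```

The queue never grows past its capacity: when the consumer falls behind, the producer is simply not scheduled, which is the behavior the two conditions provide in a real graph.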

Zero-Copy Buffer Handoff

Holoscan tensors wrap a CUDA device pointer with shape and dtype metadata. When two operators on the same GPU exchange tensors, the handoff is a pointer swap, not a memcpy. On a single GPU running a full decode-preprocess-infer-postprocess pipeline, the inter-operator latency is under 1ms even for large tensors.

Copies to CPU only happen when explicitly requested: writing to disk, sending over a network socket, or handing off to a CPU-based postprocessing operator. Design your pipeline to keep data on the GPU from ingest to output whenever possible.
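The difference between a pointer handoff and a copy can be illustrated on the CPU with Python's built-in `memoryview`. This is only an analogy for the GPU behavior described above, not Holoscan tensor code; the function names are made up for the example.

```python
# Stand-in for one 1080p RGB frame buffer.
frame = bytearray(1920 * 1080 * 3)


def handoff_zero_copy(buf):
    """Pass a view: the receiver sees the same underlying memory."""
    return memoryview(buf)


def handoff_copy(buf):
    """Pass a copy: the receiver gets its own buffer, as a memcpy would produce."""
    return bytes(buf)


view = handoff_zero_copy(frame)
copy = handoff_copy(frame)

frame[0] = 255  # upstream operator writes into the shared buffer

print(view[0])  # 255: the view shares memory with `frame`
print(copy[0])  # 0: the copy was taken before the write
```

The view costs a few bytes of metadata regardless of frame size, while the copy costs time and memory proportional to the frame. That is the same trade-off the zero-copy handoff avoids between operators.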

Pipeline Topology

Linear pipelines are the most common pattern: source, decode, preprocess, inference, postprocess, sink. Most medical imaging cases fit this structure.

Branching pipelines fan out from one source to multiple parallel operators. A surgical video stream running simultaneous polyp detection and instrument tracking, for example:

VideoStreamReplayerOp
        |
  FormatConverterOp
      /       \
InferenceOp   InferenceOp
(polyp det.)  (instrument)
      \       /
   ResultMergeOp
        |
    CustomSinkOp

The multi-thread scheduler runs both inference branches in parallel, cutting total latency compared to a sequential linear graph.
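A rough way to reason about the gain is to compare the sum of all operator latencies (sequential execution) with the longest path through the graph (ideal parallel execution). The sketch below encodes the diagram above with shortened node names and hypothetical, unmeasured latencies.

```python
from functools import lru_cache

# Edges of the branching graph shown above (names shortened for readability).
EDGES = {
    "replayer": ["converter"],
    "converter": ["polyp", "instrument"],
    "polyp": ["merge"],
    "instrument": ["merge"],
    "merge": ["sink"],
    "sink": [],
}

# Hypothetical per-operator latencies in milliseconds (not measured).
LATENCY_MS = {
    "replayer": 0.5,
    "converter": 1.0,
    "polyp": 4.0,
    "instrument": 3.0,
    "merge": 0.2,
    "sink": 0.3,
}


def sequential_latency() -> float:
    """Latency if every operator runs one after another."""
    return sum(LATENCY_MS.values())


@lru_cache(maxsize=None)
def critical_path(node: str = "replayer") -> float:
    """Latency of the longest path from `node` to the sink."""
    downstream = [critical_path(n) for n in EDGES[node]]
    return LATENCY_MS[node] + (max(downstream) if downstream else 0.0)


print(f"sequential:    {sequential_latency():.1f} ms")
print(f"critical path: {critical_path():.1f} ms")
```

The critical path is a best case. Two inference branches on one GPU still share SMs and memory bandwidth, so the measured saving is usually smaller than the difference shown here.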

Sample Pipelines

1. Endoscopy Polyp Detection

  • Input: Surgical video stream (720p/1080p, 30-60fps)
  • Decode: NVDEC hardware decode via VideoStreamReplayerOp or GStreamerVideoSourceOp
  • Preprocess: Resize and normalize to 256x256 or 512x512 (FormatConverterOp)
  • Inference: ResNet or EfficientDet polyp segmentation model via InferenceOp (TensorRT FP16)
  • Postprocess: Threshold and bounding box extraction
  • Output: Annotated video stream or JSON result records
  • Target GPU: L40S (AV1/H.265 hardware decode, 48GB GDDR6 for large batch development runs)
  • Latency expectation: 3-8ms per frame on L40S in production Holoscan mode

2. Ultrasound Real-Time Segmentation

  • Input: Ultrasound probe raw data or B-mode image stream
  • Pipeline: Similar decode-infer structure with DNN-based segmentation (U-Net, nnU-Net)
  • Key characteristic: Ultrasound models tend to be small (under 50M parameters), so inference is not the bottleneck. I/O and preprocessing dominate the frame budget.
  • Cloud use case: Training segmentation models on large DICOM archives is where cloud GPU adds most value, not live bedside inference where edge hardware is required.

3. Industrial Defect Inspection

  • Input: Line-scan or area-scan industrial camera (GigE Vision or Camera Link via sensor bridge)
  • Model: Anomaly detection (PatchCore, EfficientAD) or classification CNNs
  • Output: Pass/fail JSON to PLC or SCADA system
  • Cloud use case: Reprocessing historical inspection footage for model retraining; A/B testing new models before deployment to factory floor where downtime costs are high.

Hardware Guide: L40S, RTX Pro 6000, and H100 for Holoscan

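GPU          | VRAM       | Arch         | FP32 TFLOPS | HW Video Decode    | Best For
-------------|------------|--------------|-------------|--------------------|---------------------------------------------
L40S         | 48GB GDDR6 | Ada Lovelace | 91.6        | NVDEC (AV1, H.265) | Holoscan development + medium vision models
RTX Pro 6000 | 96GB GDDR7 | Blackwell    | ~125        | NVDEC (AV1, H.265) | Large vision models, long batch reprocessing
H100 SXM5    | 80GB HBM3  | Hopper       | 67          | NVDEC              | Foundation model inference stages in the graph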

Sizing note: Most single-channel Holoscan inference pipelines use 8-12GB VRAM for the model and working tensors. The L40S's 48GB headroom supports running 4-6 concurrent channel pipelines in development, or large batch reprocessing jobs where you load a full model plus multiple input batches simultaneously. The H100's HBM3 bandwidth advantage (3.35 TB/s versus 864 GB/s on the L40S) matters primarily when the model exceeds 1B parameters and becomes memory-bandwidth bound, which is not typical for sensor AI models.
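The 4-6 channel figure is a straightforward division of VRAM by per-channel footprint. A small sketch, where `reserve_gb` is an assumed knob for memory you want to keep free and is not something the sizing note specifies:

```python
def max_channels(vram_gb: float, per_channel_gb: float, reserve_gb: float = 0.0) -> int:
    """How many independent pipelines fit in VRAM at a given per-channel footprint."""
    if per_channel_gb <= 0:
        raise ValueError("per_channel_gb must be positive")
    usable = vram_gb - reserve_gb
    return max(0, int(usable // per_channel_gb))


# L40S (48GB) at the 8-12GB per-channel range quoted above.
for footprint in (8, 12):
    print(f"{footprint} GB per channel -> {max_channels(48, footprint)} channels")
```

This counts memory only. Whether that many channels also meet their frame deadlines depends on compute contention, which the multi-tenant section covers.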

The RTX Pro 6000's 96GB GDDR7 at ~125 TFLOPS makes it the right call for large vision foundation models embedded as one stage in a longer graph, or for developers who want extra VRAM headroom during research and model prototyping.

Cloud vs Edge: When to Run Holoscan on GPU Cloud

Three distinct cloud use cases:

Model development and training. Recorded sensor archives (endoscopy video datasets, ultrasound DICOM collections) are too large for on-device storage and require too much compute for on-device training. Run preprocessing pipelines and training jobs on cloud GPUs, then export quantized TensorRT models to deploy on edge devices for bedside or factory-floor inference.

Batch reprocessing. Historical sensor recordings need to be run through updated models for quality review, labeling, or clinical validation studies. Cloud GPU processes data faster than real-time. A 1-hour surgical recording runs through a production Holoscan polyp detection graph in under 10 minutes on an L40S, compared to 60 minutes of actual recording time. For large archives, that throughput multiplier directly cuts project timelines.

Fleet inference. Distributed sensor networks (100+ factory cameras, remote clinical sites) where shipping an IGX Orin to each location is cost-prohibitive. Cloud GPU instances handle inference for fleet streams over RTSP/RTMP. Latency SLAs above ~50ms round-trip are achievable on well-connected cloud instances; below that, on-device is required.

When on-device is mandatory, including direct surgical feedback at sub-5ms latency, air-gapped regulatory environments, or always-on inference at sites without reliable network connectivity, IGX Orin is the correct target. See the hybrid cloud edge AI inference guide for the full decision framework covering when each tier makes sense.
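The placement rules in this section can be summarized as a small decision helper. This is a simplification for illustration: the `Workload` fields and the hard 50ms cutoff are assumptions drawn from the approximate figures above, not a formal policy.

```python
from dataclasses import dataclass

# Approximate round-trip floor for cloud inference quoted in this section.
CLOUD_MIN_SLA_MS = 50.0


@dataclass
class Workload:
    latency_sla_ms: float
    air_gapped: bool = False
    reliable_network: bool = True


def placement(w: Workload) -> str:
    """Return 'edge' or 'cloud' for a workload, following the rules above."""
    if w.air_gapped or not w.reliable_network:
        return "edge"  # cloud is not reachable or not permitted
    if w.latency_sla_ms < CLOUD_MIN_SLA_MS:
        return "edge"  # network round trip alone would exceed the SLA
    return "cloud"


print(placement(Workload(latency_sla_ms=5)))                     # surgical feedback
print(placement(Workload(latency_sla_ms=200)))                   # fleet inspection
print(placement(Workload(latency_sla_ms=200, air_gapped=True)))  # regulated site
```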

Deploying Holoscan on Spheron GPU Cloud

Step 1: Provision Your Instance

Provision an L40S GPU rental on Spheron for vision-focused development workloads including endoscopy video pipelines, ultrasound segmentation, and industrial inspection. Use an H100 GPU rental on Spheron when your operator graph includes a large foundation model inference stage (1B+ parameters) where HBM3 bandwidth matters. The Spheron getting started guide walks through account creation, billing setup, and SSH configuration if this is your first deployment.

Choose on-demand for interactive development and iterating on operator graphs; use spot for overnight batch reprocessing of historical sensor archives.

Step 2: Pull the Holoscan Container from NGC

bash
docker pull nvcr.io/nvidia/clara-holoscan/holoscan:v4.2.0-cuda12-dgpu
# Check the NGC catalog for the latest release tag: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara-holoscan/containers/holoscan
# No NGC account required for public container pulls

For HoloHub operators, clone the repository and mount it at runtime:

bash
git clone https://github.com/nvidia-holoscan/holohub
# Mount at runtime:
# docker run --gpus all -v $(pwd)/holohub:/workspace/holohub ...

Step 3: Build a Minimal Python Pipeline

This is a minimal complete Holoscan application that replays a recorded video, runs inference, and writes results to JSON:

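python
import json

from holoscan.core import Application, Operator
from holoscan.operators import (
    FormatConverterOp,
    InferenceOp,
    VideoStreamReplayerOp,
)
from holoscan.resources import UnboundedAllocator


class ResultSinkOp(Operator):
    """Custom sink: append one JSON record per frame to a JSONL file."""

    def __init__(self, fragment, *args, output_path="/mnt/output/results.jsonl", **kwargs):
        # Attributes must be set before calling the base-class constructor.
        self.output_path = output_path
        self.frame_index = 0
        super().__init__(fragment, *args, **kwargs)

    def setup(self, spec):
        spec.input("in")

    def compute(self, op_input, op_output, context):
        value = op_input.receive("in")
        # Placeholder serialization: replace str(value) with real postprocessing
        # (thresholding, box extraction) for your model's output tensors.
        record = {"frame": self.frame_index, "result": str(value)}
        with open(self.output_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        self.frame_index += 1


class EndoscopyApp(Application):
    def compose(self):
        # FormatConverterOp and InferenceOp both need an allocator for output tensors.
        pool = UnboundedAllocator(self, name="pool")

        replayer = VideoStreamReplayerOp(
            self,
            name="replayer",
            directory="/mnt/data/surgical_video",
            basename="video",
        )
        converter = FormatConverterOp(
            self,
            name="converter",
            pool=pool,
            in_dtype="rgb888",
            out_dtype="float32",
            out_tensor_name="source_video",
            resize_width=512,
            resize_height=512,
        )
        inferencer = InferenceOp(
            self,
            name="inferencer",
            backend="trt",
            allocator=pool,
            model_path_map={"polyp_model": "/mnt/models/polyp_det_fp16.engine"},
            is_engine_path=True,  # the path above is a prebuilt engine, not ONNX
            # Tensor names are placeholders; match them to your model's bindings.
            pre_processor_map={"polyp_model": ["source_video"]},
            inference_map={"polyp_model": ["output"]},
        )
        sink = ResultSinkOp(self, name="sink")

        self.add_flow(replayer, converter)
        self.add_flow(converter, inferencer, {("tensor", "receivers")})
        self.add_flow(inferencer, sink, {("transmitter", "in")})


if __name__ == "__main__":
    app = EndoscopyApp()
    app.run()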

Step 4: GStreamer Ingestion for Live Streams

For live RTSP camera streams, use GStreamer with NVDEC hardware decode:

bash
# Test the GStreamer pipeline from the command line first
gst-launch-1.0 \
  rtspsrc location=rtsp://camera-ip:8554/stream latency=100 ! \
  rtph264depay ! h264parse ! \
  nvh264dec ! \
  nvvideoconvert ! \
  video/x-raw,format=RGB ! \
  appsink name=holoscan_sink max-buffers=1 drop=true

Configure the GStreamerVideoSourceOp from HoloHub in your operator graph to consume from this pipeline. For cloud batch reprocessing, skip GStreamer entirely and use VideoStreamReplayerOp reading from NVMe-mounted files, which avoids the network variable.

Step 5: Headless Cloud Run and Result Export

Disable HolovizOp for headless cloud runs. Replace it with a custom sink writing results to JSON or pushing to a message queue:

python
# In compose(), replace HolovizOp with a custom sink
# For production headless runs:
sink = ResultSinkOp(self, name="sink")

# Mount a Spheron persistent volume for output:
# docker run --gpus all \
#   -v /mnt/spheron-pv/output:/mnt/output \
#   -v /mnt/nvme/data:/mnt/data \
#   holoscan-app python endoscopy_app.py

Monitor GPU utilization with nvidia-smi dmon -s u while the pipeline runs. A well-optimized Holoscan graph runs the inference operator at 80-95% SM utilization on a single channel. If you see utilization below 60%, the bottleneck is likely I/O (switch to NVMe staging) or a CPU-bound preprocessing step.
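The thresholds above can be turned into a quick triage helper. The sketch assumes you have already collected SM utilization percentages (for example by sampling `nvidia-smi dmon`); the labels and the 80% lower target are illustrative choices based on the ranges in the text.

```python
from statistics import mean

LOW_UTIL = 60.0     # below this, the text points to an I/O or CPU bottleneck
TARGET_UTIL = 80.0  # lower edge of the 80-95% range quoted above


def diagnose(sm_util_samples):
    """Classify a run from a list of SM utilization percentages."""
    if not sm_util_samples:
        raise ValueError("need at least one sample")
    avg = mean(sm_util_samples)
    if avg < LOW_UTIL:
        return avg, "starved: check I/O staging and CPU-bound preprocessing"
    if avg < TARGET_UTIL:
        return avg, "acceptable: some headroom left"
    return avg, "well-utilized"


# Hypothetical samples from two runs.
for samples in ([42, 55, 48, 51], [88, 91, 86, 90]):
    avg, verdict = diagnose(samples)
    print(f"avg {avg:.0f}% -> {verdict}")
```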

Multi-Tenant Inference: Running Multiple Channels on One GPU

For fleet inference scenarios with multiple concurrent camera channels, the L40S's 48GB VRAM supports running 4-6 independent pipelines simultaneously.

Each pipeline instance is a separate process with its own CUDA context. Without any coordination, concurrent contexts compete for the GPU's SM resources and each gets a fraction of the available compute.

CUDA MPS (Multi-Process Service) improves aggregate utilization significantly. When two pipeline processes each run at 45% GPU utilization without MPS, aggregate utilization is around 55% due to scheduling gaps. With MPS enabled, the same two processes share the GPU's SM resources at the kernel level, pushing aggregate utilization to 80-90%.

Enable MPS before starting your pipeline processes:

bash
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

Cost implication: one L40S at the spot rate of $1.03/hr handles 4 independent 8GB camera channels, versus $4.12/hr for 4 separate single-GPU spot nodes doing the same work. For fleet inference with latency SLAs above 20ms, multi-tenant MPS on a single L40S is the right architecture.
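The cost comparison is simple enough to reproduce. A minimal sketch using the spot rate quoted above; `channels_per_gpu` is the packing factor you choose and is an assumption of the example:

```python
def gpus_needed(channels: int, channels_per_gpu: int) -> int:
    """Number of GPUs required, rounding up to whole instances."""
    if channels_per_gpu <= 0:
        raise ValueError("channels_per_gpu must be positive")
    return -(-channels // channels_per_gpu)  # ceiling division


def hourly_cost(channels: int, channels_per_gpu: int, gpu_rate: float) -> float:
    """Hourly spend for a fleet at a given packing factor."""
    return gpus_needed(channels, channels_per_gpu) * gpu_rate


L40S_SPOT = 1.03  # $/hr, from the pricing in this article

shared = hourly_cost(channels=4, channels_per_gpu=4, gpu_rate=L40S_SPOT)
dedicated = hourly_cost(channels=4, channels_per_gpu=1, gpu_rate=L40S_SPOT)
print(f"shared GPU with MPS: ${shared:.2f}/hr")
print(f"one GPU per channel: ${dedicated:.2f}/hr")
```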

Regulatory Considerations: HIPAA, FDA SaMD, and EU MDR

The following is operational guidance based on publicly available regulatory documents, not legal advice. Consult qualified regulatory counsel before deploying Holoscan pipelines in a regulated medical context.

HIPAA (US Medical AI)

Any Holoscan pipeline processing protected health information (PHI), including patient images or video containing patient identifiers, requires a Business Associate Agreement (BAA) with the cloud provider.

Operational requirements:

  • Data encrypted at rest (AES-256) and in transit (TLS 1.2+)
  • Audit logs for all access to PHI workloads
  • Dedicated compute where possible (no shared hypervisor, no shared memory with other tenants) to reduce attack surface

Bare-metal GPU instances provide stronger PHI isolation than shared virtual instances. Each Spheron instance runs on dedicated hardware without a hypervisor layer between your workload and the GPU.

FDA Software as a Medical Device (SaMD)

Holoscan-based inference intended to diagnose, treat, cure, mitigate, or prevent a medical condition qualifies as SaMD under FDA guidance.

Key considerations:

  • FDA's PCCP (Predetermined Change Control Plan) pathway allows AI/ML-based SaMD to update models without a new 510(k) if the change plan is pre-approved. Plan for this from the start if you expect to iterate on models in production.
  • Cloud-hosted inference is a new software architecture component that must appear in the device's Software Bill of Materials (SBOM) and risk management file.
  • Cloud hosting does not automatically create new regulatory obligations, but the cloud environment must be included in software validation documentation.

EU MDR and GDPR

EU medical devices under MDR (2017/745) and IVDR (2017/746) require EU data residency for patient data processed in the cloud.

  • GDPR Article 46 transfers (standard contractual clauses) are insufficient for clinical AI on their own. MDR Article 10 requires comprehensive technical documentation including software validation that covers the cloud deployment environment.
  • Spheron's EU-region instances and confidential compute options with hardware-attested TEE environments address data residency and isolation requirements for regulated Holoscan deployments.
  • For privacy-preserving AI approaches that reduce the PHI surface, see the federated learning GPU cloud guide.

Regulation    | Trigger                          | Cloud Requirement
--------------|----------------------------------|-------------------------------------------------
HIPAA         | US patient data (PHI)            | BAA, encryption at rest/transit, audit logs
FDA SaMD      | Diagnostic/therapeutic AI output | SBOM inclusion, software validation, PCCP plan
EU MDR / GDPR | EU patient data                  | EU data residency, TEE isolation, MDR Art. 10 docs

GPU Pricing for Holoscan Workloads

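GPU          | VRAM | On-Demand ($/hr) | Spot ($/hr) | Best Holoscan Use
-------------|------|------------------|-------------|---------------------------------------------
L40S         | 48GB | $1.99            | $1.03       | Development, multi-channel fleet inference
RTX Pro 6000 | 96GB | $1.77            | $0.59       | Large vision models, long batch reprocessing
H100 SXM5    | 80GB | $3.90            | $0.80       | Foundation model inference stages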

Pricing fluctuates based on GPU availability. The prices above are as of 18 May 2026 and may have changed. Check the current GPU pricing page for live rates.

Production Checklist

  1. GPU driver version: Holoscan requires CUDA 12.x or CUDA 13.x. Verify nvidia-smi shows driver 535 or newer before starting (per the NVIDIA Holoscan SDK installation guide). Holoscan ships both CUDA 12 and CUDA 13 container variants; pick the one that matches your driver.
  2. Container runtime: Install nvidia-container-toolkit for GPU passthrough into Docker containers.
  3. NVMe staging: Mount a fast NVMe volume for sensor data input. Object storage adds I/O latency that caps pipeline throughput, particularly for high-fps video streams.
  4. Headless mode: Disable HolovizOp for cloud deployments. Replace with a custom sink writing JSON or pushing to a message queue.
  5. MPS for multi-channel: Enable CUDA MPS when running 2+ concurrent pipeline processes on a single GPU. Without MPS, processes compete for SM resources and aggregate utilization drops.
  6. Monitoring: Use nvidia-smi dmon or the Prometheus/DCGM exporter. See the GPU monitoring guide for setup. Target SM utilization of 60-90% per pipeline instance for a well-tuned Holoscan graph.
  7. Model export: Export PyTorch or TensorFlow models to ONNX, then build a TensorRT engine from the ONNX file with trtexec before embedding it in Holoscan. TensorRT FP16 provides 2-4x throughput improvement over ONNX Runtime on the same GPU for typical sensor AI model sizes.
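The first checklist item can be automated as a preflight check. The sketch below compares the driver's major version against the 535 minimum mentioned above; the helper names are made up for the example, and the script returns `None` rather than failing when `nvidia-smi` is not installed.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 535  # minimum driver branch from the checklist above


def driver_major(version: str) -> int:
    """Extract the major component from a version string like '535.104.05'."""
    return int(version.strip().split(".")[0])


def driver_ok(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    return driver_major(version) >= minimum


def query_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if result.returncode == 0 and lines else None


version = query_driver_version()
if version is None:
    print("nvidia-smi not found or no GPU visible")
else:
    print(f"driver {version}: {'ok' if driver_ok(version) else 'too old'}")
```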

Cost Example: Batch Reprocessing a 10,000-Hour Endoscopy Archive

10,000 hours of surgical video at 30fps equals 1.08 billion total frames. A polyp detection pipeline on the L40S processes 1080p frames in roughly 5ms, giving ~200fps throughput. That is a 6.67x speedup over real-time.

Processing time for the full archive: 10,000 hours / 6.67 = ~1,500 hours on a single L40S.

Configuration      | Wall-Clock Time | Total Cost (L40S spot at $1.03/hr)
-------------------|-----------------|-----------------------------------
1x L40S            | ~1,500 hours    | ~$1,545
8x L40S (parallel) | ~187.5 hours    | ~$1,545

The total compute cost is the same either way. The 8-instance configuration cuts wall-clock time from 63 days to under 8 days, which matters when a clinical validation study has a deadline.
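The figures in this example can be reproduced, and adapted to other archive sizes or GPU counts, with a few lines of arithmetic. The function below is a sketch that assumes throughput scales linearly with the number of instances and ignores startup and data-transfer overhead.

```python
def reprocessing_plan(archive_hours, source_fps, pipeline_fps, gpu_rate, gpus=1):
    """Estimate frames, GPU-hours, wall-clock time, and cost for a batch job."""
    speedup = pipeline_fps / source_fps    # how much faster than real time
    gpu_hours = archive_hours / speedup    # total compute, independent of GPU count
    return {
        "frames": int(archive_hours * 3600 * source_fps),
        "speedup": speedup,
        "gpu_hours": gpu_hours,
        "wall_clock_hours": gpu_hours / gpus,
        "cost_usd": gpu_hours * gpu_rate,
    }


for gpus in (1, 8):
    plan = reprocessing_plan(
        archive_hours=10_000, source_fps=30, pipeline_fps=200,
        gpu_rate=1.03, gpus=gpus,
    )
    print(
        f"{gpus}x L40S: {plan['wall_clock_hours']:.1f} h wall clock "
        f"({plan['wall_clock_hours'] / 24:.1f} days), ${plan['cost_usd']:.0f}"
    )
```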

Spot instances apply for batch reprocessing jobs since they tolerate interruption. With a simple checkpoint that tracks which video segments have been processed, an interrupted L40S spot instance picks up where it left off on the next run.
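One possible shape for such a checkpoint is a JSON file listing finished segments, rewritten atomically after each one. This is a sketch, not the method used for the figures above: the file layout, the function names, and the simulated interruption are all assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def load_done(checkpoint):
    """Return the set of segment names already processed."""
    path = Path(checkpoint)
    return set(json.loads(path.read_text())) if path.exists() else set()


def mark_done(checkpoint, done, segment):
    """Record a finished segment via a temp file so a crash cannot truncate it."""
    done.add(segment)
    tmp = Path(str(checkpoint) + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    os.replace(tmp, checkpoint)  # atomic rename on the same filesystem


def process_archive(segments, checkpoint, process_fn):
    """Run process_fn on every segment that is not yet in the checkpoint."""
    done = load_done(checkpoint)
    for segment in segments:
        if segment in done:
            continue
        process_fn(segment)
        mark_done(checkpoint, done, segment)
    return done


# Demo: the first run is "interrupted" on the third segment; the second resumes.
segments = [f"segment_{i:03d}.mp4" for i in range(5)]
processed = []
state = {"interrupted": False}


def flaky(segment):
    if segment == "segment_002.mp4" and not state["interrupted"]:
        state["interrupted"] = True
        raise RuntimeError("spot instance reclaimed")
    processed.append(segment)


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = os.path.join(tmpdir, "done.json")
    try:
        process_archive(segments, ckpt, flaky)
    except RuntimeError:
        pass  # simulate the instance going away mid-run
    finished = process_archive(segments, ckpt, flaky)

print(processed)
```

A segment that was in flight when the interruption hit is reprocessed from its start, so the segment length bounds how much work a reclaim can waste.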

Spheron's bare-metal L40S and H100 instances give consistent throughput without noisy-neighbor variability, which matters when you're planning around a fixed project timeline. EU data-residency options support MDR-regulated medical imaging workloads without separate data transfer arrangements.


Holoscan pipelines need bare-metal GPU access to hit the sub-10ms frame latency that sensor AI requires. Spheron's L40S and H100 instances give you dedicated GPU hardware with EU data-residency options for regulated medical and industrial deployments.

Quick Setup Guide

  1. Provision an L40S or H100 instance on Spheron

    Log into app.spheron.ai and provision an L40S 48GB for vision-focused Holoscan pipelines (endoscopy video, ultrasound, industrial cameras) or an H100 SXM5 80GB when the graph includes a large foundation model inference stage. Choose on-demand for interactive development and testing; use spot for overnight batch reprocessing of historical sensor recordings.

  2. Pull the Holoscan container from NGC

    Run: docker pull nvcr.io/nvidia/clara-holoscan/holoscan:<version>. The latest release tag is found on the NVIDIA NGC catalog page for Clara Holoscan. No NGC account is required for public container pulls. For the HoloHub operator extensions, clone https://github.com/nvidia-holoscan/holohub and mount it into the container at runtime.

  3. Configure your operator graph in Python or C++

    Define your pipeline by subclassing holoscan.core.Application. Add operators for video input (VideoStreamReplayerOp), preprocessing (FormatConverterOp), inference (InferenceOp with a TensorRT or ONNX model), and visualization or output (HolovizOp or custom sink). Connect operators with add_flow(). For cloud-native deployments without a physical sensor, use VideoStreamReplayerOp to replay recorded sensor data from disk or object storage.

  4. Configure GStreamer ingestion for live or streamed sensor data

    For live camera streams, use GStreamer with an RTSP source and nvh264dec for hardware-accelerated decode on dGPU cloud instances (nvh264dec is from gst-plugins-nvcodec and works on x86_64 L40S/H100; nvv4l2decoder is Jetson-only and will fail on cloud). Pass decoded frames to Holoscan via the GStreamerVideoSourceOp from HoloHub. For cloud batch reprocessing, read from NVMe-mounted video files using VideoStreamReplayerOp. Ensure the decode step runs on the GPU (NVDEC) rather than CPU to avoid introducing latency in the pipeline.

  5. Run the pipeline and export results

    Launch the application with python my_pipeline.py or ./my_pipeline from within the container. For headless cloud runs, disable the HolovizOp visualizer or replace it with a custom sink that writes inference results to JSON, HL7 FHIR messages, or database records. Monitor GPU utilization with nvidia-smi dmon; a well-optimized Holoscan pipeline typically runs the inference operator at 80-95% SM utilization.

Frequently Asked Questions

What is NVIDIA Holoscan?

NVIDIA Holoscan is a streaming sensor AI SDK designed for real-time medical imaging and industrial inspection pipelines. It runs low-latency operator graphs on NVIDIA GPUs, enabling applications like endoscopy AI, ultrasound segmentation, radar processing, and factory defect detection at sub-10ms end-to-end latency.

How does Holoscan differ from Triton?

Holoscan is an event-driven streaming pipeline executor designed for continuous sensor data (video frames, ultrasound samples, radar returns). Triton is a request/response model server optimized for batched inference of discrete requests. Use Holoscan when the sensor data is continuous and latency per frame matters. Use Triton when you have discrete inference requests and want multi-model batching and concurrency management.

Which GPU is best for Holoscan on cloud?

The L40S is the primary cloud GPU for Holoscan workloads: 48GB GDDR6 with AV1 hardware encode/decode and strong FP32 throughput for vision models. The RTX Pro 6000 is a close second, with 96GB GDDR7 and ECC memory. The H100 SXM5 is overkill for most Holoscan sensor pipelines but suits large foundation model inference stages embedded in the graph.

Can Holoscan run in a Docker container on a cloud GPU instance?

Yes. NVIDIA publishes official Holoscan container images on NGC (nvcr.io/nvidia/clara-holoscan/holoscan). The container bundles the GXF runtime and the Holoscan SDK; HoloHub operators are cloned separately and mounted at runtime. Pull it on any NVIDIA-GPU-equipped instance with CUDA 12.x and Docker 24+, then run your operator graph with the holoscan run command or Python API.

What compliance requirements apply to medical imaging workloads on GPU cloud?

Any Holoscan deployment processing protected health information (PHI) in the US requires a Business Associate Agreement (BAA) with the cloud provider. Data must be encrypted in transit and at rest. EU deployments processing patient data under GDPR/EU MDR need to ensure data residency within the EEA. GPU cloud providers with EU data-residency guarantees and audit logging are required for regulated medical imaging workloads.
