Tutorial

Deploy 3D Gaussian Splatting on GPU Cloud: Real-Time Radiance Field Rendering for AR, VR, and Robotics (2026 Guide)

Written by Mitrasish, Co-founder | May 8, 2026

NeRF made photoreal scene reconstruction viable, but it took days to train and seconds to render a single frame, which ruled it out for production AR/VR and robotics pipelines. 3D Gaussian Splatting changed that: training in hours, rendering in real time, and scaling to the kinds of large scenes that matter for autonomous driving and embodied AI. The catch is that both training and serving are GPU-bound, and their GPU profiles differ. This guide covers how to run a 3DGS production pipeline on GPU cloud, from COLMAP preprocessing through rendering server deployment.

What Is 3D Gaussian Splatting and Why It Replaced NeRF

Instead of a neural network queried per ray, a 3DGS scene is represented as a set of 3D Gaussians. Each Gaussian is a colored, semi-transparent ellipsoid with position, covariance, opacity, and spherical harmonic color coefficients encoding view-dependent appearance. Rendering is differentiable rasterization: Gaussians are projected onto the image plane and alpha-composited front-to-back. Backpropagation adjusts positions, shapes, and colors to minimize photometric loss against training views.
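As a toy illustration of that compositing step, here is a pure-Python sketch of front-to-back alpha blending for one pixel, assuming each Gaussian covering the pixel has already been projected and reduced to a depth, color, and effective alpha. This stands in for the tile-based CUDA rasterizer, which batches the same accumulation over 16x16 pixel tiles:

```python
def composite_pixel(splats):
    """splats: list of (depth, color, alpha); color is an (r, g, b) tuple
    in [0, 1], alpha is the Gaussian's opacity after 2D falloff."""
    # Sort near-to-far so closer Gaussians occlude farther ones.
    splats = sorted(splats, key=lambda s: s[0])
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still unblocked
    for _, c, a in splats:
        w = a * transmittance
        for i in range(3):
            color[i] += w * c[i]
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:  # early termination, as in tile rasterizers
            break
    return color, transmittance
```

Because the weights and the loop are simple arithmetic, the whole pass is differentiable with respect to every Gaussian parameter, which is what makes end-to-end optimization possible.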

The original SIGGRAPH 2023 paper from INRIA introduced both the representation and a custom CUDA rasterizer. Within months of publication, the method had production-ready implementations across several frameworks and had replaced NeRF as the default approach for most scene reconstruction tasks.

What Makes 3DGS Faster Than NeRF

Two architectural differences explain the speedup. First, there is no neural network forward pass during rendering. Gaussian rasterization is pure computation: project, sort by depth, composite. Render time scales with Gaussian count, not network depth. Second, training uses densification heuristics rather than volume sampling. Adaptive density control adds Gaussians in under-reconstructed regions and removes low-opacity ones, so the optimizer converges in 30,000 iterations (2-4 hours on an H100) versus 100k+ iterations for NeRF.
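The densify-and-prune loop can be sketched in a few lines. The thresholds below (positional-gradient 0.0002, minimum opacity 0.005) follow the reference implementation's defaults, but the clone logic is a simplification: the real method clones small Gaussians and splits large ones into two smaller copies.

```python
def densify_and_prune(gaussians, grad_thresh=0.0002, min_opacity=0.005):
    """gaussians: list of dicts with 'pos', 'opacity', and 'grad' (the
    accumulated view-space positional gradient magnitude).
    Toy sketch of adaptive density control, run every few hundred steps."""
    out = []
    for g in gaussians:
        if g["opacity"] < min_opacity:
            continue  # prune near-transparent Gaussians
        out.append(g)
        if g["grad"] > grad_thresh:
            # Large positional gradient signals an under-reconstructed
            # region: add another Gaussian there (clone as a stand-in
            # for the paper's clone-or-split rule).
            out.append(dict(g))
    return out
```

The net effect is that model capacity goes where the photometric error is, instead of being spread uniformly through space as in NeRF's volume sampling.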

At render time on modern GPUs, 3DGS achieves 30-120 FPS at 1080p for scenes in the 1-5 million Gaussian range. NeRF approaches with similar quality require seconds per frame.

Limitations to Know Before Production Use

Four constraints matter for production planning:

  1. Gaussian count grows with scene complexity. Large outdoor captures and autonomous driving datasets produce 10-20 million Gaussians. This directly drives VRAM requirements during training and serving.
  2. No implicit geometry. There is no mesh or watertight surface. Downstream tasks that need geometry (collision detection, physics simulation integration) require additional mesh extraction via post-processing (e.g., SuGaR, GOF).
  3. Dynamic scene support requires extensions. Static 3DGS cannot handle moving objects. 4D-GS and Deformable 3D Gaussians add a temporal dimension for dynamic content.
  4. VRAM-bound training. The Gaussian parameter store lives on GPU. Training large scenes exhausts VRAM faster than most deep learning workloads because the Gaussian count can reach tens of millions.
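A back-of-envelope VRAM estimator makes the scaling concrete. The per-Gaussian parameter count below assumes the standard degree-3 spherical harmonics configuration; the 4x training multiplier (gradients plus Adam's two moment buffers) is a rough rule of thumb that ignores rasterizer workspace and image buffers, which add several more gigabytes in practice:

```python
def gaussian_vram_gb(n_gaussians, training=True, sh_degree=3):
    """Rough VRAM estimate for the Gaussian parameter store alone."""
    # float32 parameters per Gaussian: position (3) + scale (3) +
    # rotation quaternion (4) + opacity (1) + SH color coefficients
    # (3 channels x (sh_degree + 1)^2) -> 59 for degree 3
    n_params = 3 + 3 + 4 + 1 + 3 * (sh_degree + 1) ** 2
    bytes_per = n_params * 4
    if training:
        # gradients + Adam's exp_avg and exp_avg_sq roughly quadruple
        # the parameter memory
        bytes_per *= 4
    return n_gaussians * bytes_per / 1024**3
```

For 10 million Gaussians this gives roughly 2.2 GB at serve time and about 8.8 GB of parameter-related memory during training, consistent with the VRAM table once workspace overhead is added.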

GPU Requirements: VRAM Scaling with Scene Complexity

This is the most important section for cloud instance selection. The right GPU depends almost entirely on your scene scale.

Training VRAM Table

| Scene | Images | Est. Gaussians | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| Small interior | 200 | 500K | 8 GB | RTX 4090 |
| Medium outdoor | 1,000 | 2-3M | 24 GB | A100 40GB |
| Large outdoor | 5,000 | 5-8M | 40-80 GB | A100 80GB |
| AV dataset (multi-camera) | 100,000 | 10-20M | 80 GB+ | H100 SXM5 |
| 4D-GS dynamic scene | 10,000 | 15M+ | 80 GB+ | H100 SXM5 |

Note that gsplat supports multi-GPU training via DDP for the largest scenes. For scenes with 10M+ Gaussians, distributing training across 4-8 H100s cuts wall-clock time roughly proportionally and removes the single-card VRAM ceiling.

Training vs Rendering: Two Different GPU Profiles

Training and serving have fundamentally different GPU requirements, and getting this split right is how you avoid over-spending.

Training is memory-bandwidth and VRAM-bound. The optimizer tracks positions, covariances, opacity, and spherical harmonic coefficients for millions of Gaussians, plus gradients and optimizer state. HBM-based GPUs (A100, H100) are the right choice: they combine high VRAM capacity with 2-3 TB/s memory bandwidth that keeps the densification and rendering steps fed.

Rendering is also compute and bandwidth-bound, but the VRAM requirement is much lower. At serve time, you only need to load the PLY file plus per-frame tile assignments. A scene with 10 million Gaussians fits in under 5 GB of VRAM during rendering. This means lower-cost GPUs cover serving workloads without any quality trade-off.

The practical implication: use H100 or A100 instances for training (provision them, run training, terminate them), then use L40S GPU rental instances for long-running rendering servers. Per-minute billing on cloud instances makes this split economically clean.

Training Pipelines

Inria 3DGS (Original Implementation)

The reference implementation from the SIGGRAPH 2023 paper (graphdeco-inria/gaussian-splatting). PyTorch with custom CUDA kernels for differentiable rasterization. Ships with a SIBR-based real-time viewer. Single-GPU only in its base form.

Install via conda with CUDA extensions compiled at install time:

bash
conda create -n gaussian_splatting python=3.9
conda activate gaussian_splatting
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install plyfile tqdm
git clone https://github.com/graphdeco-inria/gaussian-splatting --recursive
cd gaussian-splatting
pip install submodules/diff-gaussian-rasterization
pip install submodules/simple-knn

Good for: research baselines, small-to-medium scenes, any situation where you want the exact original paper implementation.

gsplat (Nerfstudio's 3DGS Backend)

Open-source, Apache 2.0, and actively maintained. Multi-GPU DDP support distinguishes it from the Inria implementation. Lower memory footprint than the original due to optimized CUDA kernels. Integrates with Nerfstudio's splatfacto method.

bash
pip install gsplat nerfstudio

Run training with Nerfstudio:

bash
ns-train splatfacto --data /path/to/colmap_output

Good for: production training pipelines, large scenes where multi-GPU is needed, teams already using Nerfstudio.

brush

Browser-based interactive 3DGS trainer built in Rust/WebGPU. Not suited for large production scenes, but useful for rapid iteration on small captures. The in-browser feedback loop lets you verify reconstruction quality before committing GPU time on cloud instances. Think of it as a preview tool.

4D Gaussian Splatting for Dynamic Scenes

4D-GS extends static 3DGS with a temporal dimension. A time-conditioned deformation field maps each Gaussian's canonical position to its position at time t. This handles dynamic objects (vehicles, pedestrians, deforming surfaces) that static 3DGS cannot represent.

VRAM requirements are 15-25% higher than static 3DGS for equivalent scene size, because the deformation field adds parameters. Training time increases proportionally. For autonomous driving datasets with moving objects, 4D-GS is the standard approach.

For teams combining 4D-GS with fully synthetic data generation for rare-event augmentation, generating synthetic training data for robotics with NVIDIA Cosmos covers the complementary workflow.

Rendering Pipelines

Real-Time Viewer (SIBR, Nerfstudio, SuperSplat)

Three options for real-time rendering after training:

SIBR is the reference viewer from INRIA, fast but desktop-only. Requires compiled C++ dependencies (LibTorch, GLFW, OpenGL). Can be run headless on a server and proxied.

Nerfstudio/Viser is web-based. After training with ns-train splatfacto, the viewer is accessible in a browser via WebSocket. Works well for remote access to cloud instances.

SuperSplat is an open-source web editor for .splat files. Useful for inspecting and editing trained scenes before deployment, with no GPU required client-side.

WebGPU / WebGL Export

The .splat format (and the compressed .ksplat variant) enables in-browser rendering via WebGPU rasterizers. The export pipeline:

bash
# Convert PLY to .splat format (the converter script name varies by tool;
# convert_to_splat.py is a placeholder here)
python convert_to_splat.py --input output/point_cloud/iteration_30000/point_cloud.ply \
  --output scene.splat

Serve the .splat file from a CDN. Rendering runs client-side via WebGPU, so no server GPU is required for web deployments. This is the lowest-cost path for AR/VR experiences that target modern browsers.
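For reference, the .splat record layout is simple enough to emit directly. The sketch below assumes the 32-byte-per-splat format used by the antimatter15 web viewer (position and scale as float32, color and rotation quantized to bytes); production exporters also sort records by visual importance, and the .ksplat variant compresses further:

```python
import struct

def splat_bytes(splats):
    """Encode splats in the 32-byte record layout used by the antimatter15
    web viewer: position (3 x float32), scale (3 x float32), RGBA color
    (4 x uint8), rotation quaternion (4 x uint8, mapped from [-1, 1]).
    Hedged sketch, not a drop-in replacement for a real exporter."""
    out = bytearray()
    for pos, scale, rgba, quat in splats:
        out += struct.pack("<3f", *pos)
        out += struct.pack("<3f", *scale)
        out += struct.pack("<4B", *rgba)
        # quantize each quaternion component from [-1, 1] to [0, 255]
        out += struct.pack("<4B", *(int((q + 1.0) * 127.5) for q in quat))
    return bytes(out)
```

At 32 bytes per splat, a 1M-Gaussian scene is about 32 MB before compression, which is why CDN delivery is practical.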

Mobile Streaming

For AR/VR use cases where client compute is limited (older headsets, mobile phones), stream rendered views from a server-side GPU. A headless SIBR or Viser instance renders views on-demand based on client camera pose sent via WebSocket. The rendering server handles all GPU work; the client receives compressed video.

Latency is the key constraint: place the rendering server in the same region as your users. The L40S handles 30+ concurrent client streams for typical 1080p scene sizes, at lower cost than H100-class hardware.

Use Case 1: AR/VR Scene Capture

The photo-to-3DGS pipeline for AR/VR is well-established:

  1. Capture 200-500 overlapping photos of the space (phone camera, DSLR, or drone). More overlap means better reconstruction.
  2. Run COLMAP SfM on GPU cloud to extract camera poses and a sparse point cloud. GPU-accelerated COLMAP runs 5-10x faster on A100 vs CPU for feature extraction and matching.
  3. Train 3DGS on H100 or A100 (1-4 hours depending on scene size).
  4. Export to .splat for WebXR deployment, or stream from a rendering server for headset access.

Concrete cost example: a 300-image indoor office scene on an A100 GPU rental takes approximately 45 minutes for COLMAP preprocessing plus training at $1.10/hr (A100 80GB PCIe on-demand), about $0.83 per scene. At that rate, 1,000 scenes cost under $850 in GPU time; reconstructing them within a single day just means running the instances in parallel.
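The arithmetic behind that estimate:

```python
def scene_cost(minutes, hourly_rate):
    # GPU-time cost for one scene: COLMAP preprocessing + 3DGS training
    return minutes / 60.0 * hourly_rate

per_scene = scene_cost(45, 1.10)  # A100 80GB PCIe on-demand
daily = 1000 * per_scene          # 1,000 scenes' worth of GPU time
```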

Use Case 2: Robotics Simulation

3DGS-generated environments are increasingly used to bootstrap robot manipulation training without building hand-authored physics simulation assets. The capture-to-sim workflow:

  1. Capture the real environment with a camera rig.
  2. Train a 3DGS model of the workspace.
  3. Use the 3DGS scene as a photorealistic rendering background in simulation (combined with physics-simulated robot mesh).
  4. Render synthetic training frames from the 3DGS scene for imitation learning.

This approach fills a gap between fully synthetic environments (which look artificial) and pure real-world data collection (which is slow and expensive). The 3DGS background provides the visual realism; the physics simulation provides the ground truth.
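The background-plus-physics split can be sketched as a per-pixel depth test, assuming the 3DGS renderer and the physics simulator both produce aligned color and depth buffers (a toy illustration, not any framework's actual API):

```python
def fuse_pixel(bg_rgb, bg_depth, robot_rgb, robot_depth):
    """Composite a physics-sim robot render over a 3DGS background render
    for one pixel: whichever surface is closer to the camera wins.
    Toy sketch; a real pipeline runs this as a GPU shader over full buffers."""
    return robot_rgb if robot_depth < bg_depth else bg_rgb
```

Pixels the robot does not cover carry infinite depth in its buffer, so the photorealistic 3DGS background shows through everywhere else.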

For teams working on humanoid robot training pipelines on GPU cloud with models like GR00T N1, 3DGS environments can serve as photorealistic rendering backdrops in Isaac Lab scenes, reducing the visual domain gap between simulation and the real deployment environment.

Training the 3DGS model for a robotics use case typically requires an H100 for robotics simulation training to handle the higher Gaussian counts that come from detailed workspace captures with fine-grained geometry.

Use Case 3: Autonomous Driving with 4D-GS

Autonomous driving datasets involve dynamic scenes that static 3DGS cannot handle. 4D-GS adds a deformation field to model motion: other vehicles, pedestrians, cyclists. The AV pipeline:

  1. Multi-camera capture rig on the vehicle (6-12 cameras, 100k+ frames per session).
  2. LiDAR-aided COLMAP or SLAM-based initialization for accurate camera pose estimation across long driving sequences.
  3. 4D-GS training on a multi-GPU H100 cluster via gsplat DDP (8-12 hours for a full driving session).
  4. Novel view synthesis for data augmentation: generate views from camera angles not present in the original capture, filling blind spots in the training distribution.

This workflow is directly complementary to NVIDIA Cosmos. Cosmos generates fully synthetic environments for rare-event coverage. 4D-GS provides photorealistic novel-view augmentation of real captures. A combined pipeline uses both: Cosmos for generating adverse weather, night driving, and edge-case scenarios from scratch, and 4D-GS for expanding the viewpoint diversity of your real data.

Cost Model: A100 vs L40S vs H100 for Training and Serving

Training Cost for a 100k-Image Scene

| GPU | On-Demand ($/hr) | Spot ($/hr) | Est. Training Time | Est. Cost (on-demand) | Est. Cost (spot) |
|---|---|---|---|---|---|
| A100 80GB PCIe | $1.10 | N/A | 10-14 hrs (single GPU) | $11.00-15.40 | N/A |
| A100 80GB SXM4 | $1.64 | $0.45 | 10-14 hrs (single GPU) | $16.40-22.96 | $4.50-6.30 |
| H100 PCIe | $2.01 | N/A | 6-8 hrs | $12.06-16.08 | N/A |
| H100 SXM5 x8 | $4.21/GPU | N/A | 1.5-2 hrs (DDP) | $50.52-67.36 | N/A |

Pricing fluctuates based on GPU availability. The prices above are based on 08 May 2026 and may have changed. Check current GPU pricing → for live rates.

For a single-GPU budget, A100 80GB on spot is the most cost-efficient option at $0.45/hr, bringing a 10-14 hour training run down to $4.50-$6.30 total. When wall-clock time matters more than cost, an 8-GPU H100 SXM5 cluster completes the same job in 1.5-2 hours on-demand at $50.52-$67.36.

Rendering Server Cost

| GPU | On-Demand ($/hr) | VRAM | Scene Size (10M Gaussians) | Concurrent Streams |
|---|---|---|---|---|
| L40S 48GB | $0.72 | 48 GB GDDR6 | Fits | ~30 at 1080p |
| RTX PRO 6000 96GB | $1.70 | 96 GB GDDR7 | Fits comfortably | ~50+ at 1080p |
| H100 PCIe 80GB | $2.01 | 80 GB HBM3 | Fits | ~60+ at 1080p |

Pricing fluctuates based on GPU availability. The prices above are based on 08 May 2026 and may have changed. Check current GPU pricing → for live rates.

For pure rendering server use cases, the RTX PRO 6000 for 3DGS rendering offers 96 GB of GDDR7, which comfortably fits even large 10M+ Gaussian scenes with headroom for multiple concurrent preloaded scenes. The L40S is the most cost-efficient option for teams serving a single large scene to 30 or fewer concurrent users.

Split Training/Serving Topology

The Spheron angle for 3DGS is straightforward: spin up H100 or A100 instances for training, then terminate them. Spin up L40S or RTX PRO 6000 instances for long-running rendering servers. Per-minute billing means you pay for training only while it runs, not while the model is being served.

For a team reconstructing 10 new scenes per week and serving them continuously:

  • Training: 10 scenes x $8 avg cost per scene (H100 PCIe, 4-hour outdoor scene) = $80/week
  • Serving: 1x L40S at $0.72/hr x 168 hrs/week = $120.96/week
  • Total: roughly $200/week for a full 3DGS pipeline on cloud, with no infrastructure management overhead

The same setup on dedicated hardware would require an H100 ($25k-35k) for training plus a rendering workstation ($10k-15k), totaling $35k-50k in CapEx before accounting for power, cooling, and maintenance.

Deployment Recipe: Training Cluster and Rendering Server on Spheron

Step 1: Provision and Configure

Rent an H100 or A100 80GB instance on Spheron. After SSH access is confirmed, verify the CUDA version and install COLMAP:

bash
nvcc --version
sudo apt-get install colmap

Note that distro COLMAP packages are often built without CUDA support; if GPU-accelerated feature extraction fails in the next step, build COLMAP from source with CUDA enabled.

Step 2: Preprocess with COLMAP

bash
colmap feature_extractor \
  --database_path ./colmap.db \
  --image_path ./images \
  --ImageReader.camera_model PINHOLE \
  --SiftExtraction.use_gpu 1

colmap exhaustive_matcher --database_path ./colmap.db

mkdir -p sparse
colmap mapper \
  --database_path ./colmap.db \
  --image_path ./images \
  --output_path ./sparse

The output sparse/0/ directory with cameras.bin, images.bin, and points3D.bin is the input to 3DGS training.

Step 3: Train with gsplat or Inria 3DGS

With Inria's implementation:

bash
python train.py \
  -s /path/to/colmap_output \
  -m /path/to/output_model \
  --iterations 30000 \
  --eval

With gsplat/Nerfstudio:

bash
ns-train splatfacto \
  --data /path/to/colmap_output \
  --max-num-iterations 30000 \
  --output-dir /path/to/output

Monitor VRAM usage during training. If you hit OOM, increase --densification_interval (less frequent densification means fewer Gaussians added per training step) or lower the initial point cloud density via COLMAP's --SiftExtraction.max_num_features flag in the feature_extractor step.

Step 4: Export the PLY

After training, the output is a PLY file at output/point_cloud/iteration_30000/point_cloud.ply. Typical file sizes: 50-200 MB for small scenes, 500 MB to 2 GB for large outdoor captures. The PLY contains all Gaussian parameters: position, covariance matrix, opacity, and spherical harmonic coefficients for view-dependent color.
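The attribute layout is visible in the PLY header. Below is a sketch of the header the Inria trainer writes, with property names following the reference implementation's export (verify against your own output before depending on them):

```python
def gaussian_ply_header(n, sh_degree=3):
    """Generate the header of a 3DGS point_cloud.ply: positions, normals,
    DC color, higher-order SH coefficients, opacity, log-scale, rotation."""
    props = ["x", "y", "z", "nx", "ny", "nz"]
    props += [f"f_dc_{i}" for i in range(3)]            # base RGB (SH degree 0)
    n_rest = 3 * ((sh_degree + 1) ** 2 - 1)             # 45 for degree 3
    props += [f"f_rest_{i}" for i in range(n_rest)]     # view-dependent SH
    props += ["opacity"]                                # stored as a logit
    props += [f"scale_{i}" for i in range(3)]
    props += [f"rot_{i}" for i in range(4)]             # quaternion
    lines = ["ply", "format binary_little_endian 1.0", f"element vertex {n}"]
    lines += [f"property float {p}" for p in props]
    lines.append("end_header")
    return "\n".join(lines)
```

Each Gaussian is 62 float32 properties, or 248 bytes per vertex in the binary body, which matches the 500 MB to 2 GB file sizes seen for multi-million-Gaussian scenes.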

Step 5: Deploy the Rendering Server

On a separate L40S instance, run the Nerfstudio/Viser viewer:

bash
ns-viewer --load-config /path/to/output/config.yml

This starts a web-accessible viewer on port 7007 by default. Configure NGINX as a reverse proxy to expose it on port 443 with a certificate. Client connections via browser get real-time rendered views at the server-side camera pose streamed back as compressed video.

Optimization: Hierarchical 3DGS, LOD, and Compression

Three techniques reduce compute requirements and improve quality for large scenes:

  1. Hierarchical 3DGS splits the scene into chunks processed at different detail levels. It reduces peak VRAM by 30-40% for large outdoor scenes by loading only the detail-level chunks relevant to the current viewpoint.
  2. Level of detail (LOD) renders high-density Gaussians near the camera and coarser representations at distance. Nerfstudio's LOD extensions support this. The visual impact of distant Gaussians is limited by their pixel footprint, so reducing their density has minimal quality cost.
  3. Self-organizing Gaussians and compression. After training, typically 30-50% of Gaussians are near-transparent and contribute little to the final render. Pruning these plus quantizing spherical harmonic coefficients can reduce PLY size by 5-10x with minimal quality loss. Mini-Splatting and Compact3DGS are open-source implementations.
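The opacity-pruning part of the compression step can be sketched as follows, assuming opacities are stored as logits (as in the reference implementation's PLY export) and using a typical 0.005 alpha threshold; real compression pipelines pair this with SH coefficient quantization:

```python
import math

def prune_gaussians(opacity_logits, threshold=0.005):
    """Return indices of Gaussians worth keeping. The trained opacity is
    stored as a logit; sigmoid maps it back to an alpha in [0, 1].
    Toy sketch of post-training pruning, not Mini-Splatting itself."""
    keep = []
    for i, logit in enumerate(opacity_logits):
        alpha = 1.0 / (1.0 + math.exp(-logit))
        if alpha >= threshold:
            keep.append(i)
    return keep
```

Since near-transparent Gaussians contribute almost nothing to the composited color, dropping them shrinks both the file and the per-frame sort with little visible change.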

For teams running large AV datasets, combining hierarchical loading with aggressive Gaussian pruning is the standard path to making 10M+ Gaussian scenes serve at interactive frame rates on a single L40S.

The Bottom Line

3DGS is a production technology. The tooling (gsplat, Nerfstudio, 4D-GS) is stable, GPU requirements are well understood, and the training-to-serving topology maps naturally to on-demand GPU cloud instances. The split profile (high-VRAM training on H100/A100, cost-efficient rendering on L40S/RTX PRO 6000) makes GPU cloud more economical than owning dedicated hardware for most teams. Run training on H100 or A100 instances, then serve results on an L40S for rendering at $0.72/hr without keeping your training hardware idle.


3D Gaussian Splatting production pipelines split naturally into high-VRAM training (H100, A100) and cost-efficient rendering (L40S, RTX PRO 6000). Spheron's mixed-instance access lets you match each stage to the right GPU without long-term commitment.

Rent H100 for 3DGS training → | Rent L40S for rendering → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.