Tutorial

NVIDIA DGX Spark and GPU Cloud: Local-to-Cloud AI Pipeline (2026)

Written by Mitrasish, Co-founder · Apr 17, 2026
Tags: DGX Spark, NVIDIA Project DIGITS, GPU Cloud, Local AI Development, Model Deployment, LLM Inference, GPU Infrastructure

DGX Spark ships with 1 PFLOP of FP4 compute (with sparsity) and 128GB of unified memory for $4,699. That covers a lot of use cases. But there are specific points where it hits its ceiling, and knowing where those are before you reach them saves real time.

This guide covers the local-to-cloud workflow: what to run on DGX Spark, when to move to cloud GPUs, and exactly how to make that transition with minimal friction.

What Is DGX Spark

DGX Spark is NVIDIA's desktop AI computer, announced as Project DIGITS and now shipping commercially. It runs the GB10 Grace Blackwell Superchip, combining an ARM-based Grace CPU with a Blackwell GPU in a single chip sharing 128GB of LPDDR5x unified memory.

The key specs:

| Spec | DGX Spark | H100 PCIe (80GB) | H100 SXM5 (80GB) |
| --- | --- | --- | --- |
| Memory | 128GB unified | 80GB HBM2e | 80GB HBM3 |
| Compute | 1 PFLOP FP4 (with sparsity) | ~3.0 POPS INT8 (with sparsity) | ~4.0 POPS INT8 (with sparsity) |
| Bandwidth | ~273 GB/s (LPDDR5x) | 2.0 TB/s (HBM2e) | 3.35 TB/s (HBM3) |
| Networking | 10GbE + ConnectX-7 200Gb | InfiniBand optional | InfiniBand (400 Gb/s) |
| Concurrent users | 1-3 (dev) | 50-500+ (prod) | 50-500+ (prod) |
| Uptime guarantee | None | 99.9% SLA | 99.9% SLA |
| Cost | $4,699 one-time | $2.01/hr on-demand | $2.57/hr on-demand |

It runs standard PyTorch, HuggingFace Transformers, and vLLM. Any model that works in the CUDA ecosystem works on DGX Spark without modification.

What DGX Spark Can Run

128GB of unified memory is enough for larger models than any single discrete GPU currently on the market. With quantization:

  • Llama 4 Scout (109B, FP8): ~109GB of weights after FP8 compression; fits with roughly 19GB left for KV cache and runtime overhead.
  • DeepSeek-R1-Distill-Llama-70B (BF16): ~140GB at full precision, so use FP8 (reduces to ~70GB) or Q4 quantization (~35-40GB).
  • Qwen2.5 (72B, FP16): ~144GB at FP16, needs FP8 or Q4 to fit.
  • Models up to 200B with INT4: Practical upper bound with aggressive quantization.

The 200B ceiling is not a hard limit. It depends on the quantization format and how much context length you need. At INT4 with short contexts, larger models fit; at BF16 with long contexts, even considerably smaller models run out of memory.
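The memory arithmetic behind these numbers is simple: weight footprint ≈ parameter count × bytes per parameter (2 for BF16, 1 for FP8, ~0.5 for INT4), with KV cache and activations on top. A minimal sketch:

```python
# Approximate weight footprint: params (billions) x bytes per parameter.
# Real usage is higher: KV cache, activations, and runtime overhead add on top.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_b: float, fmt: str) -> float:
    """Model weight footprint in GB (decimal, 1 GB = 1e9 bytes)."""
    return params_b * BYTES_PER_PARAM[fmt]

for params_b, fmt in [(70, "bf16"), (70, "fp8"), (70, "int4"), (109, "fp8"), (200, "int4")]:
    fits = "fits" if weight_gb(params_b, fmt) < 128 else "does not fit"
    print(f"{params_b}B @ {fmt}: ~{weight_gb(params_b, fmt):.0f}GB -> {fits} in 128GB")
```

This reproduces the list above: 70B lands at 140GB in BF16 (too big), 70GB in FP8, and 35GB at INT4, while 200B at INT4 is ~100GB and fits.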

Where DGX Spark Hits Its Limit

Multi-user concurrency. Each user request needs its own KV cache allocation. At 5-10 simultaneous users on a 70B model, the KV cache fills the available memory and requests start queueing. vLLM's PagedAttention manages cache memory efficiently on both systems, but an H100's 2+ TB/s of HBM bandwidth keeps latency acceptable under that load, while DGX Spark's ~273 GB/s of LPDDR5x does not.

Long contexts above 32k tokens. KV cache scales linearly with context length. A 32k-token context for a 70B model can consume 20-40GB of additional memory, leaving less room for other requests.
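To see where those numbers come from, estimate per-request KV cache as 2 (K and V) × layers × KV heads × head dim × bytes per element × tokens. Using Llama-70B-style dimensions as an assumed example (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache):

```python
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Per-request KV cache in GiB: one K and one V tensor per layer."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes
    return per_token_bytes * tokens / 2**30

print(f"{kv_cache_gib(32_768):.1f} GiB per 32k-token request")
print(f"{kv_cache_gib(32_768) * 3:.1f} GiB for 3 concurrent 32k requests")
```

That is ~10 GiB per 32k-token request with these dimensions; three or four such requests on top of ~70GB of FP8 weights already approaches the 128GB ceiling, which is why a handful of long-context users is enough to saturate the machine.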

Models above 200B parameters. Even with INT4 quantization, a 200B model needs roughly 100GB. A 405B model (Llama 3.1 405B) in Q4 needs ~200GB. That requires multi-GPU cloud instances.

24/7 production SLAs. DGX Spark is a desktop computer. It has no managed uptime, no automatic failover, and no replacement SLA. If it crashes at 3am, someone has to restart it manually.

Multi-GPU tensor parallelism. Splitting a model across GPUs for higher throughput requires NVLink or InfiniBand between GPUs. DGX Spark is a single-GPU system. Cloud instances with InfiniBand fabric handle tensor parallelism at scale.

The Local-to-Cloud AI Development Workflow

The most cost-efficient approach for teams building LLM applications splits work between two phases:

Phase 1: Local (DGX Spark)
  - Model selection and initial testing
  - Fine-tuning on domain data
  - Prompt engineering and evals
  - Integration testing
  - Quantization experiments

Phase 2: Cloud (Spheron GPU)
  - Production inference serving
  - Multi-user concurrent access
  - 24/7 availability
  - Models above 200B parameters
  - High-throughput batch jobs

This split beats "all cloud" because dev iteration on cloud with an H100 at $2.01/hr adds up fast. A three-person team running 8 hours per day of dev costs $4,342 per quarter on cloud GPUs. One DGX Spark shared by three developers costs $4,699 total, and it breaks even in about 97 days.
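The break-even arithmetic, using the same assumed rates (3 developers, 8 hours/day, H100 at $2.01/hr):

```python
# Assumptions from the scenario above: team size, daily usage, Spheron H100 rate.
DEVS, HOURS_PER_DAY, H100_RATE = 3, 8, 2.01
DGX_SPARK_PRICE = 4_699

daily_cloud_cost = DEVS * HOURS_PER_DAY * H100_RATE   # $48.24/day of cloud dev
quarter_cost = daily_cloud_cost * 90                  # one quarter of cloud dev
breakeven_days = DGX_SPARK_PRICE / daily_cloud_cost   # days until Spark pays off

print(f"Cloud dev cost per quarter: ${quarter_cost:,.0f}")
print(f"DGX Spark break-even: {breakeven_days:.0f} days of shared dev use")
```

This reproduces the $4,342-per-quarter figure and the ~97-day break-even quoted above.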

It also beats "all local" because DGX Spark cannot provide the uptime, concurrency, or multi-GPU throughput that production services need.

For the full deployment lifecycle from prototype to production, see the LLM deployment guide.

Develop and Test Locally on DGX Spark

Set up a local vLLM server as the development inference endpoint:

bash
# Install dependencies on DGX Spark
pip install vllm transformers torch

# Start a local inference server
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85

# Test with curl
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "prompt": "Hello", "max_tokens": 100}'

A few things worth doing during the dev phase:

Use FP8 or INT8 quantization to fit larger models without sacrificing much quality. Pass --quantization fp8 for FP8 on Blackwell hardware (vLLM's --dtype flag only selects float formats like bfloat16; FP8 goes through --quantization). On older GPUs without native FP8 support, --quantization awq with an AWQ-quantized checkpoint works on most consumer and workstation cards.

Profile with nvidia-smi. Run nvidia-smi dmon -s um in a separate terminal while running inference (-s u reports utilization, -s m adds frame-buffer memory usage). Watch both to understand how close you are to the memory ceiling.

Set the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce OOM errors when memory allocations fragment the GPU memory pool. This is especially useful on unified memory systems like DGX Spark.

For simpler local serving without vLLM, Ollama is a lower-friction option. See running LLMs locally with Ollama for that workflow.

Containerize for Cloud Portability

Use the same Docker image in development and production. This eliminates the "works on my machine" class of bugs before they reach production.

dockerfile
FROM vllm/vllm-openai:latest

WORKDIR /app

ENV MODEL_NAME="meta-llama/Llama-4-Scout-17B-16E-Instruct"
ENV MAX_MODEL_LEN=32768
ENV QUANTIZATION=fp8

ENTRYPOINT []
CMD exec python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_NAME}" \
    --quantization "${QUANTIZATION}" \
    --max-model-len "${MAX_MODEL_LEN}" \
    --host 0.0.0.0 --port 8000

Build and run locally:

bash
# Build image
docker build -t my-llm-server .

# Run on DGX Spark
docker run --gpus all --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -e MODEL_NAME="meta-llama/Llama-4-Scout-17B-16E-Instruct" \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  my-llm-server

When you deploy this same image on a Spheron GPU instance, it runs identically. The environment variables let you swap models without rebuilding the image.

Deploy to GPU Cloud for Production

Step 1: Provision a Spheron Instance

Go to app.spheron.ai and select a GPU instance based on your model's VRAM requirements. Use on-demand for always-on production serving, or spot instances for fault-tolerant batch jobs at lower cost.

GPU selection by model size:

| Model Size | Recommended GPU | VRAM | Spheron Price |
| --- | --- | --- | --- |
| Up to 70B (FP8) | H100 PCIe 80GB | 80GB (needs ~70GB) | $2.01/hr |
| Up to 70B (BF16) | 2x H100 SXM5 80GB | 160GB (needs ~140GB) | ~$5.14/hr |
| 70B-140B (FP8) | 2x H100 PCIe | 160GB | ~$4.02/hr |
| Up to ~141GB of weights (single GPU) | H200 SXM5 | 141GB HBM3e | $3.69/hr |
| 200B+ (quantized) | 4x H100 PCIe | 320GB+ | ~$8.04/hr |
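The table reduces to a simple rule: pick the smallest instance whose VRAM covers the weight footprint plus headroom for KV cache. A hypothetical picker (tier names and prices taken from the table above, which will drift over time; real provisioning should also weigh context length and concurrency):

```python
# Hypothetical helper over the tiers above. The 25% headroom factor is an
# illustrative assumption, not a vLLM or Spheron requirement.
TIERS = [  # (total VRAM in GB, instance, $/hr)
    (80, "H100 PCIe 80GB", 2.01),
    (141, "H200 SXM5", 3.69),
    (160, "2x H100 PCIe", 4.02),
    (320, "4x H100 PCIe", 8.04),
]

def pick_gpu(weights_gb: float, headroom: float = 1.25):
    """Smallest tier whose VRAM covers weights plus KV-cache headroom."""
    need = weights_gb * headroom
    for vram, name, price in TIERS:
        if vram >= need:
            return name, price
    raise ValueError(f"no single tier fits {need:.0f}GB")

print(pick_gpu(35))    # 70B at INT4
print(pick_gpu(109))   # Llama 4 Scout at FP8
```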

After provisioning, SSH in and verify the GPU:

bash
ssh root@<spheron-ip>
nvidia-smi

Step 2: Transfer Model Weights

If you fine-tuned a model on DGX Spark, transfer the weights to the cloud instance:

bash
# From DGX Spark, sync weights to cloud instance
rsync -avz --progress \
  ~/.cache/huggingface/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct/ \
  root@<spheron-ip>:/root/.cache/huggingface/hub/models--meta-llama--Llama-4-Scout-17B-16E-Instruct/

For base models you have not fine-tuned, skip the transfer and pull directly from HuggingFace Hub on the cloud instance. That avoids the upload time if your internet connection is slower than the cloud instance's download speed.

bash
# On Spheron instance, pull from HuggingFace Hub directly
pip install huggingface_hub
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct
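Whether to rsync or re-download comes down to link speed. Rough arithmetic, with assumed speeds (a 100 Mbps office uplink versus a multi-Gbps datacenter link; both are illustrative):

```python
def transfer_hours(size_gb: float, link_gbps: float) -> float:
    """Hours to move size_gb over a link of link_gbps (decimal GB, ignores protocol overhead)."""
    return size_gb * 8 / link_gbps / 3600

weights_gb = 109  # e.g. Llama 4 Scout in FP8
print(f"Upload at 100 Mbps: {transfer_hours(weights_gb, 0.1):.1f} h")
print(f"Cloud download at 5 Gbps: {transfer_hours(weights_gb, 5) * 60:.1f} min")
```

At those assumed speeds the upload takes hours while the cloud-side download takes minutes, which is why pulling base models directly from the Hub usually wins.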

Step 3: Start the Production Server

bash
# On Spheron H100 instance
docker run --gpus all --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --tensor-parallel-size 1

For multi-GPU inference with tensor parallelism, provision a single instance with multiple GPUs (e.g., a 2x H100 node) and set --tensor-parallel-size 2. The --tensor-parallel-size flag requires multiple GPUs on the same host sharing fast interconnect. It does not work across separate cloud instances. See the vLLM production deployment guide for multi-GPU configuration.
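The memory logic behind the tensor-parallel choice: weights shard roughly evenly across GPUs, so each GPU holds about weights / tp_size plus KV cache and runtime overhead. A back-of-envelope check (the 8GB overhead figure is an illustrative assumption):

```python
def per_gpu_gb(weights_gb: float, tp_size: int, overhead_gb: float = 8) -> float:
    """Approximate per-GPU memory under tensor parallelism: even weight shard plus overhead."""
    return weights_gb / tp_size + overhead_gb

# 70B in BF16 (~140GB) cannot fit one 80GB H100, but shards across two:
print(f"tp=1: {per_gpu_gb(140, 1):.0f}GB per GPU")
print(f"tp=2: {per_gpu_gb(140, 2):.0f}GB per GPU")
```

At tp=2 each H100 holds ~78GB, which fits in 80GB but leaves little room for KV cache, which is why FP8 or a larger-VRAM GPU is often the better call for 70B serving.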

Step 4: Point Your App at the Cloud Endpoint

vLLM exposes an OpenAI-compatible API. The only change required in your application code is the base URL:

python
from openai import OpenAI

# Local development (DGX Spark)
# client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Production (Spheron cloud, see TLS note below)
client = OpenAI(base_url="https://<your-domain>:443/v1", api_key="<your-token>")

response = client.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    prompt="Summarize this document:",
    max_tokens=500,
)

No other code changes. Local and production use the same client interface.

TLS and authentication are required before exposing this endpoint to any network. Do not connect your application directly to http://<spheron-ip>:8000 in production. vLLM itself has no TLS support or API key enforcement, so you need a reverse proxy in front of it:

  • Nginx or Caddy — terminate TLS at the reverse proxy and forward plain HTTP to localhost:8000 on the same instance. Caddy generates and renews Let's Encrypt certificates automatically. Nginx requires a certificate from Let's Encrypt or your CA.
  • API key enforcement — add an Authorization header check in the Nginx config (e.g., auth_request) or use a lightweight gateway such as Traefik with a middleware plugin.
  • Loopback bind (primary control) — the -p 127.0.0.1:8000:8000 flag in the docker run command restricts vLLM to loopback only. Docker bypasses UFW by inserting iptables rules directly, so ufw deny 8000 alone does not block a port published with -p 8000:8000. Binding to 127.0.0.1 prevents Docker from publishing the port to public interfaces entirely. ufw deny 8000 can be kept as defense-in-depth but is not the primary control.

Without these steps, all prompts and completions travel over plaintext HTTP and any host that can reach your instance IP can consume compute with no authentication.

When DGX Spark Is Enough vs When You Need GPU Cloud

| Scenario | DGX Spark | GPU Cloud | Why |
| --- | --- | --- | --- |
| Single developer testing prompts | Yes | No | No concurrency needed |
| Fine-tuning a 70B model in FP8 | Yes | Maybe | Fits in 128GB with ~70GB weights |
| Serving 5+ concurrent users | No | Yes | KV cache exhaustion under load |
| 200B+ model inference | No | Yes | Memory ceiling |
| 24/7 production API | No | Yes | No SLA, manual restarts |
| Batch embedding jobs overnight | No | Yes | Cloud is faster and frees DGX Spark for dev |
| Training from scratch above 7B | No | Yes | Throughput too low, no multi-GPU |
| Long contexts above 32k at scale | No | Yes | KV cache pressure with multiple users |

The clearest signal for moving to cloud: your DGX Spark becomes a production dependency instead of a development tool. Once it does, you lose its value as a dev machine and you are running production workloads on hardware with no SLA.

Cost Analysis: DGX Spark + Cloud vs Full Cloud

Scenario: a team of three developers spending 3 months building and iterating on a 70B assistant, followed by 9 months of production serving with one always-on H100.

Cloud GPU rate used: H100 PCIe at $2.01/hr on-demand (from Spheron API, 17 Apr 2026).

Dev compute:

  • Cloud-only: 3 devs × 8 hrs/day × 90 days × $2.01 = $4,342
  • DGX Spark ×1 (shared): $4,699 one-time

Production compute (months 4-12):

  • 1x H100 PCIe × 24 hrs × 270 days × $2.01 = $13,025

| Approach | Dev Cost | Prod Cost | Year 1 Total | Year 2+ Dev Cost |
| --- | --- | --- | --- | --- |
| Full cloud (dev + prod H100) | $4,342 | $13,025 | $17,367 | $4,342/yr |
| DGX Spark ×3 + prod H100 | $14,097 | $13,025 | $27,122 | $0 |
| DGX Spark ×1 shared + prod H100 | $4,699 | $13,025 | $17,724 | $0 |

The one shared DGX Spark scenario costs about the same as full cloud in year 1 ($357 more). In year 2 the team saves $4,342 in dev compute because the hardware is already paid for, so the shared DGX Spark pulls ahead of full cloud within the first month or two of year 2.

For a 3-person team that keeps developing after year 1, the DGX Spark breaks even at roughly 97 days of shared dev usage (4,699 / (3 × 8 × 2.01) ≈ 97 days), about 3 months at a normal development cadence. For a broader look at GPU cost strategy by funding stage, see GPU cloud for AI startups in 2026.

Pricing fluctuates based on GPU availability. The prices above are based on 17 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

Monitoring and Production Operations

Moving to cloud changes what you need to watch. On DGX Spark you can run nvidia-smi dmon in a terminal. In production you need something that keeps running and pages you when things go wrong.

Set up Prometheus scraping on vLLM's /metrics endpoint. The two metrics that matter most: vllm:num_requests_waiting (request queue depth, alerts when it climbs above your target SLA) and vllm:gpu_cache_usage_perc (KV cache fill rate, alerts at 90%+ to indicate the model needs more VRAM or fewer concurrent requests).
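As a sketch of what that alert check does, here is the Prometheus text format parsed by hand (in production, Prometheus and Alertmanager handle this; the metric names follow vLLM's /metrics exposition, and the thresholds are the illustrative ones from above):

```python
# Sample of vLLM's Prometheus exposition format (values are made up for illustration).
SAMPLE_METRICS = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
vllm:num_requests_waiting 12.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage (1 means 100 percent).
vllm:gpu_cache_usage_perc 0.93
"""

def parse_metrics(text: str) -> dict:
    """Pull name/value pairs out of Prometheus exposition text, skipping comments."""
    out = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            out[name] = float(value)
    return out

m = parse_metrics(SAMPLE_METRICS)
if m["vllm:num_requests_waiting"] > 5:
    print("ALERT: request queue is backing up")
if m["vllm:gpu_cache_usage_perc"] > 0.90:
    print("ALERT: KV cache nearly full; add VRAM or cap concurrency")
```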

Wrap vLLM in a systemd service so it restarts automatically if it crashes:

ini
[Unit]
Description=vLLM inference server
After=network.target

[Service]
ExecStart=/usr/bin/docker run --rm --gpus all --ipc=host \
  -p 127.0.0.1:8000:8000 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization fp8 \
  --max-model-len 32768
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

For the full monitoring setup including DCGM and GPU utilization dashboards, see GPU monitoring for ML.


DGX Spark covers local development. When you are ready to serve production traffic, run multi-GPU inference, or need guaranteed uptime, Spheron gives you H100 and H200 instances with per-minute billing and no contracts.

Rent H100 → | Rent H200 → | View all pricing →

Get started on Spheron →
