Tutorial

Deploy OpenVLA on GPU Cloud: Self-Host the Open Vision-Language-Action Robotics Foundation Model (2026 Setup Guide)

Written by Mitrasish, Co-founder | May 2, 2026
Tags: OpenVLA, VLA Model Deployment, Robotics Foundation Model, Vision Language Action Model, OpenVLA 7B, GPU Cloud, Robot Learning, Self-Hosted AI, Open X-Embodiment, Fine-Tuning, Action Chunking, Closed-Loop Robot Control

OpenVLA is not a VLM that returns captions or answers questions. It returns robot actions. That single difference changes every deployment decision: model call latency has a hard deadline set by your robot's control frequency, not by user patience. A 10 Hz control loop gives you 100 ms per step. A network round-trip to an external API typically eats 50-200 ms before the model even runs. That math rules out remote inference for any closed-loop robot that needs to react in real time.

RT-2 (Google) and Pi-0 (Physical Intelligence) are the best-known alternatives, but both are closed-source API-only services. OpenVLA is the only fully open-weight vision-language-action model in the 7B range, released under an MIT license. You can fine-tune it on your robot's proprietary demonstration data, run it on your own hardware with no per-call fees, and modify the action tokenizer if your robot's action space differs from the training distribution.

For context on how general vision-language models differ from action-producing models, see Deploy Vision Language Models on GPU Cloud.

What Is OpenVLA

OpenVLA 7B is built on Prismatic-7B, a vision-language model that uses a dual ViT encoder: SigLIP for high-level semantic features and DinoV2 for spatial detail. The language backbone is a 7B Llama-2-based model. Together, the Prismatic ViT encoder plus the LM decoder produce a model that takes an RGB image plus a natural language instruction and generates a robot action.

The action space is 7-DoF: x, y, z translation, roll, pitch, yaw rotation, and gripper open/close. Each continuous float32 value in the action vector is discretized into one of 256 bins, then mapped to a token ID. Rather than extending the vocabulary, OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer vocabulary with these action bin tokens. A single action step produces 7 output tokens, one per dimension. Because the vocabulary size is unchanged, the tokenizer is not what requires --trust-remote-code; the flag is needed because the model ships a custom model class (fused visual encoder plus action decoding) that vLLM and HuggingFace load from the model repository.
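The binning and token mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the vocabulary size of 32000, the normalized range of [-1, 1], and the choice of 256 uniformly spaced bin edges are assumptions, and the function names are hypothetical.

```python
import numpy as np

VOCAB_SIZE = 32000  # assumed Llama-2 tokenizer size
N_BINS = 256        # one token slot per bin at the tail of the vocabulary

# Assumption: actions are already normalized to [-1, 1] and binned uniformly.
BIN_EDGES = np.linspace(-1.0, 1.0, N_BINS)
BIN_CENTERS = (BIN_EDGES[:-1] + BIN_EDGES[1:]) / 2.0


def actions_to_token_ids(action: np.ndarray) -> np.ndarray:
    """Map a normalized action vector to token IDs in the last 256 vocab slots."""
    clipped = np.clip(action, -1.0, 1.0)
    bin_index = np.digitize(clipped, BIN_EDGES)  # integers in 1..256
    return VOCAB_SIZE - bin_index


def token_ids_to_actions(token_ids: np.ndarray) -> np.ndarray:
    """Invert the mapping: token ID -> bin index -> bin center."""
    bin_index = VOCAB_SIZE - np.asarray(token_ids)
    center_index = np.clip(bin_index - 1, 0, len(BIN_CENTERS) - 1)
    return BIN_CENTERS[center_index]


# One 7-DoF step: x, y, z, roll, pitch, yaw, gripper (normalized values)
action = np.array([0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.75])
token_ids = actions_to_token_ids(action)
recovered = token_ids_to_actions(token_ids)
```

The round trip loses at most half a bin width per dimension, which is the quantization error the model inherits from this scheme.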

| Property | Value |
| --- | --- |
| Parameters | ~8B |
| Base model | Prismatic-7B (SigLIP + DinoV2 + Llama-2 7B backbone) |
| Action space | 7-DoF delta actions (x, y, z, roll, pitch, yaw, gripper) |
| Action tokens | 256 discrete bins per dimension, 7 tokens per step |
| Context length | 1024 tokens (image tokens + instruction + action) |
| Training data | Open X-Embodiment: ~970k curated episodes, 22 embodiments |
| License | MIT |

Training used a curated subset of the Open X-Embodiment (OXE) dataset: 970,000+ episodes across 22 robot embodiments. The mix spans tabletop manipulation, mobile manipulation, and navigation, including both simulated and real-robot data.

Why Self-Host Instead of Using an API

Closed-loop latency. A 10 Hz control loop gives 100 ms per step. Cloud API round-trips typically add 50-200 ms of network latency before the model runs, which consumes half to all of the step budget on its own. A self-hosted H100 can return an action in under 150 ms including image preprocessing and action de-tokenization, with no WAN hop in the critical path; that fits loops up to roughly 6 Hz. For sub-100 ms loops (10 Hz and above), OpenVLA-OFT (the parallel-decoding follow-up) is worth evaluating as it removes the sequential autoregressive bottleneck.
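The budget arithmetic in this paragraph is easy to check for your own numbers. The sketch below is illustrative; the function names are hypothetical and the latencies are the estimates quoted in this article, not measurements.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    return 1000.0 / control_hz


def fits_budget(control_hz: float, inference_ms: float, network_rtt_ms: float = 0.0) -> bool:
    """True if inference plus network round-trip fits inside one control step."""
    return inference_ms + network_rtt_ms <= step_budget_ms(control_hz)


# Remote API at 10 Hz: a 50 ms round-trip plus 150 ms inference misses the 100 ms budget
remote_10hz = fits_budget(10, inference_ms=150, network_rtt_ms=50)

# Self-hosted at 5 Hz: 150 ms inference fits the 200 ms budget
local_5hz = fits_budget(5, inference_ms=150)

# Self-hosted at 10 Hz only works if inference drops below 100 ms
local_10hz_fast = fits_budget(10, inference_ms=80)
```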

Data residency. Live sensor streams and proprietary demonstration data cannot go to an external API in defense robotics, medical robotics, and any context where the robot's observation data is commercially sensitive. Your demonstration data represents months of operator time; it is a competitive asset. Running inference on-premise or in a private cloud instance means that data never leaves your infrastructure.

Fine-tuning on proprietary embodiments. RT-2 and Pi-0 have no public fine-tuning API. If your robot has a different arm configuration, gripper type, or observation setup from the training distribution, you are stuck with the base model's generalization. OpenVLA's LoRA fine-tuning workflow lets you adapt the model to a new embodiment in hours on a single H100. See the GRPO fine-tuning guide for GPU memory math that also applies to LoRA VLA training.

GPU Sizing for OpenVLA Inference

OpenVLA 7B in BF16 occupies approximately 14-15 GB of VRAM for weights. The practical minimum for production serving with KV cache and visual encoder workspace is an A100 40GB: the weights fit in under half the card's memory, leaving 25+ GB for the ViT encoder intermediates and action token KV cache.
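A back-of-the-envelope version of this sizing, as a sketch. The parameter count of 7.5 billion is an assumption chosen to match the 14-15 GB figure above, and the KV cache shape assumes the standard Llama-2 7B configuration (32 layers, 32 KV heads, head dimension 128).

```python
BYTES_BF16 = 2

# Assumption: ~7.5B parameters, consistent with 14-15 GB of BF16 weights
PARAMS = 7.5e9
weights_gb = PARAMS * BYTES_BF16 / 1e9

# Assumed Llama-2 7B attention shape
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

# Keys and values are both cached, hence the factor of 2
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16

# One full-length sequence at --max-model-len 1024
kv_gib_per_sequence = kv_bytes_per_token * 1024 / 2**30

print(f"weights: {weights_gb:.1f} GB, KV cache: {kv_gib_per_sequence:.2f} GiB per 1024-token sequence")
```

Under these assumptions each concurrent request costs about half a GiB of KV cache at full context, so the headroom on a 40 GB card covers dozens of in-flight requests before the cache becomes the constraint.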

For closed-loop control, the H100's memory bandwidth advantage (3.35 TB/s vs the A100's 2 TB/s) translates directly to faster per-step decode. The action is only 7 tokens, but decode throughput at batch size 1 is almost entirely memory-bandwidth-bound.
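The bandwidth argument gives a simple lower bound: at batch size 1, each decoded token has to stream the full weight set from VRAM once. The sketch below applies that rule with an assumed 15 GB of weights. It covers the decode phase only and ignores the ViT forward pass, prompt prefill, and framework overhead, which is why measured per-step latency is higher.

```python
WEIGHT_BYTES = 15e9  # assumed BF16 weight footprint
ACTION_TOKENS = 7    # one token per action dimension


def decode_floor_ms(bandwidth_tb_s: float, tokens: int = ACTION_TOKENS) -> float:
    """Lower bound on decode time: one full weight read per generated token."""
    seconds_per_token = WEIGHT_BYTES / (bandwidth_tb_s * 1e12)
    return tokens * seconds_per_token * 1000.0


h100_floor = decode_floor_ms(3.35)  # H100 SXM5
a100_floor = decode_floor_ms(2.0)   # A100 80GB

print(f"H100 decode floor: {h100_floor:.1f} ms, A100 decode floor: {a100_floor:.1f} ms")
```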

| GPU | VRAM | Precision | Est. Latency (per step) | On-Demand $/hr | Best For |
| --- | --- | --- | --- | --- | --- |
| H100 SXM5 80GB | 80 GB | BF16 | ~100-150 ms | $3.10 | Real-time control (<150 ms), multi-robot fleets |
| H100 SXM5 80GB | 80 GB | FP8 | ~80 ms | $3.10 | Highest throughput, fleet scale |
| A100 80GB SXM4 | 80 GB | BF16 | ~150 ms | $1.64 | Pick-and-place, 5-10 Hz loops |
| L40S 48GB | 48 GB | FP8 | ~200 ms | $0.72 | Cost-sensitive, 2-5 Hz loops |

Latency estimates are based on H100 and A100 memory bandwidth at batch size 1 for an ~8B parameter model generating 7 tokens. Actual numbers depend on your image resolution and preprocessing pipeline. The OpenVLA paper measured ~200 ms per step on a single A100 for autoregressive decoding; H100's higher memory bandwidth brings this down, but sub-100 ms reliably requires OpenVLA-OFT's parallel decoding approach. L40S supports FP8 via the --quantization fp8 flag in vLLM (Ada Lovelace architecture).

Pricing fluctuates based on GPU availability. The prices above are as of 02 May 2026 and may have changed. Check current GPU pricing for live rates.

For H100 instances, see the H100 rental page. For A100 instances, see the A100 rental page. Both are available on-demand with provisioning in under 90 seconds.

Inference Setup with vLLM

Note: vLLM's OpenVLA support is experimental. OpenVLA is not on vLLM's official supported model list as of this writing (see tracking issue vllm-project/vllm#14739). The --trust-remote-code flag enables the custom architecture, but results can vary by vLLM version because OpenVLA's fused SigLIP+DinoV2 visual encoder is not a standard vLLM-supported component. For production workloads, the officially documented path is the HuggingFace predict_action() API. The vLLM path below is useful for teams that need concurrent multi-robot requests and are willing to validate on their specific vLLM version.

vLLM treats OpenVLA as a standard causal language model whose vocabulary has 256 slots repurposed as action tokens. vLLM generates them as ordinary token IDs, and the conversion from token IDs to a continuous action vector happens on the client side. vLLM does not need to know that some tokens represent actions rather than words.

Install dependencies:

bash
# Quote version specifiers so the shell does not treat ">" as a redirect
pip install "vllm>=0.6.0" "transformers>=4.40"
pip install git+https://github.com/openvla/openvla.git

Download weights:

bash
huggingface-cli download openvla/openvla-7b \
  --local-dir /data/models/openvla-7b

Verify the repo name on Hugging Face before downloading. Model repository naming can change between releases.

Launch the vLLM server (BF16, single H100 or A100 80GB):

bash
vllm serve /data/models/openvla-7b \
  --dtype bfloat16 \
  --max-model-len 1024 \
  --served-model-name openvla \
  --trust-remote-code \
  --port 8000

--trust-remote-code is required. OpenVLA uses a custom model class, shipped with the weights, that wires in the fused visual encoder and the action token mapping. Without this flag, vLLM will reject the model config.

Launch the vLLM server (FP8, H100 or L40S):

bash
vllm serve /data/models/openvla-7b \
  --dtype bfloat16 \
  --quantization fp8 \
  --max-model-len 1024 \
  --served-model-name openvla \
  --trust-remote-code \
  --port 8000

Python client with action de-tokenization:

python
import base64
import io

import numpy as np
from openai import OpenAI
from PIL import Image
from transformers import AutoProcessor

# Load the OpenVLA processor for action de-tokenization
processor = AutoProcessor.from_pretrained(
    "/data/models/openvla-7b",
    trust_remote_code=True
)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def get_action(image: Image.Image, instruction: str) -> np.ndarray:
    # Encode the image as a base64 JPEG; convert to RGB first because
    # JPEG cannot store an alpha channel
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG")
    img_b64 = base64.b64encode(buf.getvalue()).decode()

    response = client.chat.completions.create(
        model="openvla",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
                },
                {
                    "type": "text",
                    "text": f"What action should the robot take to {instruction}?"
                }
            ]
        }],
        max_tokens=7,    # one token per action dimension
        temperature=0.0
    )

    # vLLM returns the generated tokens as text.
    # For direct HuggingFace usage (not vLLM), the documented API is:
    #   action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    # NOTE: treat decode_actions below as a placeholder for your own
    # de-tokenization helper. Confirm that your processor version exposes it,
    # and verify the exact decode path against the current OpenVLA repo
    # (https://github.com/openvla/openvla) before using this in production.
    if not response.choices:
        raise RuntimeError("vLLM returned no completions; check server logs for errors")
    generated_text = response.choices[0].message.content
    action = processor.decode_actions(generated_text)
    return action  # shape: (7,) with [x, y, z, roll, pitch, yaw, gripper]
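The client above hands the final conversion to the processor. As a rough sketch of what that last step involves, the function below rescales a normalized action back to robot units. It assumes the model's outputs are normalized to [-1, 1] using per-dimension 1st and 99th percentile statistics, with a mask marking dimensions (commonly the gripper) that are left untouched. The function name and the statistics in the example are hypothetical.

```python
import numpy as np


def unnormalize_action(normalized, q01, q99, mask=None) -> np.ndarray:
    """Rescale values in [-1, 1] to the [q01, q99] range of each dimension.

    Dimensions where mask is False are passed through unchanged.
    """
    normalized = np.asarray(normalized, dtype=np.float64)
    q01 = np.asarray(q01, dtype=np.float64)
    q99 = np.asarray(q99, dtype=np.float64)
    if mask is None:
        mask = np.ones(normalized.shape, dtype=bool)
    scaled = 0.5 * (normalized + 1.0) * (q99 - q01) + q01
    return np.where(np.asarray(mask, dtype=bool), scaled, normalized)


# Hypothetical statistics for a 7-DoF arm; the gripper (last dim) is not rescaled
q01 = [-0.03, -0.03, -0.03, -0.10, -0.10, -0.10, 0.0]
q99 = [0.03, 0.03, 0.03, 0.10, 0.10, 0.10, 1.0]
mask = [True, True, True, True, True, True, False]

robot_action = unnormalize_action([0.0, 1.0, -1.0, 0.5, 0.0, 0.0, 1.0], q01, q99, mask)
```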

The --max-model-len 1024 setting is appropriate because OpenVLA observations are short: one image tokenizes to a few hundred visual tokens, and the instruction plus action generation adds very little additional context. Keeping max-model-len small reduces KV cache pre-allocation and lets you fit more concurrent requests into available VRAM.

TensorRT-LLM Engine Build (Optional, for Sub-100 ms)

The Prismatic ViT encoder is the latency bottleneck at batch size 1. The language model decoder generates only 7 tokens per step, so it finishes quickly. The ViT forward pass, which processes the observation image into visual embeddings, is what determines whether you hit the 80 ms or 150 ms mark.

Building a TensorRT-LLM engine for the Prismatic ViT encoder compiles it to a fixed-shape CUDA kernel tuned for your specific input resolution. At batch size 1, this typically runs 1.5-2x faster than the PyTorch reference implementation. The general workflow is:

bash
# Install TensorRT-LLM
pip install tensorrt-llm

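# Export the ViT encoder to ONNX, then build with trtllm-build
# Exact flags depend on your input resolution and GPU generation.
# The flags below are illustrative: check `trtllm-build --help` for your
# installed version, which may expect a converted checkpoint directory.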
trtllm-build \
  --model_dir /data/models/openvla-7b/vision_encoder \
  --output_dir /data/engines/openvla-vit \
  --dtype bfloat16

This section describes the general TRT-LLM approach. Specific trtllm-build flags for the Prismatic ViT encoder change between OpenVLA releases. Check the OpenVLA GitHub issues for current TRT-LLM compatibility status before investing build time.

Use this path only if the vLLM serving approach exceeds your latency target. The vLLM path is simpler, maintains compatibility with new releases, and works well for control loops up to about 5 Hz.

Production Latency Tuning

Image Preprocessing Pipeline

OpenCV resize and normalization on the robot controller should run in a background thread. The goal is to have the preprocessed tensor ready before the previous action chunk finishes executing, so the GPU call starts immediately at the end of execution rather than waiting for CPU preprocessing.

A typical pipeline: when the controller starts executing action chunk N, it submits the current observation image to the preprocessing thread. By the time the last action in chunk N executes, the preprocessed tensor for chunk N+1 is ready to send. This overlaps CPU preprocessing with robot motion and eliminates preprocessing stalls from the GPU call latency.
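The overlap described above can be sketched with a single background worker. This is a minimal illustration: the NumPy nearest-neighbor resize stands in for an OpenCV resize, and the sleep-based chunk executor is a placeholder for real robot commands.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def preprocess(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbor resize plus [0, 1] scaling (stand-in for cv2.resize + normalize)."""
    height, width = frame.shape[:2]
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = frame[rows][:, cols]
    return resized.astype(np.float32) / 255.0


def execute_chunk(num_actions: int, period_s: float) -> None:
    """Placeholder for sending one action per control period."""
    for _ in range(num_actions):
        time.sleep(period_s)


executor = ThreadPoolExecutor(max_workers=1)

# Hypothetical camera frame captured when chunk N starts executing
frame = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)

pending = executor.submit(preprocess, frame)  # runs while the robot moves
execute_chunk(num_actions=4, period_s=0.01)   # chunk N executes
tensor = pending.result()                     # ready to send for chunk N+1
executor.shutdown()
```

A single worker is enough here because only one observation is in flight at a time; the result is collected only after the current chunk has finished executing.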

Action Chunking

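Instead of calling the model once per control step, request a chunk of 8 to 16 actions in a single request. Your robot controller executes the chunk while the GPU decodes the next one. Note that the base OpenVLA checkpoint is trained to predict one action per step, so multi-step output is only meaningful with a checkpoint trained for chunked prediction, such as OpenVLA-OFT.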

The right chunk size depends on two factors: your controller's replanning tolerance (how quickly you need to react to unexpected obstacles or trajectory deviations), and your control frequency. A 10 Hz controller with 8-action chunks replans every 800 ms. A 5 Hz controller with 4-action chunks replans every 800 ms as well. Longer chunks reduce replanning frequency and amortize model call overhead, but increase tracking error on curved paths because the model does not see intermediate observations.

A practical starting point: set chunk size so that the execution time for one chunk equals roughly 1.5x the model call latency. This gives the GPU time to finish decoding the next chunk before the controller needs it.
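The rule of thumb above translates directly into a small calculation. The function names are hypothetical, and the latency values are examples rather than measurements.

```python
import math


def chunk_size_for(latency_ms: float, control_hz: float, safety_factor: float = 1.5) -> int:
    """Smallest chunk whose execution time is at least safety_factor x model latency."""
    period_ms = 1000.0 / control_hz
    return max(1, math.ceil(safety_factor * latency_ms / period_ms))


def replan_interval_ms(chunk_size: int, control_hz: float) -> float:
    """How long the robot runs open-loop on one chunk."""
    return chunk_size * 1000.0 / control_hz


# 150 ms model call at 10 Hz: 1.5 x 150 ms = 225 ms, which rounds up to 3 actions
size_10hz = chunk_size_for(latency_ms=150, control_hz=10)

# The two examples from the text replan at the same interval
interval_a = replan_interval_ms(chunk_size=8, control_hz=10)
interval_b = replan_interval_ms(chunk_size=4, control_hz=5)
```

Start from the computed minimum and increase the chunk size only if the controller starves waiting for the next chunk, since every extra action lengthens the open-loop interval.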

Control Loop Integration

The producer-consumer pattern works well for overlapping GPU inference with robot execution. The GPU inference thread pulls an observation from the queue, calls the model, and pushes an action chunk to the output queue. The robot execution thread pulls chunks from the output queue and sends commands to the controller at the target frequency.

python
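import asyncio
import base64
import io
import queue

from PIL import Image

# Reuses `client` and `processor` from the Python client example above.

CONTROL_FREQUENCY_HZ = 10  # target control loop frequency

observation_queue = queue.Queue(maxsize=2)
action_queue = queue.Queue(maxsize=2)

def get_action_chunk(image: Image.Image, instruction: str, chunk_size: int = 8) -> list:
    # Encode the observation as a base64 JPEG (RGB only; JPEG has no alpha channel)
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG")
    img_b64 = base64.b64encode(buf.getvalue()).decode()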

    response = client.chat.completions.create(
        model="openvla",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}
                },
                {
                    "type": "text",
                    "text": f"What action should the robot take to {instruction}?"
                }
            ]
        }],
        max_tokens=7 * chunk_size,  # 7 action dimensions per step
        temperature=0.0
    )

    if not response.choices:
        raise RuntimeError("vLLM returned no completions; check server logs for errors")
    generated_text = response.choices[0].message.content
    # decode_actions returns a flat array of shape (7 * chunk_size,)
    all_actions = processor.decode_actions(generated_text)
    return [all_actions[i * 7:(i + 1) * 7] for i in range(chunk_size)]

async def inference_worker():
    while True:
        obs, instruction = await asyncio.get_running_loop().run_in_executor(
            None, observation_queue.get
        )
        action_chunk = await asyncio.get_running_loop().run_in_executor(
            None, get_action_chunk, obs, instruction, 8
        )
        await asyncio.get_running_loop().run_in_executor(None, action_queue.put, action_chunk)

async def execution_worker(robot_controller):
    while True:
        chunk = await asyncio.get_running_loop().run_in_executor(
            None, action_queue.get
        )
        for action in chunk:
            robot_controller.send(action)
            await asyncio.sleep(1.0 / CONTROL_FREQUENCY_HZ)

For a broader look at prefill-decode disaggregation patterns that can further reduce tail latency in multi-robot serving, see Prefill-Decode Disaggregation on GPU Cloud.

Fine-Tuning OpenVLA on a Custom Embodiment

Data Preparation

Record demonstrations as (RGB observation image, natural language instruction, action vector) triples. Each step in a demonstration is one training example. The action vector is a float32 array of 7 values in your robot's action space.

OpenVLA's data pipeline converts your action vectors to the 256-bin discrete token format the model expects. The normalization is per-dimension and computed from statistics across your training dataset. Generate those statistics before training; the script name normalize_actions.py below is illustrative, so check the official OpenVLA repo for the current data-preparation entry point.

bash
# Generate action normalization statistics from your dataset
python normalize_actions.py \
  --dataset_path /data/demos/my_robot \
  --output_path /data/demos/my_robot/action_stats.json
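To make the per-dimension statistics concrete, here is a minimal sketch of computing them with NumPy. The key names (q01, q99, mean, std) and the JSON layout are assumptions for illustration, not the file format the OpenVLA repo produces, and the synthetic actions stand in for your recorded demonstrations.

```python
import json

import numpy as np


def compute_action_stats(actions: np.ndarray) -> dict:
    """Per-dimension statistics over an array of shape (num_steps, 7)."""
    return {
        "q01": np.quantile(actions, 0.01, axis=0).tolist(),
        "q99": np.quantile(actions, 0.99, axis=0).tolist(),
        "mean": actions.mean(axis=0).tolist(),
        "std": actions.std(axis=0).tolist(),
    }


# Synthetic stand-in for recorded delta actions: 1,000 steps x 7 dimensions
rng = np.random.default_rng(0)
demo_actions = rng.normal(loc=0.0, scale=0.05, size=(1000, 7))

stats = compute_action_stats(demo_actions)
stats_json = json.dumps(stats, indent=2)
```

Percentile bounds are less sensitive to the occasional outlier step than a raw min and max, which matters when a single bad demonstration would otherwise stretch the bin range.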

LoRA Setup

LoRA on the language backbone is the practical path for cross-embodiment fine-tuning. The recommended configuration:

python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "o_proj"],  # Llama attention projections (LM backbone only)
    bias="none",
    task_type="CAUSAL_LM"
)

Apply LoRA only to the LM backbone, not the visual encoder. The ViT encoder handles observation features that transfer well across embodiments. The LM backbone is where the embodiment-specific action mapping lives. Applying LoRA to the ViT can hurt cross-embodiment transfer if your dataset is small.
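For a sense of scale, the number of trainable parameters in this configuration can be worked out by hand. The sketch assumes the standard Llama-2 7B shape (hidden size 4096, 32 layers) and three adapted 4096x4096 attention projections per layer; the 7 billion base parameter count is a round-number assumption.

```python
def lora_params(rank: int, in_features: int, out_features: int) -> int:
    """A LoRA adapter adds an (in x rank) and a (rank x out) matrix."""
    return rank * (in_features + out_features)


HIDDEN = 4096            # assumed Llama-2 7B hidden size
LAYERS = 32              # assumed number of transformer layers
ADAPTED_PER_LAYER = 3    # three attention projections per layer
RANK = 32

per_layer = ADAPTED_PER_LAYER * lora_params(RANK, HIDDEN, HIDDEN)
total_trainable = per_layer * LAYERS
fraction_of_base = total_trainable / 7e9

print(f"trainable: {total_trainable:,} ({fraction_of_base:.2%} of the base model)")
```

At roughly 25 million trainable parameters, optimizer state is small next to the frozen BF16 weights, which is why a single 80 GB card is sufficient for this setup.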

Training Command

bash
python finetune.py \
  --model_path /data/models/openvla-7b \
  --dataset_path /data/demos/my_robot \
  --action_stats_path /data/demos/my_robot/action_stats.json \
  --use_lora True \
  --lora_rank 32 \
  --lora_alpha 64 \
  --batch_size 16 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-4 \
  --num_steps 10000 \
  --save_every 500 \
  --output_dir /data/checkpoints/my_robot_lora

GPU-Hour Budget

| Dataset Size | GPU | Estimated Time | Estimated Cost |
| --- | --- | --- | --- |
| 1,000 demos | H100 SXM5 80GB | ~1.5 hrs | ~$4.65 |
| 10,000 demos | H100 SXM5 80GB | ~6 hrs | ~$18.60 |
| 50,000 demos | H100 SXM5 80GB | ~28 hrs | ~$86.80 |

Cost is based on $3.10/hr on-demand H100 SXM5 pricing. Demo count assumes each demonstration is roughly 100-200 steps at 10 Hz, which is typical for tabletop manipulation tasks.
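The cost column is simply hours multiplied by the hourly rate, so it is easy to recompute when pricing or run length changes. A small sketch using the figures from the table:

```python
H100_RATE_USD_PER_HR = 3.10  # on-demand rate quoted above


def run_cost(hours: float, rate: float = H100_RATE_USD_PER_HR) -> float:
    """Estimated cost of a single-GPU fine-tuning run, rounded to cents."""
    return round(hours * rate, 2)


# (number of demos, estimated hours) pairs from the table
budget = {demos: run_cost(hours) for demos, hours in [(1_000, 1.5), (10_000, 6), (50_000, 28)]}

for demos, cost in budget.items():
    print(f"{demos:>6} demos: ${cost:.2f}")
```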

For budget fine-tuning runs, the L40S on Spheron is a cost-effective option for small datasets (under 5,000 demos) at FP8 quantization.

Evaluation with LIBERO

The LIBERO simulation benchmark, used in the original OpenVLA paper, provides a standard evaluation framework for tabletop manipulation tasks across four task suites: spatial, object, goal, and long-horizon. Run LIBERO evaluations on your fine-tuned adapter before deploying to real hardware to confirm that task success rate improves over the base checkpoint.

A minimal evaluation run:

bash
# Install LIBERO
pip install libero

# Evaluate base checkpoint
python eval_libero.py \
  --model_path /data/models/openvla-7b \
  --suite libero_spatial \
  --num_trials 50

# Evaluate fine-tuned checkpoint
python eval_libero.py \
  --model_path /data/models/openvla-7b \
  --lora_path /data/checkpoints/my_robot_lora \
  --suite libero_spatial \
  --num_trials 50

If LIBERO task success rate does not improve after fine-tuning, check your action normalization statistics and confirm your dataset covers the full range of the task's object positions and configurations.
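With 50 trials per suite, a difference in success rate can fall within sampling noise. One way to check is a Wilson score interval on each result, sketched below. The success counts are hypothetical, and the choice of a 95% Wilson interval is an illustration rather than the evaluation protocol from the OpenVLA paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical results: base checkpoint 30/50, fine-tuned adapter 40/50
base_lo, base_hi = wilson_interval(30, 50)
tuned_lo, tuned_hi = wilson_interval(40, 50)

# Overlapping intervals suggest running more trials before drawing conclusions
intervals_overlap = base_hi > tuned_lo
```

If the intervals overlap, increase the number of trials or evaluate additional suites before treating the improvement as real.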

For related fine-tuning pipelines, see GRPO fine-tuning on GPU Cloud for reasoning-oriented RL-based approaches and DPO fine-tuning on GPU Cloud for preference-based refinement after an initial LoRA step.

Deployment Patterns

Edge Robot with Cloud GPU

The robot sends compressed observation images over a low-latency WAN connection. The cloud GPU instance runs OpenVLA and returns action tokens. The robot de-tokenizes and executes.

This works for 5 Hz control loops with a reliable network hop under 20 ms each way. Pair with action chunking to buffer against variable network latency. It fails at 10 Hz or on unreliable connections, because a single dropped packet or 50 ms network spike puts the controller behind schedule.

Best for: mobile robots or manipulators with 5 Hz target control frequency, located within a campus or data center network of the cloud instance.

Hybrid Inference Split

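Run the Prismatic ViT encoder on the robot's onboard GPU (an RTX 4090 or similar) and send only the resulting visual embeddings to the cloud for the LM decoder step. Size the payload before committing to this split. A raw 224x224 RGB frame is roughly 150 KB (224 x 224 x 3 bytes), and far less after JPEG compression. The encoder does not emit a single pooled vector: the LM consumes one embedding per image patch, so the payload is a few hundred patch embeddings with a few thousand values each, which in BF16 is on the order of 1-2 MB and therefore larger than the image itself. The split saves cloud-side ViT compute rather than bandwidth.

On a 1 Gbps LAN a payload of that size still transfers in roughly 10-20 ms, so the pattern fits a robot and GPU on the same local network; over WAN, sending the compressed image is usually cheaper. The cloud only needs to run the 7B LM decoder, which generates 7 action tokens per step.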

Best for: robots with a local GPU (RTX 4090, RTX Pro 6000) that need the LM capacity of a 7B model but want to keep the visual processing local.

Fallback to a Lightweight Policy Head

Keep a compact MLP policy on the robot as a fallback for when the cloud call misses the latency SLA. The MLP is trained by behavioral cloning on the same demonstration data. It handles routine, repetitive motions where the 7B model is not needed, and activates only when the cloud call exceeds a latency threshold.

OpenVLA handles novel, ambiguous, or instruction-conditional tasks. The MLP handles the high-frequency portions of motions where the trajectory is already committed. This dual-track setup means a network hiccup does not stop the robot mid-task.
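A minimal sketch of the deadline-based switch, using a future with a timeout. Both policies are placeholders: the simulated delay stands in for network plus inference time, and the zero action stands in for the local MLP's output.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

import numpy as np

executor = ThreadPoolExecutor(max_workers=2)


def cloud_policy(observation: np.ndarray, simulated_delay_s: float) -> np.ndarray:
    """Placeholder for the remote OpenVLA call."""
    time.sleep(simulated_delay_s)
    return np.full(7, 0.1)


def fallback_policy(observation: np.ndarray) -> np.ndarray:
    """Placeholder for the on-robot MLP policy."""
    return np.zeros(7)


def select_action(observation: np.ndarray, deadline_s: float, simulated_delay_s: float):
    """Use the cloud action if it arrives before the deadline, else the fallback."""
    future = executor.submit(cloud_policy, observation, simulated_delay_s)
    try:
        return future.result(timeout=deadline_s), "cloud"
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running is not interrupted
        return fallback_policy(observation), "fallback"


obs = np.zeros((224, 224, 3), dtype=np.uint8)
on_time_action, on_time_source = select_action(obs, deadline_s=0.2, simulated_delay_s=0.01)
late_action, late_source = select_action(obs, deadline_s=0.02, simulated_delay_s=0.3)
```

A late cloud response is simply discarded in this sketch. A real controller would also need to decide whether a late chunk is still usable once it arrives.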

For more on combining cloud and edge inference, see Hybrid Cloud and Edge AI Inference Guide.

OpenVLA vs RT-2 vs Pi-0

| Model | Open-Weight? | Fine-Tunable? | Inference Latency | Action Space | Training Data |
| --- | --- | --- | --- | --- | --- |
| OpenVLA 7B | Yes (MIT) | Yes (LoRA) | ~100-150 ms self-hosted H100 | 7-DoF delta | Open X-Embodiment (~970k eps) |
| RT-2 (Google) | No | No | 200-600 ms (API) | 6-DoF delta | Google internal |
| Pi-0 (Physical Intelligence) | No | Partial (via API) | ~300 ms (API) | Flow matching policy | Physical Intelligence internal |

OpenVLA's openness matters when your team has proprietary robot demonstrations, works under data residency constraints, operates in environments without reliable internet, or cannot absorb per-call API fees at scale. A fleet of 20 robots making 10 calls per second is 200 calls per second; API costs add up fast at that frequency.

RT-2 and Pi-0 have an edge in pre-trained generalization. They were trained on far more data and compute. For teams without proprietary embodiment data or latency constraints, the API options are simpler to get started with. The decision usually comes down to whether you need to fine-tune on your own robot, or whether the base model's generalization is sufficient.

Teams running multi-robot fleets at scale often need the same TRT-LLM engine optimizations covered in the TensorRT-LLM Production Deployment Guide.


Robotics teams using OpenVLA need predictable bare-metal latency, not serverless cold starts. Spheron provides on-demand H100 SXM5 instances from $3.10/hr and A100 80GB from $1.64/hr, with provisioning in under 90 seconds and no minimum commitment.
