A single-model vLLM endpoint stops being enough the moment you need to chain an embedding model into a reranker into an LLM. You end up either writing a custom HTTP proxy or bolting together microservices that weren't designed to talk to each other. Ray Serve solves multi-model orchestration in Python: one codebase, typed inter-service calls, and autoscaling that responds to queue depth rather than CPU utilization. Before diving into Ray Serve specifics, the vLLM production deployment guide covers vLLM fundamentals you'll need for the single-model baseline. For framework alternatives, the Triton Inference Server deployment guide and SGLang deployment guide cover the other major options.
Ray Serve Architecture: Deployments, Replicas, and DeploymentHandle
Ray Serve's serving model has three concepts worth understanding before you write any code.
Deployments are Python classes decorated with @serve.deployment. Each deployment is independently scalable and runs as a set of Ray actors. You define the model loading in __init__ and inference logic in async methods. Deployments map cleanly to the ML building blocks you already have: one class per model, one class per pipeline stage.
Replicas are instances of a deployment. Ray Serve's HTTP proxy load-balances requests across replicas using round-robin by default. You can switch to consistent hashing for session affinity when you need it. Adding replicas is a config change, not a code change: num_replicas: 4 in your YAML and Ray handles placement across the cluster.
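The default routing policy is easy to picture with a toy model. This sketch is illustrative only (the real proxy also accounts for replica health and in-flight load caps via max_ongoing_requests), but it shows what round-robin means for request placement:

```python
from itertools import cycle

# Toy model of Ray Serve's default round-robin routing: the HTTP proxy
# hands each incoming request to the next replica in rotation.
replicas = ["replica-0", "replica-1", "replica-2", "replica-3"]
rotation = cycle(replicas)

# Six requests land on replicas 0,1,2,3 then wrap back to 0,1.
assignments = [next(rotation) for _ in range(6)]
print(assignments)
# ['replica-0', 'replica-1', 'replica-2', 'replica-3', 'replica-0', 'replica-1']
```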
DeploymentHandle is a typed Python object for calling one deployment from another without going through HTTP. Instead of requests.post("http://localhost:8001/embed", ...), you call await self.embedding.encode.remote(text) and get the result back as a Python value. This removes the serialization overhead and inter-service latency of HTTP round-trips for calls that stay inside the cluster.
One Ray 2.x API change that trips people up: Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or inject a DeploymentHandle via the constructor when composing pipelines.
The @serve.ingress decorator wraps a FastAPI router, so your HTTP surface looks like a normal FastAPI app. Incoming requests hit Ray Serve's HTTP proxy, which routes them to available replicas.
When to Choose Ray Serve
Not every LLM workload needs Ray Serve. Here's when it earns its place:
| Use Case | Best Fit |
|---|---|
| Single LLM, OpenAI-compatible endpoint | vLLM built-in server |
| Multi-model pipeline (embed + rerank + LLM) | Ray Serve |
| Multi-framework serving (LLM + ONNX + TRT) | Triton Inference Server |
| Agentic workloads with high prefix reuse | SGLang |
| Python-native autoscaling with queue depth signals | Ray Serve |
| Kubernetes-native serving with existing k8s cluster | llm-d or vLLM Helm chart |
For throughput numbers on how these frameworks compare head-to-head, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
The short version: if you're serving one model and your request pipeline has no branching or multi-model composition, vLLM's built-in OpenAI server is the right tool. Ray Serve adds overhead (~1-2ms per request) that's only worth it when you get something back: pipeline composition, queue-depth autoscaling, or Python-native routing logic.
Single-Node Ray Serve + vLLM on Spheron
Start with a single-node setup. One GPU, one head node, one deployment.
Provision the instance. Get an H100 GPU rental on Spheron. SSH in and verify CUDA is available:
nvidia-smi
# Should show H100 80GB with CUDA version
Install dependencies:
pip install 'ray[serve]==2.24.0' vllm torch
# Confirm versions match
python -c "import ray; print(ray.__version__)"
python -c "import vllm; print(vllm.__version__)"Start Ray:
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
  --block
Ray Dashboard will be available at http://<instance-ip>:8265.
Write the deployment. Save this as vllm_deployment.py:
import uuid
from fastapi import FastAPI
from ray import serve
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
app = FastAPI()
@serve.deployment(
num_replicas=1,
ray_actor_options={"num_gpus": 1},
max_ongoing_requests=100,
)
@serve.ingress(app)
class VLLMDeployment:
def __init__(self, model: str = "meta-llama/Llama-3.3-70B-Instruct"):
engine_args = AsyncEngineArgs(
model=model,
dtype="fp8",
gpu_memory_utilization=0.90,
max_model_len=8192,
)
self.engine = AsyncLLMEngine.from_engine_args(engine_args)
@app.post("/v1/chat/completions")
async def chat(self, request: dict):
messages = request.get("messages", [])
prompt = "\n".join(f"{m.get('role', '')}: {m.get('content', '')}" for m in messages)
sampling_params = SamplingParams(
temperature=request.get("temperature", 0.7),
max_tokens=request.get("max_tokens", 512),
)
request_id = str(uuid.uuid4())
        # The engine yields cumulative RequestOutputs; keep the final text
        final_text = ""
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            if output.finished and output.outputs:
                final_text = output.outputs[0].text
        return {"choices": [{"message": {"content": final_text}}]}
@app.get("/v1/models")
async def models(self):
return {"data": [{"id": "llama-3.3-70b", "object": "model"}]}
entrypoint = VLLMDeployment.bind()
Create serve_config.yaml:
applications:
- name: llm-api
import_path: vllm_deployment:entrypoint
deployments:
- name: VLLMDeployment
ray_actor_options:
num_gpus: 1
autoscaling_config:
min_replicas: 1
max_replicas: 2
target_ongoing_requests: 5
upscale_delay_s: 10
      downscale_delay_s: 60
Deploy and test:
serve deploy serve_config.yaml
# Wait for deployment to become healthy (~2-3 minutes for model load)
curl http://localhost:8000/v1/models
# {"data": [{"id": "llama-3.3-70b", "object": "model"}]}
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 50}'Multi-Node Ray Cluster on Spheron
For models that don't fit on a single GPU or for higher throughput, you need multiple nodes.
Provision two instances in the same Spheron region. Instances in the same region share a private subnet, which is what you want for Ray cluster communication. Note the private IP of each instance:
hostname -I | awk '{print $1}'
# Returns something like 10.0.1.5
Use this private IP for cluster communication, not the public IP. The public IP works but adds latency and may hit firewall restrictions between nodes.
Firewall note: Open ports 6379 (Ray head), 8265 (Dashboard), and the range 10001-10999 (Ray object store) between nodes on your private network before starting the cluster.
Start the head node:
# On head node
ray start \
--head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
  --block
Record the head node's private IP: HEAD_IP=$(hostname -I | awk '{print $1}').
Join worker nodes:
# On each worker node - use head node's PRIVATE IP
ray start \
--address=10.0.1.5:6379 \
--num-gpus=1 \
  --block
Workers must reach the head node over the private network. Verify connectivity before joining:
ping 10.0.1.5
# Should respond before running ray start
Verify the cluster from the head node:
ray status
# Expected output:
# Resources
# ---------------------------------------------------------------
# Usage:
# 0.0/2.0 GPU
# 2.0/8.0 CPU
# ...
# Node status
# ---------------------------------------------------------------
# Active:
# 2 node(s) with resources: {"GPU": 1.0, "CPU": 4.0}
You should see 2 nodes and 2 GPUs listed.
Scale the deployment across both nodes by bumping num_replicas:
# serve_config.yaml - updated for multi-node
applications:
- name: llm-api
import_path: vllm_deployment:entrypoint
deployments:
- name: VLLMDeployment
num_replicas: 2
ray_actor_options:
      num_gpus: 1
Ray Serve places each replica on a node with an available GPU. With 2 nodes and 2 GPUs, each replica gets its own GPU. Deploy the same way:
serve deploy serve_config.yaml
For multi-node bandwidth and networking context, see the multi-node networking guide for tradeoffs when InfiniBand is not available.
Autoscaling LLM Replicas Based on Queue Depth
Ray Serve's autoscaling policy is queue-depth-based, which is the right signal for LLM serving. CPU or memory utilization doesn't capture the actual bottleneck: GPU saturation and inference queue length.
autoscaling_config:
min_replicas: 1
max_replicas: 4
target_ongoing_requests: 5
upscale_delay_s: 10
  downscale_delay_s: 60
How it works. target_ongoing_requests: 5 means Ray Serve adds a replica when the average in-flight request count per replica exceeds 5. With 10 concurrent requests hitting one replica, Ray Serve triggers scale-up. When traffic drops below the target, it waits downscale_delay_s seconds before removing a replica.
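The core scaling decision reduces to a simple formula: target replica count is the total in-flight requests divided by the per-replica target, clamped to the configured bounds. This is a sketch of that decision rule only; the real controller additionally smooths over a metrics window and applies the upscale/downscale delays before acting.

```python
import math

def desired_replicas(total_ongoing: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Sketch of Ray Serve's queue-depth scaling rule: drive the average
    in-flight request count per replica toward target_ongoing_requests."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

# 10 in-flight requests against target_ongoing_requests: 5 -> 2 replicas
print(desired_replicas(10, 5, min_replicas=1, max_replicas=4))   # 2
# A spike to 30 in-flight requests is capped at max_replicas
print(desired_replicas(30, 5, min_replicas=1, max_replicas=4))   # 4
```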
GPU provisioning limit. Autoscaling adds replicas but does not provision new GPU nodes. If max_replicas: 4 and you only have 2 GPUs in the cluster, Ray Serve can only place 2 replicas (one GPU per replica). The other 2 pending replicas sit in "PENDING" state until GPUs become available. Combine Ray Serve autoscaling with external node provisioning (or pre-provision spare GPUs in the cluster) if you want true elastic scale.
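The placement ceiling is worth computing before you set max_replicas. A hypothetical helper (not a Ray API) makes the arithmetic explicit:

```python
def placeable_replicas(cluster_gpus: int, gpus_per_replica: int,
                       max_replicas: int) -> tuple[int, int]:
    """How many replicas can actually be placed on fixed GPU capacity,
    and how many would sit PENDING. Hypothetical helper for illustration."""
    placeable = min(max_replicas, cluster_gpus // gpus_per_replica)
    pending = max_replicas - placeable
    return placeable, pending

# max_replicas: 4 but only 2 GPUs in the cluster at 1 GPU per replica
print(placeable_replicas(cluster_gpus=2, gpus_per_replica=1, max_replicas=4))
# (2, 2) -> 2 running, 2 stuck PENDING until GPUs are added
```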
For teams running on Kubernetes, the Kubernetes GPU orchestration guide covers k8s-native autoscaling that can provision new GPU nodes automatically.
Composing Multi-Model Agent Pipelines
The place Ray Serve beats a collection of vLLM servers is multi-model pipelines. Here's a concrete RAG pipeline: embedding -> reranker -> LLM, wired together with DeploymentHandle.
import asyncio
from ray import serve
from ray.serve.handle import DeploymentHandle
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class QueryRequest(BaseModel):
query: str
candidates: list[str]
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class EmbeddingModel:
def __init__(self):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")
def encode(self, text: str) -> list[float]:
return self.model.encode(text).tolist()
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Reranker:
def __init__(self):
from sentence_transformers import CrossEncoder
self.model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(self, query: str, candidates: list[str]) -> list[str]:
scores = self.model.predict([(query, c) for c in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
return [c for _, c in ranked[:3]]
@serve.deployment(ray_actor_options={"num_gpus": 1})
class LLM:
def __init__(self):
from vllm import AsyncLLMEngine, AsyncEngineArgs
engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")
self.engine = AsyncLLMEngine.from_engine_args(engine_args)
async def generate(self, context: list[str], query: str) -> str:
prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {query}\nAnswer:"
from vllm import SamplingParams
import uuid
params = SamplingParams(max_tokens=256)
request_id = str(uuid.uuid4())
async for output in self.engine.generate(prompt, params, request_id):
if output.finished and output.outputs:
return output.outputs[0].text
return ""
@serve.deployment
@serve.ingress(app)
class RAGPipeline:
def __init__(
self,
embedding: DeploymentHandle,
reranker: DeploymentHandle,
llm: DeploymentHandle,
):
self.embedding = embedding
self.reranker = reranker
self.llm = llm
@app.post("/query")
async def query(self, request: QueryRequest):
import numpy as np
query_embedding = await self.embedding.encode.remote(request.query)
candidate_embeddings = await asyncio.gather(
*[self.embedding.encode.remote(c) for c in request.candidates]
)
q = np.array(query_embedding)
scores = [
float(np.dot(q, np.array(e)) / (np.linalg.norm(q) * np.linalg.norm(np.array(e)) + 1e-9))
for e in candidate_embeddings
]
top_k = sorted(zip(scores, request.candidates), reverse=True)[:10]
filtered_candidates = [c for _, c in top_k]
top_docs = await self.reranker.rerank.remote(request.query, filtered_candidates)
answer = await self.llm.generate.remote(top_docs, request.query)
return {"answer": answer}
embedding = EmbeddingModel.bind()
reranker = Reranker.bind()
llm = LLM.bind()
entrypoint = RAGPipeline.bind(embedding=embedding, reranker=reranker, llm=llm)
Fractional GPU allocation. num_gpus: 0.5 lets two small models share a single GPU. This only works for models that fit in half the GPU's VRAM. vLLM does not support fractional GPUs; it requires whole-GPU allocations. Use fractional allocation for the embedding and reranker models (which are small), not for the main LLM deployment.
The DeploymentHandle injection via constructor (embedding: DeploymentHandle in RAGPipeline.__init__) is the Ray 2.x pattern. The handles are created by .bind() calls when composing the application graph and passed as constructor arguments when the deployment is initialized.
For routing-only patterns without full pipeline composition, the LLM inference router guide covers lighter-weight approaches.
Observability: Ray Dashboard, Prometheus, and Request Traces
Ray Serve exposes three observability surfaces.
Ray Dashboard runs at http://<head-ip>:8265. It shows replica health (RUNNING/PENDING/UNHEALTHY), in-flight request counts per deployment, and per-node GPU utilization. This is the first place to look when a deployment is not scaling as expected.
Prometheus metrics scrape from http://<head-ip>:8080/metrics. Key metrics for LLM serving:
| Metric | What It Tells You |
|---|---|
| ray_serve_num_ongoing_requests | Current in-flight requests per replica |
| ray_serve_deployment_replica_starts | Replica churn (high values indicate instability) |
| ray_serve_http_request_latency_ms | End-to-end request latency at the HTTP proxy |
| ray_serve_queue_length | Pending requests waiting for a replica |
Wire these into a Grafana dashboard alongside GPU utilization from nvidia_smi_* metrics for a complete serving picture.
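A minimal scrape job gets these metrics into Prometheus. This is a sketch assuming Prometheus can reach the head node's metrics port from the text above; the job name and 15s interval are illustrative choices:

```yaml
# prometheus.yml - scrape job for the Ray metrics endpoint
scrape_configs:
  - job_name: ray-serve
    scrape_interval: 15s
    static_configs:
      - targets: ["<head-ip>:8080"]
```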
OpenTelemetry trace export is available via ray.init() environment variables for distributed tracing across deployments. For most serving use cases, the Ray Dashboard and Prometheus metrics are sufficient. Distributed traces add value when debugging latency spikes across a multi-stage pipeline where you need to isolate which deployment stage is the bottleneck.
For broader GPU observability, the GPU monitoring guide covers the full stack from nvidia-smi to production alerting.
Cost: Ray Serve on Spheron vs Managed Anyscale for a 70B Model
Running a 70B parameter model at 10,000 requests per hour with 512 output tokens per request.
A single H100 PCIe 80GB fits Llama 3.3 70B at FP8 with reduced max_model_len (e.g. 4096) and small batch sizes, delivering roughly 12,000 requests per hour at that output length. For unconstrained production throughput, two H100 80GB GPUs are the practical minimum.
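The "fits on a single H100" claim comes from back-of-envelope weight math, which also explains why max_model_len has to shrink. This ignores activation memory and framework overhead, so treat it as a rough bound:

```python
params_b = 70          # Llama 3.3 70B parameter count, in billions
bytes_per_param = 1.0  # FP8 stores one byte per weight
weights_gb = params_b * bytes_per_param            # ~70 GB of weights

hbm_gb = 80            # H100 80GB
usable_gb = hbm_gb * 0.90                          # gpu_memory_utilization=0.90
headroom_gb = usable_gb - weights_gb               # left over for KV cache
print(f"weights: {weights_gb:.0f} GB, KV-cache headroom: {headroom_gb:.0f} GB")
# weights: 70 GB, KV-cache headroom: 2 GB
```

Only a couple of gigabytes remain for KV cache, which is why short context limits and small batches are required on one GPU, and why two GPUs are the practical minimum for unconstrained throughput.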
| Configuration | GPU | Throughput (req/hr) | Cost/hr | Cost per 1M requests |
|---|---|---|---|---|
| Spheron bare-metal H100 + Ray Serve | 1x H100 PCIe | ~12k req/hr | $2.01/hr (on-demand) | ~$168 |
| Anyscale managed Ray | H100 (managed) | ~12k req/hr | ~3-4x Spheron rate | significantly higher |
Anyscale's managed platform adds a significant markup over bare-metal GPU rates. Their pricing includes cluster management, autoscaling automation, and the Anyscale platform overhead. For teams that want Ray without platform lock-in, bare-metal GPU cloud gives you the same Ray framework at infrastructure cost.
For teams on a tighter budget, A100 on Spheron handles Llama 3.1 8B and 13B models at $1.04/hr on-demand. At much higher throughput for smaller models, the cost per million requests still lands well below $100.
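The cost-per-million-requests figures in the table above reduce to one formula, shown here with the H100 numbers from the table (which reproduce the ~$168 entry):

```python
def cost_per_million(price_per_hr: float, req_per_hr: float) -> float:
    """Dollars per one million requests at a given hourly rate and throughput."""
    return price_per_hr / req_per_hr * 1_000_000

# H100 on-demand at $2.01/hr sustaining ~12k req/hr
print(f"${cost_per_million(2.01, 12_000):.2f} per 1M requests")
```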
Pricing fluctuates based on GPU availability. The prices above were checked on 22 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Ray Serve on Spheron bare-metal GPUs gives you the Python-native flexibility of managed Anyscale with predictable pricing and no platform markup. Provision a head node and start serving today.
