A single-model vLLM endpoint stops being enough the moment you need to chain an embedding model into a reranker into an LLM. You end up either writing a custom HTTP proxy or bolting together microservices that weren't designed to talk to each other. Ray Serve solves multi-model orchestration in Python: one codebase, typed inter-service calls, and autoscaling that responds to queue depth rather than CPU utilization. Before diving into Ray Serve specifics, the vLLM production deployment guide covers vLLM fundamentals you'll need for the single-model baseline. For framework alternatives, the Triton Inference Server deployment guide and SGLang deployment guide cover the other major options.
Ray Serve Architecture: Deployments, Replicas, and DeploymentHandle
Ray Serve's serving model has three concepts worth understanding before you write any code.
Deployments are Python classes decorated with @serve.deployment. Each deployment is independently scalable and runs as a set of Ray actors. You define the model loading in __init__ and inference logic in async methods. Deployments map cleanly to the ML building blocks you already have: one class per model, one class per pipeline stage.
Replicas are instances of a deployment. Ray Serve's HTTP proxy load-balances requests across replicas using round-robin by default. You can switch to consistent hashing for session affinity when you need it. Adding replicas is a config change, not a code change: num_replicas: 4 in your YAML and Ray handles placement across the cluster.
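The default routing policy is easy to picture with a toy model. This sketch is illustrative only (the real proxy also accounts for replica health and in-flight load caps via max_ongoing_requests), but it shows what round-robin means for request placement:

```python
from itertools import cycle

# Toy model of Ray Serve's default round-robin routing: the HTTP proxy
# hands each incoming request to the next replica in rotation.
replicas = ["replica-0", "replica-1", "replica-2", "replica-3"]
rotation = cycle(replicas)

# Six requests land on replicas 0,1,2,3 then wrap back to 0,1.
assignments = [next(rotation) for _ in range(6)]
print(assignments)
# ['replica-0', 'replica-1', 'replica-2', 'replica-3', 'replica-0', 'replica-1']
```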
DeploymentHandle is a typed Python object for calling one deployment from another without going through HTTP. Instead of requests.post("http://localhost:8001/embed", ...), you call await self.embedding.encode.remote(text) and get the result back as a Python value. This removes the serialization overhead and inter-service latency of HTTP round-trips for calls that stay inside the cluster.
One Ray 2.x API change that trips people up: Deployment.get_handle() was removed. Use serve.get_deployment_handle("deployment_name") or inject a DeploymentHandle via the constructor when composing pipelines.
The @serve.ingress decorator wraps a FastAPI router, so your HTTP surface looks like a normal FastAPI app. Incoming requests hit Ray Serve's HTTP proxy, which routes them to available replicas.
When to Choose Ray Serve
Not every LLM workload needs Ray Serve. Here's when it earns its place:
| Use Case | Best Fit |
|---|---|
| Single LLM, OpenAI-compatible endpoint | vLLM built-in server |
| Multi-model pipeline (embed + rerank + LLM) | Ray Serve |
| Multi-framework serving (LLM + ONNX + TRT) | Triton Inference Server |
| Agentic workloads with high prefix reuse | SGLang |
| Python-native autoscaling with queue depth signals | Ray Serve |
| Kubernetes-native serving with existing k8s cluster | llm-d or vLLM Helm chart |
For throughput numbers on how these frameworks compare head-to-head, see vLLM vs TensorRT-LLM vs SGLang benchmarks.
The short version: if you're serving one model and your request pipeline has no branching or multi-model composition, vLLM's built-in OpenAI server is the right tool. Ray Serve adds overhead (~1-2ms per request) that's only worth it when you get something back: pipeline composition, queue-depth autoscaling, or Python-native routing logic.
Single-Node Ray Serve + vLLM on Spheron
Start with a single-node setup. One GPU, one head node, one deployment.
Provision the instance. Get an H100 GPU rental on Spheron. SSH in and verify CUDA is available:
nvidia-smi
# Should show H100 80GB with CUDA version
Install dependencies:
pip install 'ray[serve]==2.24.0' vllm torch
# Confirm versions match
python -c "import ray; print(ray.__version__)"
python -c "import vllm; print(vllm.__version__)"Start Ray:
ray start --head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
  --block
Ray Dashboard will be available at http://<instance-ip>:8265.
Write the deployment. Save this as vllm_deployment.py:
import uuid
from fastapi import FastAPI
from ray import serve
from vllm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams
app = FastAPI()
@serve.deployment(
num_replicas=1,
ray_actor_options={"num_gpus": 1},
max_ongoing_requests=100,
)
@serve.ingress(app)
class VLLMDeployment:
def __init__(self, model: str = "meta-llama/Llama-3.3-70B-Instruct"):
engine_args = AsyncEngineArgs(
model=model,
dtype="fp8",
gpu_memory_utilization=0.90,
max_model_len=8192,
)
self.engine = AsyncLLMEngine.from_engine_args(engine_args)
@app.post("/v1/chat/completions")
async def chat(self, request: dict):
messages = request.get("messages", [])
prompt = "\n".join(f"{m.get('role', '')}: {m.get('content', '')}" for m in messages)
sampling_params = SamplingParams(
temperature=request.get("temperature", 0.7),
max_tokens=request.get("max_tokens", 512),
)
request_id = str(uuid.uuid4())
        # The engine yields cumulative RequestOutputs; keep the final text
        final_text = ""
        async for output in self.engine.generate(prompt, sampling_params, request_id):
            if output.finished and output.outputs:
                final_text = output.outputs[0].text
        return {"choices": [{"message": {"content": final_text}}]}
@app.get("/v1/models")
async def models(self):
return {"data": [{"id": "llama-3.3-70b", "object": "model"}]}
entrypoint = VLLMDeployment.bind()
Create serve_config.yaml:
applications:
- name: llm-api
import_path: vllm_deployment:entrypoint
deployments:
- name: VLLMDeployment
ray_actor_options:
num_gpus: 1
autoscaling_config:
min_replicas: 1
max_replicas: 2
target_ongoing_requests: 5
upscale_delay_s: 10
      downscale_delay_s: 60
Deploy and test:
serve deploy serve_config.yaml
# Wait for deployment to become healthy (~2-3 minutes for model load)
curl http://localhost:8000/v1/models
# {"data": [{"id": "llama-3.3-70b", "object": "model"}]}
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "What is 2+2?"}], "max_tokens": 50}'Multi-Node Ray Cluster on Spheron
For models that don't fit on a single GPU or for higher throughput, you need multiple nodes.
Provision two instances in the same Spheron region. Instances in the same region share a private subnet, which is what you want for Ray cluster communication. Note the private IP of each instance:
hostname -I | awk '{print $1}'
# Returns something like 10.0.1.5
Use this private IP for cluster communication, not the public IP. The public IP works but adds latency and may hit firewall restrictions between nodes.
Firewall note: Open ports 6379 (Ray head), 8265 (Dashboard), and the range 10001-10999 (Ray object store) between nodes on your private network before starting the cluster.
Start the head node:
# On head node
ray start \
--head \
--port=6379 \
--dashboard-host=0.0.0.0 \
--dashboard-port=8265 \
  --block
Record the head node's private IP: HEAD_IP=$(hostname -I | awk '{print $1}').
Join worker nodes:
# On each worker node - use head node's PRIVATE IP
ray start \
--address=10.0.1.5:6379 \
--num-gpus=1 \
  --block
Workers must reach the head node over the private network. Verify connectivity before joining:
ping 10.0.1.5
# Should respond before running ray start
Verify the cluster from the head node:
ray status
# Expected output:
# Resources
# ---------------------------------------------------------------
# Usage:
# 0.0/2.0 GPU
# 2.0/8.0 CPU
# ...
# Node status
# ---------------------------------------------------------------
# Active:
# 2 node(s) with resources: {"GPU": 1.0, "CPU": 4.0}
You should see 2 nodes and 2 GPUs listed.
Scale the deployment across both nodes by bumping num_replicas:
# serve_config.yaml - updated for multi-node
applications:
- name: llm-api
import_path: vllm_deployment:entrypoint
deployments:
- name: VLLMDeployment
num_replicas: 2
ray_actor_options:
      num_gpus: 1
Ray Serve places each replica on a node with an available GPU. With 2 nodes and 2 GPUs, each replica gets its own GPU. Deploy the same way:
serve deploy serve_config.yaml
For multi-node bandwidth and networking context, see the multi-node networking guide for tradeoffs when InfiniBand is not available.
Autoscaling LLM Replicas Based on Queue Depth
Ray Serve's autoscaling policy is queue-depth-based, which is the right signal for LLM serving. CPU or memory utilization doesn't capture the actual bottleneck: GPU saturation and inference queue length.
autoscaling_config:
min_replicas: 1
max_replicas: 4
target_ongoing_requests: 5
upscale_delay_s: 10
  downscale_delay_s: 60
How it works. target_ongoing_requests: 5 means Ray Serve adds a replica when the average in-flight request count per replica exceeds 5. With 10 concurrent requests hitting one replica, Ray Serve triggers scale-up. When traffic drops below the target, it waits downscale_delay_s seconds before removing a replica.
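The core scaling decision reduces to a simple formula: target replica count is the total in-flight requests divided by the per-replica target, clamped to the configured bounds. This is a sketch of that decision rule only; the real controller additionally smooths over a metrics window and applies the upscale/downscale delays before acting.

```python
import math

def desired_replicas(total_ongoing: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Sketch of Ray Serve's queue-depth scaling rule: drive the average
    in-flight request count per replica toward target_ongoing_requests."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

# 10 in-flight requests against target_ongoing_requests: 5 -> 2 replicas
print(desired_replicas(10, 5, min_replicas=1, max_replicas=4))   # 2
# A spike to 30 in-flight requests is capped at max_replicas
print(desired_replicas(30, 5, min_replicas=1, max_replicas=4))   # 4
```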
GPU provisioning limit. Autoscaling adds replicas but does not provision new GPU nodes. If max_replicas: 4 and you only have 2 GPUs in the cluster, Ray Serve can only place 2 replicas (one GPU per replica). The other 2 pending replicas sit in "PENDING" state until GPUs become available. Combine Ray Serve autoscaling with external node provisioning (or pre-provision spare GPUs in the cluster) if you want true elastic scale.
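The placement ceiling is worth computing before you set max_replicas. A hypothetical helper (not a Ray API) makes the arithmetic explicit:

```python
def placeable_replicas(cluster_gpus: int, gpus_per_replica: int,
                       max_replicas: int) -> tuple[int, int]:
    """How many replicas can actually be placed on fixed GPU capacity,
    and how many would sit PENDING. Hypothetical helper for illustration."""
    placeable = min(max_replicas, cluster_gpus // gpus_per_replica)
    pending = max_replicas - placeable
    return placeable, pending

# max_replicas: 4 but only 2 GPUs in the cluster at 1 GPU per replica
print(placeable_replicas(cluster_gpus=2, gpus_per_replica=1, max_replicas=4))
# (2, 2) -> 2 running, 2 stuck PENDING until GPUs are added
```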
For teams running on Kubernetes, the Kubernetes GPU orchestration guide covers k8s-native autoscaling that can provision new GPU nodes automatically.
Composing Multi-Model Agent Pipelines
The place Ray Serve beats a collection of vLLM servers is multi-model pipelines. Here's a concrete RAG pipeline: embedding -> reranker -> LLM, wired together with DeploymentHandle.
import asyncio
from ray import serve
from ray.serve.handle import DeploymentHandle
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class QueryRequest(BaseModel):
query: str
candidates: list[str]
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class EmbeddingModel:
def __init__(self):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer("BAAI/bge-small-en-v1.5")
def encode(self, text: str) -> list[float]:
return self.model.encode(text).tolist()
@serve.deployment(ray_actor_options={"num_gpus": 0.5})
class Reranker:
def __init__(self):
from sentence_transformers import CrossEncoder
self.model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(self, query: str, candidates: list[str]) -> list[str]:
scores = self.model.predict([(query, c) for c in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
return [c for _, c in ranked[:3]]
@serve.deployment(ray_actor_options={"num_gpus": 1})
class LLM:
def __init__(self):
from vllm import AsyncLLMEngine, AsyncEngineArgs
engine_args = AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")
self.engine = AsyncLLMEngine.from_engine_args(engine_args)
async def generate(self, context: list[str], query: str) -> str:
prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {query}\nAnswer:"
from vllm import SamplingParams
import uuid
params = SamplingParams(max_tokens=256)
request_id = str(uuid.uuid4())
async for output in self.engine.generate(prompt, params, request_id):
if output.finished and output.outputs:
return output.outputs[0].text
return ""
@serve.deployment
@serve.ingress(app)
class RAGPipeline:
def __init__(
self,
embedding: DeploymentHandle,
reranker: DeploymentHandle,
llm: DeploymentHandle,
):
self.embedding = embedding
self.reranker = reranker
self.llm = llm
@app.post("/query")
async def query(self, request: QueryRequest):
import numpy as np
query_embedding = await self.embedding.encode.remote(request.query)
candidate_embeddings = await asyncio.gather(
*[self.embedding.encode.remote(c) for c in request.candidates]
)
q = np.array(query_embedding)
scores = [
float(np.dot(q, np.array(e)) / (np.linalg.norm(q) * np.linalg.norm(np.array(e)) + 1e-9))
for e in candidate_embeddings
]
top_k = sorted(zip(scores, request.candidates), reverse=True)[:10]
filtered_candidates = [c for _, c in top_k]
top_docs = await self.reranker.rerank.remote(request.query, filtered_candidates)
answer = await self.llm.generate.remote(top_docs, request.query)
return {"answer": answer}
embedding = EmbeddingModel.bind()
reranker = Reranker.bind()
llm = LLM.bind()
entrypoint = RAGPipeline.bind(embedding=embedding, reranker=reranker, llm=llm)
Fractional GPU allocation. num_gpus: 0.5 lets two small models share a single GPU. This only works for models that fit in half the GPU's VRAM. vLLM does not support fractional GPUs; it requires whole-GPU allocations. Use fractional allocation for the embedding and reranker models (which are small), not for the main LLM deployment.
The DeploymentHandle injection via constructor (embedding: DeploymentHandle in RAGPipeline.__init__) is the Ray 2.x pattern. The handles are created by .bind() calls when composing the application graph and passed as constructor arguments when the deployment is initialized.
For routing-only patterns without full pipeline composition, the LLM inference router guide covers lighter-weight approaches.
Observability: Ray Dashboard, Prometheus, and Request Traces
Ray Serve exposes three observability surfaces.
Ray Dashboard runs at http://<head-ip>:8265. It shows replica health (RUNNING/PENDING/UNHEALTHY), in-flight request counts per deployment, and per-node GPU utilization. This is the first place to look when a deployment is not scaling as expected.
Prometheus metrics scrape from http://<head-ip>:8080/metrics. Key metrics for LLM serving:
| Metric | What It Tells You |
|---|---|
| ray_serve_num_ongoing_requests | Current in-flight requests per replica |
| ray_serve_deployment_replica_starts | Replica churn (high values indicate instability) |
| ray_serve_http_request_latency_ms | End-to-end request latency at the HTTP proxy |
| ray_serve_queue_length | Pending requests waiting for a replica |
Wire these into a Grafana dashboard alongside GPU utilization from nvidia_smi_* metrics for a complete serving picture.
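A minimal scrape job gets these metrics into Prometheus. This is a sketch assuming Prometheus can reach the head node's metrics port from the text above; the job name and 15s interval are illustrative choices:

```yaml
# prometheus.yml - scrape job for the Ray metrics endpoint
scrape_configs:
  - job_name: ray-serve
    scrape_interval: 15s
    static_configs:
      - targets: ["<head-ip>:8080"]
```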
OpenTelemetry trace export is available via ray.init() environment variables for distributed tracing across deployments. For most serving use cases, the Ray Dashboard and Prometheus metrics are sufficient. Distributed traces add value when debugging latency spikes across a multi-stage pipeline where you need to isolate which deployment stage is the bottleneck.
For broader GPU observability, the GPU monitoring guide covers the full stack from nvidia-smi to production alerting.
Cost: Ray Serve on Spheron vs Managed Anyscale for a 70B Model
Running a 70B parameter model at 10,000 requests per hour with 512 output tokens per request.
A single H100 PCIe 80GB fits Llama 3.3 70B at FP8 with reduced max_model_len (e.g. 4096) and small batch sizes, delivering roughly 12,000 requests per hour at that output length. For unconstrained production throughput, two H100 80GB GPUs are the practical minimum.
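The "fits on a single H100" claim comes from back-of-envelope weight math, which also explains why max_model_len has to shrink. This ignores activation memory and framework overhead, so treat it as a rough bound:

```python
params_b = 70          # Llama 3.3 70B parameter count, in billions
bytes_per_param = 1.0  # FP8 stores one byte per weight
weights_gb = params_b * bytes_per_param            # ~70 GB of weights

hbm_gb = 80            # H100 80GB
usable_gb = hbm_gb * 0.90                          # gpu_memory_utilization=0.90
headroom_gb = usable_gb - weights_gb               # left over for KV cache
print(f"weights: {weights_gb:.0f} GB, KV-cache headroom: {headroom_gb:.0f} GB")
# weights: 70 GB, KV-cache headroom: 2 GB
```

Only a couple of gigabytes remain for KV cache, which is why short context limits and small batches are required on one GPU, and why two GPUs are the practical minimum for unconstrained throughput.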
| Configuration | GPU | Throughput (req/hr) | Cost/hr | Cost per 1M requests |
|---|---|---|---|---|
| Spheron bare-metal H100 + Ray Serve | 1x H100 PCIe | ~12k req/hr | $2.01/hr (on-demand) | ~$168 |
| Anyscale managed Ray | H100 (managed) | ~12k req/hr | ~3-4x Spheron rate | significantly higher |
Anyscale's managed platform adds a significant markup over bare-metal GPU rates. Their pricing includes cluster management, autoscaling automation, and the Anyscale platform overhead. For teams that want Ray without platform lock-in, bare-metal GPU cloud gives you the same Ray framework at infrastructure cost.
For teams on a tighter budget, A100 on Spheron handles Llama 3.1 8B and 13B models at $1.04/hr on-demand. At much higher throughput for smaller models, the cost per million requests still lands well below $100.
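The cost-per-million-requests figures in the table above reduce to one formula, shown here with the H100 numbers from the table (which reproduce the ~$168 entry):

```python
def cost_per_million(price_per_hr: float, req_per_hr: float) -> float:
    """Dollars per one million requests at a given hourly rate and throughput."""
    return price_per_hr / req_per_hr * 1_000_000

# H100 on-demand at $2.01/hr sustaining ~12k req/hr
print(f"${cost_per_million(2.01, 12_000):.2f} per 1M requests")
```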
Pricing fluctuates based on GPU availability. The prices above were checked on 22 Apr 2026 and may have changed. Check current GPU pricing for live rates.
Ray Serve on Spheron bare-metal GPUs gives you the Python-native flexibility of managed Anyscale with predictable pricing and no platform markup. Provision a head node and start serving today.
