KServe vs Seldon Core vs BentoML on GPU Cloud: Kubernetes ML Serving Guide (2026)

Teams evaluating ML serving operators spend more time on architecture than benchmarks. KServe, Seldon Core v2, and BentoML all serve models on Kubernetes. Each one optimizes for a different set of problems. KServe gives you the tightest CNCF alignment with Knative-native scale-to-zero. Seldon Core v2 handles multi-step inference pipelines with built-in drift detection. BentoML gives Python developers the fastest path from local development to a production Kubernetes deployment. Which one fits your stack depends on what you are actually building.

TL;DR

Feature	KServe	Seldon Core v2	BentoML + Yatai
Core abstraction	InferenceService CRD	Model + Pipeline CRD	Bento archive + BentoDeployment
Multi-model serving	Runtime-based isolation	MLServer (multiple models per process)	Per-Bento isolation
Scale-to-zero	Native (via Knative)	No (needs external config)	No
GPU sharing model	MIG + node selectors, DRA	Multi-model per GPU (MLServer), MIG	MPS/time-slicing at node level
Best-fit use case	LLM endpoints, CNCF-aligned orgs	Multi-step pipelines, drift monitoring	Python-first teams, quick iteration
CNCF status	Incubating	Not a CNCF project	Not a CNCF project

Why Kubernetes-Native ML Serving Operators Matter

Running a model as a plain Kubernetes Deployment works for prototypes. In production, it falls apart in predictable ways. A plain Deployment has no readiness logic beyond an HTTP 200 check, so a pod that has started its container but not yet warmed its CUDA context receives live traffic and fails silently. Traffic splitting for model versioning requires an Ingress configuration that has no awareness of ML-specific headers or batch semantics. Rollback means manual redeployment, with no record of which model version served which traffic window. Observability is whatever your application emits; there is no standard metric surface for tokens per second, queue depth, or time to first token (TTFT).

Kubernetes ML serving operators exist to fix that gap. They introduce CRDs that understand model serving semantics: version tracking, traffic splitting, runtime backend selection, VRAM-aware scheduling, and readiness probes that wait for model warm-up rather than just container health. The best operators also surface Prometheus metrics automatically and integrate with the GPU cluster's autoscaling layer.
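The difference between container health and model readiness can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how any of the three operators implements it; `ReadinessGate` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReadinessGate:
    """Tracks the two milestones a model-serving pod passes through."""

    container_started: bool = False
    model_warm: bool = False

    def mark_container_started(self) -> None:
        self.container_started = True

    def mark_model_warm(self) -> None:
        # Call only after weights are loaded and a warm-up request has succeeded.
        self.model_warm = True

    def liveness_status(self) -> int:
        # Liveness: the process is up, so it should not be restarted.
        return 200 if self.container_started else 503

    def readiness_status(self) -> int:
        # Readiness: accept traffic only once the model can actually answer.
        return 200 if (self.container_started and self.model_warm) else 503


gate = ReadinessGate()
gate.mark_container_started()
status_before_warmup = (gate.liveness_status(), gate.readiness_status())
gate.mark_model_warm()
status_after_warmup = (gate.liveness_status(), gate.readiness_status())
print(status_before_warmup, status_after_warmup)
```

A plain Deployment that probes only the liveness-style endpoint would start routing traffic at the first state, which is the silent-failure window described above.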

The failure modes without an operator are real. A static VRAM allocation based on peak load wastes 40-60% of GPU memory during off-peak hours when requests are bursty. No canary rollback means a bad model version affects 100% of users until a manual redeployment completes. No queue-depth autoscaling means either over-provisioning (expensive) or under-provisioning (latency spikes). These are not hypothetical; teams hit them at the 1,000-request-per-day scale, not just at hyperscaler volumes.

For the GPU cluster scheduling layer that sits underneath these operators, the Kubernetes GPU Orchestration 2026 guide covers DRA, KAI Scheduler, and Grove for handling resource allocation and scheduling.

Architecture Overview

KServe: InferenceService and Serverless Backends

KServe started as KFServing inside the Kubeflow project and became independent in 2021. It is now a CNCF Incubating project (promoted from Sandbox in September 2025). The central resource is InferenceService, a CRD that describes a model version, its runtime backend, storage location, and scaling behavior.

KServe supports two deployment modes.

Serverless mode uses Knative Serving as the underlying transport layer. Traffic flows through the Knative Activator, which buffers requests during scale-to-zero and routes them to warm pods. This is the right mode for endpoints that receive bursty or unpredictable traffic where the idle cost of a warm pod is not justified.

RawDeployment mode uses standard Kubernetes Deployments and Services. No Knative dependency. Scale-to-zero is not available, but there is also no Knative overhead in the request path. This is the right mode for high-throughput LLM endpoints that need predictable latency.

KServe's pluggable runtime model lets you point an InferenceService at a vLLM container, a Triton container, or a HuggingFace TGI container without changing the CRD spec. The ClusterServingRuntime resource defines available backends cluster-wide, and the InferenceService references the runtime by name.

The ModelCar pattern is worth understanding for large LLM deployment. Instead of pulling weights from remote storage (S3, GCS) at pod startup, ModelCar stores the model as an init container image. The init container runs once, copies weights to a shared volume, and exits. The serving container reads from the local volume. This eliminates the remote fetch on every cold start. For a 140 GB Llama 3 70B model, this difference is 4-6 minutes (remote NFS fetch at 400-600 MB/s) vs 40 seconds (local NVMe at 3-4 GB/s).
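The load-time figures above are plain division, and it can help to rerun them for your own model size and storage throughput. The sketch below assumes sustained throughput and ignores connection setup and decompression; `load_seconds` is a hypothetical helper name.

```python
def load_seconds(model_gb: float, throughput_gb_per_s: float) -> float:
    """Time to stream model_gb of weights at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return model_gb / throughput_gb_per_s


MODEL_GB = 140.0  # Llama 3 70B in BF16, as stated in the text

# Remote fetch at 400-600 MB/s
remote_slow = load_seconds(MODEL_GB, 0.4)
remote_fast = load_seconds(MODEL_GB, 0.6)

# Local NVMe at 3-4 GB/s
nvme_slow = load_seconds(MODEL_GB, 3.0)
nvme_fast = load_seconds(MODEL_GB, 4.0)

print(f"remote: {remote_fast / 60:.1f}-{remote_slow / 60:.1f} min")
print(f"local NVMe: {nvme_fast:.0f}-{nvme_slow:.0f} s")
```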

Seldon Core v2: Inference Pipelines and MLServer

Seldon Core v2 is a complete rewrite from v1, not an incremental update. It switches the core abstraction from a single model endpoint to an inference pipeline. The two main CRDs are Model and Pipeline.

A Model CRD defines a single model loaded into a server process. A Pipeline CRD wires multiple models together in a directed acyclic graph: input goes to a preprocessor, then to a main model, then optionally to an explainer, then to an output. This DAG topology is first-class in the operator, not an afterthought.

The native server is MLServer, an open-source multi-model server that runs multiple ML models in a single process. MLServer supports scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and HuggingFace runtimes. For GPU workloads, a single MLServer instance can load and serve multiple smaller models (7B, 13B) on one GPU, sharing GPU memory between them. This is relevant for teams running a portfolio of smaller models rather than one large one.

Kafka integration is a distinguishing feature. Seldon Core v2 supports async inference over Kafka topics natively. A model can be configured to consume from a Kafka input topic and write predictions to a Kafka output topic. This unlocks event-driven inference patterns that synchronous HTTP serving cannot express.

Alibi Detect integration is built in. Alibi Detect is Seldon's open-source library for outlier detection, adversarial detection, and concept drift monitoring. With Seldon Core v2, you add a drift detector as a node in the pipeline graph and it runs inline with inference requests without a separate deployment.

BentoML + Yatai: Python-Native Packaging to Kubernetes

BentoML is a model packaging framework. The core concept is a Bento: a self-contained archive containing the model weights, the serving code, the Python dependencies, and the runtime configuration. A Bento built on your laptop runs identically in Kubernetes, because the full serving environment is captured in the archive.

The serving code is written in Python with decorators:

python

import bentoml
from openllm import LLM

@bentoml.service(
    resources={
        "gpu": 2,
        "gpu_type": "nvidia-h100-80gb",
        "memory": "200Gi",
    },
    traffic={"timeout": 300},
)
class LlamaService:
    def __init__(self):
        # Load the weights when a worker starts, not at import time, so that
        # importing this module (for example during `bentoml build`) stays cheap.
        self.llm = LLM("meta-llama/Meta-Llama-3-70B-Instruct")

    @bentoml.api
    async def generate(self, prompt: str) -> str:
        return await self.llm.generate(prompt)

You run bentoml serve locally and the same code runs in Kubernetes via Yatai. No YAML configuration changes. No runtime class configuration. The Python class is the configuration.

Yatai is the Kubernetes operator that receives pushed Bentos from the BentoML CLI and deploys them as Kubernetes workloads. It manages the container build process (converting a Bento to a Docker image), the image registry, and the BentoDeployment lifecycle. The operator handles scaling, rolling updates, and integration with Kubernetes Ingress for traffic routing.

One caveat worth knowing: Yatai's last tagged release is from October 2023, and the repository README notes that Yatai for BentoML 1.2 is currently under construction. For teams self-hosting on Kubernetes, Yatai still works but is not actively updated. BentoML's current first-party deployment path for teams that want a maintained managed experience is BentoCloud. If you are evaluating BentoML for production Kubernetes today, treat Yatai as a stable-but-not-evolving option and factor in the possibility of maintenance gaps when planning long-term.

Multi-Model Serving and GPU Sharing

For teams running many models simultaneously rather than one large LLM, GPU sharing is the key cost lever.

Operator	MIG support	Time-slicing support	MPS support	Multi-model per process	VRAM isolation
KServe	Yes (node selector + DRA)	Via node config	Via node config	No (one model per InferenceService)	Full (per pod)
Seldon Core v2 + MLServer	Yes	Via node config	Via node config	Yes (MLServer multi-model)	Shared (within MLServer process)
BentoML + Yatai	Yes (node selector)	Via node config	Via node config	No (one model per Bento)	Full (per pod)

The practical difference: if you run 10 models averaging 4 GB VRAM each on an 80 GB H100, MLServer can pack them into a single H100 GPU process. KServe and BentoML each require a separate pod and GPU allocation per model, which either wastes GPU memory or requires explicit MIG partitioning. For the serving-level techniques that maximize throughput on that GPU (continuous batching, paged attention), see the LLM serving optimization guide.

The trade-off with MLServer multi-model serving is VRAM isolation. A runaway model inference call that allocates more memory than expected can OOMKill the entire MLServer process, taking all co-located models down. With per-pod isolation (KServe, BentoML), a crashed model pod does not affect other models.
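The packing difference can be estimated with a small calculation. The sketch below is a rough capacity model, not a scheduler: it counts only model weights and ignores CUDA context and activation overhead. The function names are hypothetical, and the `pods_per_gpu=7` case assumes an H100 split into seven MIG slices.

```python
import math
from typing import List


def gpus_if_shared(model_gb: List[float], gpu_gb: float) -> int:
    """First-fit-decreasing packing of models into shared GPUs."""
    free: List[float] = []  # remaining VRAM on each GPU already in use
    for size in sorted(model_gb, reverse=True):
        if size > gpu_gb:
            raise ValueError(f"a {size} GB model does not fit a {gpu_gb} GB GPU")
        for i, remaining in enumerate(free):
            if size <= remaining:
                free[i] = remaining - size
                break
        else:
            # No GPU in use has room, so open a new one.
            free.append(gpu_gb - size)
    return len(free)


def gpus_if_one_pod_per_model(model_count: int, pods_per_gpu: int = 1) -> int:
    """Whole-GPU pods by default; pods_per_gpu > 1 models MIG-style slices."""
    return math.ceil(model_count / pods_per_gpu)


models = [4.0] * 10  # ten models averaging 4 GB each, as in the text
shared = gpus_if_shared(models, gpu_gb=80.0)
whole = gpus_if_one_pod_per_model(len(models))
mig = gpus_if_one_pod_per_model(len(models), pods_per_gpu=7)
print(shared, whole, mig)
```

Under these assumptions the shared process needs one GPU, whole-GPU pods need ten, and MIG-partitioned pods need two. The isolation trade-off described above is the price of the first number.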

For MIG and time-slicing details at the node level, see the fractional GPU inference guide.

Autoscaling and Scale-to-Zero

Operator	KEDA integration	Scale-to-zero	Cold start (70B, H100)	HPA
KServe (Serverless)	Yes	Native via Knative	2-8 min	Yes (via Knative)
KServe (RawDeployment)	Yes	With KEDA-HTTP	2-8 min	Yes
Seldon Core v2	Yes	No (needs external config)	N/A	Yes
BentoML + Yatai	Limited	No	N/A	Yes

KServe in Serverless mode gives the most complete autoscaling story. The Knative Activator buffers requests while scale-up completes, so callers do not see connection resets when a pod is cold. KEDA handles the metric triggers (queue depth, GPU utilization, custom Prometheus metrics), and Knative handles the routing and scale-to-zero lifecycle.
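As a sketch of what a queue-depth trigger looks like, the helper below builds a KEDA ScaledObject manifest with a Prometheus trigger and prints it as JSON, which `kubectl apply -f` accepts. Several values are assumptions to replace with your own: the target Deployment name, the Prometheus address, the threshold, and the query, which assumes vLLM's `vllm:num_requests_waiting` gauge is being scraped.

```python
import json


def queue_depth_scaled_object(
    name: str,
    namespace: str,
    target_deployment: str,
    prometheus_url: str,
    query: str,
    threshold: int,
    min_replicas: int = 0,
    max_replicas: int = 4,
) -> dict:
    """Build a KEDA ScaledObject that scales a Deployment on a Prometheus query."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": target_deployment},
            "minReplicaCount": min_replicas,  # 0 enables scale-to-zero
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    # KEDA expects trigger metadata values as strings.
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        "threshold": str(threshold),
                    },
                }
            ],
        },
    }


manifest = queue_depth_scaled_object(
    name="llama-3-70b-queue-depth",
    namespace="inference",
    target_deployment="llama-3-70b-predictor",  # assumed Deployment name
    prometheus_url="http://prometheus.monitoring.svc:9090",  # hypothetical address
    query='sum(vllm:num_requests_waiting{model_name="llama-3-70b"})',
    threshold=8,  # illustrative waiting-request target per replica
)
print(json.dumps(manifest, indent=2))
```

KEDA on its own does not hold requests while replicas are at zero, which is why the request buffering described above matters for scale-to-zero endpoints.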

The cold start caveat matters for 70B models specifically. A 140 GB BF16 model load from local NVMe on an H100 takes 40-50 seconds. CUDA graph capture adds another 25-30 seconds. Even with CRIU checkpoint restore, reaching a warm serving state takes over a minute from a fully cold pod. Scale-to-zero works economically for 70B endpoints only when idle periods are long enough (1+ hours) to justify the user-facing latency on re-wake.
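The "over a minute" figure follows from adding the two stage ranges quoted above. The sketch below sums them; it deliberately excludes pod scheduling and image pull, which the paragraph does not quantify, so treat the result as a lower bound. `cold_start_budget` is a hypothetical helper.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (best case, worst case) in seconds


def cold_start_budget(stages: Dict[str, Range]) -> Range:
    """Sum per-stage best and worst cases into an overall range."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return (low, high)


stages = {
    "weight_load_from_nvme": (40.0, 50.0),
    "cuda_graph_capture": (25.0, 30.0),
}
budget = cold_start_budget(stages)
print(f"warm serving state after {budget[0]:.0f}-{budget[1]:.0f} s")
```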

For detailed KEDA trigger configuration and Knative setup, see the KEDA and Knative GPU autoscaling guide, which covers queue-depth triggers, scale-to-zero, and CRIU checkpoint restore for sub-5-second cold starts on vLLM workloads.

Canary, A/B, and Shadow Deployments

Traffic splitting for gradual rollouts is where the three operators differ most in their developer experience.

KServe handles canary via the canaryTrafficPercent field in InferenceService:

yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b
  namespace: inference
spec:
  predictor:
    canaryTrafficPercent: 10
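    containers:
    - name: kserve-container
      image: vllm/vllm-openai:latest
      args:
        - --model
        - /mnt/models/llama-3-70b-instruct
        - --tensor-parallel-size
        - "2"
      resources:
        limits:
          nvidia.com/gpu: "2"
      # Mount the PVC so the --model path above exists inside the container.
      volumeMounts:
      - name: model-storage
        mountPath: /mnt/models
        readOnly: true
    volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: llama-70b-pvc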

Set canaryTrafficPercent: 10 and KServe routes 10% of traffic to the new predictor revision and 90% to the previous stable one. Promote by increasing the value to 100. Roll back by setting it to 0. The Knative traffic layer handles the split, so this workflow applies to Serverless mode.

Seldon Core v2 uses Pipeline routing with weighted splits. You define two Model CRDs (old version, new version) and wire them into a Pipeline with an explicit routing node:

yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: llama-canary
spec:
  steps:
    - name: router
      inputs:
        - pipeline.inputs
    - name: llama-v1
      inputs:
        - router.outputs.v1
    - name: llama-v2
      inputs:
        - router.outputs.v2
  output:
    steps:
      - llama-v1
      - llama-v2
    # Only one branch runs per request, so return whichever step produces output.
    stepsJoin: any

The router node handles the traffic split logic. This gives more flexibility (conditional routing based on request headers, shadow mode that sends a copy to the new model without affecting the response) but requires more YAML.
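The split logic inside a router step can be as simple as a deterministic hash bucket. The sketch below is one common way to implement a weighted split, not Seldon's implementation; `choose_route` and the `"v1"`/`"v2"` labels are hypothetical and simply mirror the output names in the Pipeline above.

```python
import hashlib


def choose_route(request_id: str, canary_percent: int) -> str:
    """Map a request ID to 'v1' or 'v2' so the same ID always takes the same route."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "v2" if bucket < canary_percent else "v1"


# Roughly 10% of a large set of request IDs should land on the canary.
routes = [choose_route(f"req-{i}", 10) for i in range(10_000)]
canary_share = routes.count("v2") / len(routes)
print(f"canary share: {canary_share:.3f}")
```

Hashing the request ID rather than drawing a random number keeps retries of the same request on the same model version, which makes canary comparisons easier to read.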

BentoML + Yatai relies on Kubernetes Ingress-level traffic splitting. You deploy two BentoDeployments (old, new) and configure your Ingress controller (nginx, Istio, Envoy Gateway) to split traffic between them. No operator-native traffic management. You get the full flexibility of your Ingress controller but no built-in promotion or rollback workflow.

GPU Support Matrix

All three operators run on standard NVIDIA GPU-backed Kubernetes nodes. The differences are in how they expose GPU hardware features.

Feature	KServe	Seldon Core v2	BentoML + Yatai
NVIDIA GPU Operator	Required	Required	Required
MIG support	Yes (node selector + DRA v1beta2)	Yes (node selector)	Yes (node selector)
Time-slicing	Via NVIDIA device plugin	Via NVIDIA device plugin	Via NVIDIA device plugin
RDMA/InfiniBand	Yes (pod annotation)	Yes (pod annotation)	Yes (pod annotation)
H100 SXM5 tested	Yes	Yes	Yes
Multi-node serving	Via Torchrun sidecar (experimental)	No	Via Ray backend (experimental)

All three run on bare-metal GPU nodes without vendor-specific hooks or proprietary cloud APIs. The only node-level requirements are the NVIDIA driver, the NVIDIA container toolkit, and the device plugin, which the NVIDIA GPU Operator installs together.

Observability: Metrics, Tracing, and Payload Logging

Observability gaps are the most common reason teams switch operators post-deployment. A framework that does not export the right metrics requires instrumenting your own serving code, which defeats the purpose of using an operator.

Feature	KServe	Seldon Core v2	BentoML + Yatai
Prometheus endpoint	Yes (per InferenceService)	Yes (MLServer + Seldon metrics)	Yes (BentoML metrics)
Token/sec metric	Via vLLM metrics	Via MLServer (limited)	Via OpenLLM metrics
Queue depth metric	Yes (Knative queue metrics)	Yes (MLServer queue)	Limited
OpenTelemetry traces	Yes (via OTLP sidecar)	Yes (built-in)	Yes (manual instrumentation)
Payload capture	Via explainer hooks	Yes (Seldon payload logging)	Via custom middleware
Grafana dashboard	Community dashboards available	Seldon-provided dashboard	Manual

KServe's queue depth metric (via Knative) is the most useful for autoscaling trigger configuration. MLServer in Seldon Core v2 exports a rich metrics surface for multi-model deployments. BentoML's metrics coverage is adequate for single-model endpoints but weaker for fleet-level visibility.
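When a runtime does not export token-level metrics, the two most useful ones can be derived from per-token timestamps. The sketch below is a minimal illustration; `ttft_and_decode_rate` is a hypothetical helper, and it assumes you can record the arrival time of each streamed token.

```python
from typing import List, Tuple


def ttft_and_decode_rate(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (time to first token in seconds, decode tokens per second)."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    # Decode rate covers tokens after the first, which belongs to prefill.
    decode_window = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    rate = decode_tokens / decode_window if decode_window > 0 else 0.0
    return ttft, rate


# Request sent at t=0.0 s; five tokens arrive between 0.5 s and 1.5 s.
ttft, tokens_per_second = ttft_and_decode_rate(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tokens_per_second)
```

Separating TTFT from decode rate matters because queueing and prefill drive the first number, while batch size and memory bandwidth drive the second.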

Payload logging (capturing input/output payloads for audit, debugging, or dataset collection) is an area where Seldon Core v2 has the strongest built-in support. Payload logging is compliance-relevant for regulated industries and Seldon handles it natively at the operator level.

For production SLO definitions and latency budgeting on top of these metrics, see the LLM Inference SLO engineering guide.

Step-by-Step: Deploy Llama 3 70B on Spheron H100 Nodes

Llama 3 70B in BF16 needs about 140 GB of GPU VRAM. Two H100 SXM5 80 GB GPUs cover this with room for KV cache. All three deployments below use Spheron H100 instances with 2x GPU per replica. The Helm commands and kubectl sequences below were verified on Kubernetes 1.30 with the NVIDIA device plugin v0.15.
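The sizing above is simple arithmetic: parameter count times bytes per parameter, compared against total VRAM. The sketch below reproduces it. It models weights only, so the headroom it reports is an upper bound on KV cache space; CUDA context, activations, and any memory-utilization cap in the serving runtime reduce it further. The function names are hypothetical.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: int) -> float:
    """Weights-only footprint: 1e9 params x bytes per param = GB."""
    return params_billion * bytes_per_param


def headroom_gb(
    params_billion: float,
    bytes_per_param: int,
    gpu_count: int,
    gpu_vram_gb: float,
) -> float:
    """VRAM left after weights, before runtime overhead and KV cache."""
    return gpu_count * gpu_vram_gb - weight_vram_gb(params_billion, bytes_per_param)


weights = weight_vram_gb(70, 2)  # BF16 stores 2 bytes per parameter
spare = headroom_gb(70, 2, gpu_count=2, gpu_vram_gb=80.0)
print(f"weights: {weights} GB, headroom on 2x H100 80 GB: {spare} GB")
```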

Option A: KServe InferenceService with vLLM Runtime

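Install the KServe controller and its cluster resources. These manifests do not install Knative, which is fine here because the InferenceService below runs in RawDeployment mode: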

bash

kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml

Define the ClusterServingRuntime for vLLM:

yaml

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-vllm-runtime
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: vllm
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.5.0
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args:
        - --port
        - "8080"
        - --model
        - /mnt/models
        - --served-model-name
        - llama-3-70b
      resources:
        requests:
          cpu: "8"
          memory: "64Gi"
        limits:
          nvidia.com/gpu: "2"

Create the InferenceService:

yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
  namespace: inference
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      storageUri: pvc://llama-70b-pvc
      runtime: kserve-vllm-runtime
      args:
        - --tensor-parallel-size
        - "2"
      resources:
        limits:
          nvidia.com/gpu: "2"

Apply and verify:

bash

kubectl apply -f inferenceservice.yaml -n inference
kubectl get inferenceservice llama-3-70b -n inference
# Wait for READY: True

Option B: Seldon Core v2 Model and Pipeline CRD

Install Seldon Core v2:

bash

helm repo add seldon https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core-v2 seldon/seldon-core-v2 \
  --namespace seldon-system \
  --create-namespace \
  --set controller.clusterwide=true

Create the Model CRD for Llama 3 70B:

yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-3-70b
  namespace: inference
spec:
  storageUri: "pvc://llama-70b-pvc"
  requirements:
    - llama
  memory: 160000000000
  scaling:
    replicas: 1
    minReplicas: 1
    maxReplicas: 4

For a multi-step pipeline adding a prompt sanitizer before the LLM:

yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: llama-with-sanitizer
  namespace: inference
spec:
  steps:
    - name: sanitizer
      inputs:
        - pipeline.inputs
    - name: llama-3-70b
      inputs:
        - sanitizer.outputs
  output:
    steps:
      - llama-3-70b

Apply and check:

bash

kubectl apply -f model.yaml -n inference
kubectl apply -f pipeline.yaml -n inference
kubectl get model llama-3-70b -n inference

Option C: BentoML Service and Yatai Deployment

Define the service in Python:

python

# service.py
import bentoml
from openllm import LLM

@bentoml.service(
    resources={
        "gpu": 2,
        "gpu_type": "nvidia-h100-80gb",
        "memory": "200Gi",
    },
    traffic={"timeout": 300},
)
class LlamaService:
    def __init__(self):
        # Load the weights when a worker starts, not at import time, so that
        # importing this module (for example during `bentoml build`) stays cheap.
        self.llm = LLM("meta-llama/Meta-Llama-3-70B-Instruct")

    @bentoml.api
    async def generate(self, prompt: str) -> str:
        return await self.llm.generate(prompt)

Build and push to Yatai:

bash

# Build the Bento archive
bentoml build service:LlamaService

# Push to the Yatai registry (requires a prior login to the Yatai endpoint)
bentoml push llama3-70b-service:latest

Create the BentoDeployment:

yaml

apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: llama3-70b
  namespace: inference
spec:
  bento: llama3-70b-service:latest
  replicas: 1
  resources:
    requests:
      cpu: "8"
      memory: "64Gi"
    limits:
      nvidia.com/gpu: "2"
  autoscaling:
    minReplicas: 1
    maxReplicas: 4

Cost Comparison: Spheron H100 vs Hyperscaler Managed Kubernetes

For a 2x H100 SXM5 inference deployment running 24/7:

Provider	Config	Hourly rate	Monthly (720 hr)
Spheron	2x H100 SXM5 on-demand	$6.24/hr	$4,493
Spheron	2x H100 SXM5 spot	$2.28/hr	$1,642
AWS EKS	p5.48xlarge (8x H100 SXM5, sold only as 8-GPU bundle)	~$98/hr	~$70,560
GCP GKE	a3-highgpu-8g (8x H100, sold only as 8-GPU bundle; per-GPU rate x2)	~$21.50/hr	~$15,480

Spheron H100 SXM5 spot pricing runs at roughly one-third of on-demand ($1.14/hr per GPU vs $3.12/hr per GPU), so spot is the right target for interruption-tolerant batch or async inference. H200 spot is also available at a meaningful discount: $1.40/hr per GPU vs $4.62/hr on-demand, making H200 the better spot target for workloads running large context windows that benefit from the extra VRAM.
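The monthly figures in the table are the hourly rate multiplied by 720 hours. The sketch below reproduces them so you can substitute current rates; the hourly values are the ones quoted above, and the dictionary keys are hypothetical labels.

```python
HOURS_PER_MONTH = 720


def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running continuously for the given number of hours."""
    return round(hourly_rate * hours, 2)


hourly_rates = {
    "spheron_2x_h100_on_demand": 6.24,
    "spheron_2x_h100_spot": 2.28,
    "aws_p5_48xlarge_8x_h100": 98.00,
    "gcp_a3_per_gpu_rate_x2": 21.50,
}
monthly = {name: monthly_cost(rate) for name, rate in hourly_rates.items()}

# Spot as a fraction of on-demand, per GPU
h100_spot_ratio = 1.14 / 3.12
h200_spot_ratio = 1.40 / 4.62

for name, cost in monthly.items():
    print(f"{name}: ${cost:,.2f}/month")
print(f"H100 spot/on-demand: {h100_spot_ratio:.2f}, H200: {h200_spot_ratio:.2f}")
```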

AWS does not offer a standalone 2x H100 instance. The closest option is a full p5.48xlarge (8x H100) at approximately $98/hr, which you share across the team or pay for in full. This makes the effective per-GPU cost significantly higher than the per-GPU rate implies when you cannot fill the node. GCP's A3 pricing applies similarly.

Pricing fluctuates based on GPU availability. The prices above are based on 28 May 2026 and may have changed. Check current GPU pricing for live rates.

Decision Matrix: When to Use Which Operator

Use case	KServe	Seldon Core v2	BentoML + Yatai
Small team, rapid prototyping	Acceptable	Avoid	Recommended
Multi-step inference pipeline (preprocess, predict, explain)	Avoid	Recommended	Avoid
LLM-as-a-service endpoint	Recommended	Acceptable	Acceptable
Event-driven async inference	Avoid	Recommended	Avoid
Strict CNCF governance requirements	Recommended	Avoid	Avoid
Python-first ML team	Acceptable	Avoid	Recommended
Drift monitoring and explainability	Acceptable (plugin)	Recommended	Avoid
Scale-to-zero required	Recommended	Avoid	Avoid

The single most common wrong choice: teams with a Python-first ML culture selecting KServe because it is the "standard" CNCF operator, then spending weeks debugging InferenceService YAML and Knative routing rules that their data scientists cannot understand or modify. BentoML exists precisely for teams that write Python and want their infrastructure to reflect that.

Conversely, teams building production inference APIs for regulated industries (healthcare, finance) where payload logging, drift detection, and explainability are requirements almost always end up at Seldon Core v2. KServe can do these things through plugin integration, but Seldon builds them in.

Migration Paths and Backend Bridges

KServe to Seldon Core v2: Map InferenceService.spec.predictor to a Seldon Model CRD. The serving runtime container stays the same; only the CRD wrapping changes. The main work is converting traffic splitting logic (canaryTrafficPercent) to Seldon Pipeline routing nodes.

BentoML to KServe: A Bento built by BentoML produces a Docker image. That image can be referenced directly as a custom runtime container in a KServe InferenceService. The migration is primarily moving serving configuration from Python decorators to InferenceService YAML.

KServe to Ray Serve: KServe's InferenceGraph can route to external Ray Serve endpoints. This lets you run stateful multi-actor serving logic in Ray Serve while using KServe for traffic management and observability. See the Ray Serve GPU cloud LLM deployment guide for the Ray Serve setup.

KServe with Triton backend: KServe has a first-class ClusterServingRuntime for NVIDIA Triton. The InferenceService spec stays the same; only the container changes to nvcr.io/nvidia/tritonserver. See the Triton Inference Server deployment guide for model repository setup.

KServe, Seldon Core, and BentoML all run on Spheron's bare-metal GPU nodes without vendor lock-in. Spin up H100 or H200 capacity in under two minutes and run any operator against your own cluster.

Quick Setup Guide

Install KServe on a GPU Kubernetes cluster
Install the KServe Helm chart with the cluster-serverless profile. Configure the InferenceService CRD with a vLLM runtime container and an H100 GPU resource limit. Point the storage URI to your model bucket or use a PVC with pre-pulled weights.
Deploy Llama 3 70B using Seldon Core v2 with MLServer
Install Seldon Core v2 via Helm with the MLServer runtime enabled. Create a Model CRD referencing your Llama 3 70B weights and set the GPU request to 2 x H100 SXM5 for adequate VRAM. Create a Pipeline CRD if you need multi-step inference with pre/post-processing.
Package and deploy Llama 3 70B as a BentoML service with Yatai
Write a Python service class with the @bentoml.service decorator, define the LLM endpoint using OpenLLM, and run bentoml build to create the Bento archive. Push the Bento to the Yatai registry, then create a BentoDeployment that requests 2 x H100 GPUs per replica.
Configure KEDA autoscaling for queue-depth-based scaling
For KServe, install KEDA and configure a ScaledObject targeting the InferenceService queue depth metric. Set minReplicaCount to 0 for scale-to-zero behavior. For Seldon Core v2, create a KEDA ScaledObject against the Prometheus queue-depth metric exported by MLServer.

Frequently Asked Questions

What is KServe and when should I use it?

KServe is a CNCF Incubating project that standardizes ML model serving on Kubernetes via the InferenceService CRD. It handles autoscaling (including scale-to-zero via Knative), canary rollouts, and pluggable runtime backends (vLLM, Triton, HuggingFace TGI). Use KServe when you want a spec-driven, CNCF-aligned operator that integrates tightly with Knative for serverless inference and supports the ModelCar container pattern for large model loading.

What is Seldon Core v2 and how does it differ from KServe?

Seldon Core v2 introduces the Inference Pipeline model: you compose multi-step inference graphs (route, transform, explain, monitor) by wiring together individual inference nodes in a DAG. The core abstractions are the Model and Pipeline CRDs instead of InferenceService. It ships with a native multi-model server (MLServer) that runs multiple models in a single process on one GPU, and full Kafka-based async pipeline support. Use Seldon Core v2 when you need stateful multi-step pipelines, built-in drift detection via Alibi Detect, or async inference over message queues.

What is BentoML and Yatai?

BentoML is a Python-native framework for packaging ML models as self-contained Bento archives (model + code + dependencies). Yatai is the Kubernetes operator that deploys Bentos as scalable inference services. The differentiation is developer ergonomics: you define your serving logic in Python with decorators, run it locally identically to how it runs in production, then push the Bento to Yatai. Use BentoML + Yatai when your team writes custom pre/post-processing logic, needs first-class LLM batching via OpenLLM, or wants to iterate quickly without YAML-heavy operator configuration.

Do KServe, Seldon Core, and BentoML support fractional GPU and MIG?

KServe supports fractional GPU via Kubernetes resource requests and can use NVIDIA MIG with node selectors or DRA resource claims. Seldon Core v2 with MLServer supports multi-model serving on a single GPU, enabling de facto GPU sharing across models. BentoML exposes GPU resources via standard Kubernetes resource limits; fractional GPU requires NVIDIA MPS or time-slicing configured at the node level, with resource requests passed through. All three operators work with NVIDIA MIG when the cluster has MIG partitions configured.

Which ML serving operator has the best autoscaling for LLM workloads?

KServe has the most mature Kubernetes-native autoscaling via KEDA and Knative: it supports request-based, custom-metric (queue depth, GPU utilization), and scale-to-zero with Knative Serving. Seldon Core v2 uses HPA and KEDA but lacks scale-to-zero in the core operator; you need additional configuration. BentoML + Yatai uses HPA with custom metrics. For cold-start-sensitive LLM workloads, KServe with Knative is the most complete solution, but cold start times for 70B models still exceed 2 minutes regardless of operator.