Teams evaluating ML serving operators spend more time on architecture than benchmarks. KServe, Seldon Core v2, and BentoML all serve models on Kubernetes. Each one optimizes for a different set of problems. KServe gives you the tightest CNCF alignment with Knative-native scale-to-zero. Seldon Core v2 handles multi-step inference pipelines with built-in drift detection. BentoML gives Python developers the fastest path from local development to a production Kubernetes deployment. Which one fits your stack depends on what you are actually building.
TL;DR
| | KServe | Seldon Core v2 | BentoML + Yatai |
|---|---|---|---|
| Core abstraction | InferenceService CRD | Model + Pipeline CRD | Bento archive + BentoDeployment |
| Multi-model serving | Runtime-based isolation | MLServer (multiple models per process) | Per-Bento isolation |
| Scale-to-zero | Native (via Knative) | No (needs external config) | No |
| GPU sharing model | MIG + node selectors, DRA | Multi-model per GPU (MLServer), MIG | MPS/time-slicing at node level |
| Best-fit use case | LLM endpoints, CNCF-aligned orgs | Multi-step pipelines, drift monitoring | Python-first teams, quick iteration |
| CNCF status | Incubating | Not a CNCF project | Not a CNCF project |
Why Kubernetes-Native ML Serving Operators Matter
Running a model as a plain Kubernetes Deployment works for prototypes. For production, it falls apart in predictable ways. You get no readiness logic beyond an HTTP 200 check, so a pod whose container has started but whose CUDA context is not yet warm receives live traffic and fails silently. Traffic splitting for model versioning requires an Ingress configuration that has no awareness of ML-specific headers or batch semantics. Rollback means manual redeployment with no state tracking of which model version served which traffic window. Observability is whatever your application emits; there is no standard metric surface for tokens per second, queue depth, or time to first token (TTFT).
Kubernetes ML serving operators exist to fix that gap. They introduce CRDs that understand model serving semantics: version tracking, traffic splitting, runtime backend selection, VRAM-aware scheduling, and readiness probes that wait for model warm-up rather than just container health. The best operators also surface Prometheus metrics automatically and integrate with the GPU cluster's autoscaling layer.
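To make the readiness point concrete, here is a minimal sketch of a probe handler that reports ready only after warm-up. The `ModelServer` class and its method names are hypothetical stand-ins for whatever a serving runtime does internally; this is not any operator's actual API.

```python
class ModelServer:
    """Toy server whose readiness depends on model warm-up, not process start."""

    def __init__(self) -> None:
        self.warm = False  # container is up, model is not yet usable

    def warm_up(self) -> None:
        # Placeholder for loading weights and initialising the CUDA context.
        self.warm = True

    def liveness(self) -> int:
        # The process is running, so a plain HTTP check already passes here.
        return 200

    def readiness(self) -> int:
        # 503 keeps the pod out of rotation until warm-up has finished.
        return 200 if self.warm else 503


server = ModelServer()
before = server.readiness()
server.warm_up()
after = server.readiness()
print(before, after)
```

A plain Deployment that only checks the equivalent of `liveness()` would start routing traffic at the `before` stage, which is the silent-failure window described above.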
The failure modes without an operator are real. A static VRAM allocation based on peak load wastes 40-60% of GPU memory during off-peak hours when requests are bursty. No canary rollback means a bad model version affects 100% of users until a manual redeployment completes. No queue-depth autoscaling means either over-provisioning (expensive) or under-provisioning (latency spikes). These are not hypothetical; teams hit them at the 1,000-request-per-day scale, not just at hyperscaler volumes.
For the GPU cluster scheduling layer that sits underneath these operators, the Kubernetes GPU Orchestration 2026 guide covers DRA, KAI Scheduler, and Grove for handling resource allocation and scheduling.
Architecture Overview
KServe: InferenceService and Serverless Backends
KServe started as KFServing inside the Kubeflow project and became independent in 2021. It is now a CNCF Incubating project (promoted from Sandbox in September 2025). The central resource is InferenceService, a CRD that describes a model version, its runtime backend, storage location, and scaling behavior.
KServe supports two deployment modes.
Serverless mode uses Knative Serving as the underlying transport layer. Traffic flows through the Knative Activator, which buffers requests during scale-to-zero and routes them to warm pods. This is the right mode for endpoints that receive bursty or unpredictable traffic where the idle cost of a warm pod is not justified.
RawDeployment mode uses standard Kubernetes Deployments and Services. No Knative dependency. Scale-to-zero is not available, but there is also no Knative overhead in the request path. This is the right mode for high-throughput LLM endpoints that need predictable latency.
KServe's pluggable runtime model lets you point an InferenceService at a vLLM container, a Triton container, or a HuggingFace TGI container without changing the CRD spec. The ClusterServingRuntime resource defines available backends cluster-wide, and the InferenceService references the runtime by name.
The ModelCar pattern is worth understanding for large LLM deployment. Instead of pulling weights from remote storage (S3, GCS) at pod startup, ModelCar stores the model as an init container image. The init container runs once, copies weights to a shared volume, and exits. The serving container reads from the local volume. This eliminates the remote fetch on every cold start. For a 140 GB Llama 3 70B model, the difference is roughly 4-6 minutes (remote fetch at 400-600 MB/s) versus about 40 seconds (local NVMe at 3-4 GB/s).
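The arithmetic behind those figures can be checked directly. This is a rough sketch that assumes the load is purely bandwidth-bound and uses decimal units (1 GB = 1000 MB); `load_seconds` is an illustrative helper, not part of KServe.

```python
def load_seconds(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream model weights, assuming the transfer is bandwidth-bound."""
    return model_gb / bandwidth_gb_per_s


MODEL_GB = 140  # Llama 3 70B in BF16, per the text above

# Remote fetch at 0.4-0.6 GB/s versus local NVMe at 3-4 GB/s (fast end first).
remote = [load_seconds(MODEL_GB, bw) for bw in (0.6, 0.4)]
local = [load_seconds(MODEL_GB, bw) for bw in (4.0, 3.0)]

print(f"remote: {remote[0] / 60:.1f}-{remote[1] / 60:.1f} min")
print(f"local:  {local[0]:.0f}-{local[1]:.0f} s")
```

Real cold starts add scheduling, image pull, and CUDA initialisation on top of the raw transfer, so treat these numbers as a lower bound.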
Seldon Core v2: Inference Pipelines and MLServer
Seldon Core v2 is a complete rewrite from v1, not an incremental update. It switches the core abstraction from a single model endpoint to an inference pipeline. The two main CRDs are Model and Pipeline.
A Model CRD defines a single model loaded into a server process. A Pipeline CRD wires multiple models together in a directed acyclic graph: input goes to a preprocessor, then to a main model, then optionally to an explainer, then to an output. This DAG topology is first-class in the operator, not an afterthought.
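The DAG idea can be sketched in a few lines of plain Python. This is only an illustration of the topology, using the standard-library `graphlib`; the step names and the toy lambda "models" are hypothetical, and Seldon's own scheduler and data plane work differently.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (its upstream inputs).
pipeline = {
    "preprocessor": set(),
    "model": {"preprocessor"},
    "explainer": {"model"},
}

# Toy stand-ins for the real models behind each step.
steps = {
    "preprocessor": lambda x: x.strip().lower(),
    "model": lambda x: f"prediction({x})",
    "explainer": lambda x: f"explanation-of[{x}]",
}


def run_pipeline(payload: str) -> dict:
    """Run every step in dependency order and return all step outputs."""
    outputs = {}
    for name in TopologicalSorter(pipeline).static_order():
        deps = pipeline[name]
        # Each step in this sketch has at most one upstream step.
        upstream = outputs[next(iter(deps))] if deps else payload
        outputs[name] = steps[name](upstream)
    return outputs


result = run_pipeline("  Hello ")
print(result["explainer"])
```

The point of making the graph first-class is that the operator, rather than application code like this, owns the ordering and the wiring between steps.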
The native server is MLServer, an open-source multi-model server that runs multiple ML models in a single process. MLServer supports scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow, and HuggingFace runtimes. For GPU workloads, a single MLServer instance can load and serve multiple smaller models (7B, 13B) on one GPU, sharing GPU memory between them. This is relevant for teams running a portfolio of smaller models rather than one large one.
Kafka integration is a distinguishing feature. Seldon Core v2 supports async inference over Kafka topics natively. A model can be configured to consume from a Kafka input topic and write predictions to a Kafka output topic. This unlocks event-driven inference patterns that synchronous HTTP serving cannot express.
Alibi Detect integration is built in. Alibi Detect is Seldon's open-source library for outlier detection, adversarial detection, and concept drift monitoring. With Seldon Core v2, you add a drift detector as a node in the pipeline graph and it runs inline with inference requests without a separate deployment.
BentoML + Yatai: Python-Native Packaging to Kubernetes
BentoML is a model packaging framework. The core concept is a Bento: a self-contained archive containing the model weights, the serving code, the Python dependencies, and the runtime configuration. A Bento built on your laptop runs identically in Kubernetes, because the full serving environment is captured in the archive.
The serving code is written in Python with decorators:
```python
import bentoml
from openllm import LLM

# Model handle is created at import time and shared with the service below.
llm = LLM("meta-llama/Llama-3-70B-Instruct")


@bentoml.service(
    resources={
        "gpu": 2,
        "gpu_type": "nvidia-h100-80gb",
        "memory": "200Gi",
    },
    traffic={"timeout": 300},  # request timeout in seconds
)
class LlamaService:
    def __init__(self):
        self.llm = llm

    @bentoml.api
    async def generate(self, prompt: str) -> str:
        return await self.llm.generate(prompt)
```

You run bentoml serve locally and the same code runs in Kubernetes via Yatai. No YAML configuration changes. No runtime class configuration. The Python class is the configuration.
Yatai is the Kubernetes operator that receives pushed Bentos from the BentoML CLI and deploys them as Kubernetes workloads. It manages the container build process (converting a Bento to a Docker image), the image registry, and the BentoDeployment lifecycle. The operator handles scaling, rolling updates, and integration with Kubernetes Ingress for traffic routing.
One caveat worth knowing: Yatai's last tagged release is from October 2023, and the repository README notes that Yatai for BentoML 1.2 is currently under construction. For teams self-hosting on Kubernetes, Yatai still works but is not actively updated. BentoML's current first-party deployment path for teams that want a maintained managed experience is BentoCloud. If you are evaluating BentoML for production Kubernetes today, treat Yatai as a stable-but-not-evolving option and factor in the possibility of maintenance gaps when planning long-term.
Multi-Model Serving and GPU Sharing
For teams running many models simultaneously rather than one large LLM, GPU sharing is the key cost lever.
| Operator | MIG support | Time-slicing support | MPS support | Multi-model per process | VRAM isolation |
|---|---|---|---|---|---|
| KServe | Yes (node selector + DRA) | Via node config | Via node config | No (one model per InferenceService) | Full (per pod) |
| Seldon Core v2 + MLServer | Yes | Via node config | Via node config | Yes (MLServer multi-model) | Shared (within MLServer process) |
| BentoML + Yatai | Yes (node selector) | Via node config | Via node config | No (one model per Bento) | Full (per pod) |
The practical difference: if you run 10 models averaging 4 GB VRAM each on an 80 GB H100, MLServer can pack them into a single H100 GPU process. KServe and BentoML each require a separate pod and GPU allocation per model, which either wastes GPU memory or requires explicit MIG partitioning. For the serving-level techniques that maximize throughput on that GPU (continuous batching, paged attention), see the LLM serving optimization guide.
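The packing difference can be put in numbers. This is a back-of-the-envelope sketch, assuming weights are the only VRAM consumer (no KV cache, activations, or CUDA context overhead) and that each MIG-isolated model gets one of the seven smallest slices an H100 supports; the helper names are illustrative.

```python
import math


def gpus_shared(models_gb: list[float], gpu_gb: float) -> int:
    """Pack models onto shared GPUs with first-fit decreasing; return GPUs used."""
    free: list[float] = []  # remaining VRAM on each GPU opened so far
    for need in sorted(models_gb, reverse=True):
        for i, room in enumerate(free):
            if need <= room:
                free[i] = room - need
                break
        else:
            if need > gpu_gb:
                raise ValueError(f"model needs {need} GB, GPU has {gpu_gb} GB")
            free.append(gpu_gb - need)
    return len(free)


def gpus_mig(n_models: int, slices_per_gpu: int = 7) -> int:
    """One MIG slice per model; 7 is the maximum slice count on an H100."""
    return math.ceil(n_models / slices_per_gpu)


models = [4.0] * 10  # ten models averaging 4 GB each
shared = gpus_shared(models, gpu_gb=80.0)  # multi-model process on one GPU
per_pod = len(models)                      # one whole GPU per pod
mig = gpus_mig(len(models))                # one MIG slice per pod
print(shared, per_pod, mig)
```

Under these assumptions the shared process needs one GPU, whole-GPU pods need ten, and MIG-partitioned pods need two.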
The trade-off with MLServer multi-model serving is VRAM isolation. A runaway model inference call that allocates more memory than expected can OOMKill the entire MLServer process, taking all co-located models down. With per-pod isolation (KServe, BentoML), a crashed model pod does not affect other models.
For MIG and time-slicing details at the node level, see the fractional GPU inference guide.
Autoscaling and Scale-to-Zero
| Operator | KEDA integration | Scale-to-zero | Cold start (70B, H100) | HPA |
|---|---|---|---|---|
| KServe (Serverless) | Yes | Native via Knative | 2-8 min | Yes (via Knative) |
| KServe (RawDeployment) | Yes | With KEDA-HTTP | 2-8 min | Yes |
| Seldon Core v2 | Yes | Not native (needs external config) | N/A | Yes |
| BentoML + Yatai | Limited | No | N/A | Yes |
KServe in Serverless mode gives the most complete autoscaling story. The Knative Activator buffers requests while scale-up completes, so callers do not see connection resets when a pod is cold. KEDA handles the metric triggers (queue depth, GPU utilization, custom Prometheus metrics), and Knative handles the routing and scale-to-zero lifecycle.
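KEDA feeds its triggers into the Horizontal Pod Autoscaler, whose documented rule is desired = ceil(current x metric / target). The sketch below applies that rule to a queue-depth metric. The target of 10 queued requests per replica is an assumed value, the replica bounds echo the manifests later in this article, and the HPA's default 10% tolerance is included. Activation from zero replicas is handled separately by KEDA or Knative and is not modelled here.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_r: int = 1, max_r: int = 4,
                     tolerance: float = 0.1) -> int:
    """HPA-style replica count from a per-replica average metric."""
    ratio = metric / target
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    # Otherwise scale proportionally and clamp to the configured bounds.
    return max(min_r, min(max_r, math.ceil(current * ratio)))


# Two replicas, each with 30 requests queued against a target of 10.
scale_up = desired_replicas(current=2, metric=30, target=10)
# Four replicas that are nearly idle.
scale_down = desired_replicas(current=4, metric=2, target=10)
print(scale_up, scale_down)
```

In the first case the raw result would be six replicas, which the `max_r` bound caps at four.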
The cold start caveat matters for 70B models specifically. A 140 GB BF16 model load from local NVMe on an H100 takes 40-50 seconds. CUDA graph capture adds another 25-30 seconds. Even with CRIU checkpoint restore, reaching a warm serving state takes over a minute from a fully cold pod. Scale-to-zero works economically for 70B endpoints only when idle periods are long enough (1+ hours) to justify the user-facing latency on re-wake.
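The trade-off can be roughed out with the figures quoted above. This sketch sums the two cold-start stages and estimates what an idle 2x H100 replica costs at the on-demand rate from the cost table later in this article ($6.24/hr). The eight idle hours per day is an assumed figure for illustration only.

```python
LOAD_S = (40, 50)           # weight load from local NVMe, seconds (low, high)
GRAPH_CAPTURE_S = (25, 30)  # CUDA graph capture, seconds (low, high)

# Time before a fully cold pod can serve its first request.
cold_start_s = (LOAD_S[0] + GRAPH_CAPTURE_S[0], LOAD_S[1] + GRAPH_CAPTURE_S[1])


def idle_savings_usd(idle_hours_per_day: float, hourly_rate_usd: float,
                     days: int = 30) -> float:
    """Money not spent on a replica that is scaled to zero while idle."""
    return idle_hours_per_day * hourly_rate_usd * days


savings = idle_savings_usd(idle_hours_per_day=8, hourly_rate_usd=6.24)
print(f"cold start: {cold_start_s[0]}-{cold_start_s[1]} s")
print(f"monthly savings at 8 idle h/day: ${savings:,.2f}")
```

Whether that saving is worth a wake-up delay of over a minute depends on how often the endpoint is woken and how latency-sensitive its callers are.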
For detailed KEDA trigger configuration and Knative setup, see the KEDA and Knative GPU autoscaling guide, which covers queue-depth triggers, scale-to-zero, and CRIU checkpoint restore for sub-5-second cold starts on vLLM workloads.
Canary, A/B, and Shadow Deployments
Traffic splitting for gradual rollouts is where the three operators differ most in their developer experience.
KServe handles canary via the canaryTrafficPercent field in InferenceService:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b
  namespace: inference
spec:
  predictor:
    canaryTrafficPercent: 10
    containers:
      - name: kserve-container
        image: vllm/vllm-openai:latest
        args:
          - --model
          - /mnt/models/llama-3-70b-instruct
          - --tensor-parallel-size
          - "2"
        resources:
          limits:
            nvidia.com/gpu: "2"
        # Mount the PVC so the --model path above resolves inside the container.
        volumeMounts:
          - name: model-storage
            mountPath: /mnt/models
    volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: llama-70b-pvc
```

Set canaryTrafficPercent: 10 and KServe routes 10% of traffic to the new predictor and 90% to the stable one. Promote by increasing the value to 100. Roll back by setting it to 0. The Knative traffic layer handles the split.
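The percentage semantics are easy to simulate. The sketch below is a toy weighted router, not how Knative implements the split; `pick_revision` is a hypothetical helper and the seed is fixed only to make the run repeatable.

```python
import random


def pick_revision(canary_percent: int, rng: random.Random) -> str:
    """Return 'canary' for roughly canary_percent% of calls, else 'stable'."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


rng = random.Random(0)  # fixed seed for a repeatable simulation
requests = 10_000
hits = sum(pick_revision(10, rng) == "canary" for _ in range(requests))
canary_share = hits / requests
print(f"canary share: {canary_share:.3f}")
```

Setting the percentage to 0 sends nothing to the canary and 100 sends everything, which mirrors the rollback and promotion steps described above.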
Seldon Core v2 uses Pipeline routing with weighted splits. You define two Model CRDs (old version, new version) and wire them into a Pipeline with an explicit routing node:
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: llama-canary
spec:
  steps:
    - name: router
      inputs:
        - pipeline.inputs
    - name: llama-v1
      inputs:
        - router.outputs.v1
    - name: llama-v2
      inputs:
        - router.outputs.v2
  output:
    steps:
      - llama-v1
      - llama-v2
```

The router node handles the traffic split logic. This gives more flexibility (conditional routing based on request headers, shadow mode that sends a copy to the new model without affecting the response) but requires more YAML.
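Shadow mode is worth spelling out, since it is the least familiar of these patterns. A minimal sketch, assuming two callables that stand in for the stable and candidate models: the caller always receives the stable response, and the candidate's output or failure is only recorded. The function names are hypothetical, and a real implementation would mirror the request asynchronously rather than inline.

```python
from typing import Callable

Model = Callable[[str], str]


def serve_with_shadow(request: str, primary: Model, shadow: Model,
                      shadow_log: list) -> str:
    """Answer from the primary model; mirror the request to the shadow model."""
    response = primary(request)
    try:
        shadow_log.append((request, shadow(request)))
    except Exception as exc:
        # A failing shadow model must never affect the caller.
        shadow_log.append((request, repr(exc)))
    return response


def stable(prompt: str) -> str:
    return f"v1:{prompt}"


def candidate(prompt: str) -> str:
    raise RuntimeError("candidate crashed")


log: list = []
reply = serve_with_shadow("hello", stable, candidate, log)
print(reply, len(log))
```

The recorded pairs can then be compared offline against the stable responses before any live traffic is shifted.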
BentoML + Yatai relies on Kubernetes Ingress-level traffic splitting. You deploy two BentoDeployments (old, new) and configure your Ingress controller (nginx, Istio, Envoy Gateway) to split traffic between them. No operator-native traffic management. You get the full flexibility of your Ingress controller but no built-in promotion or rollback workflow.
GPU Support Matrix
All three operators run on standard NVIDIA GPU-backed Kubernetes nodes. The differences are in how they expose GPU hardware features.
| Feature | KServe | Seldon Core v2 | BentoML + Yatai |
|---|---|---|---|
| NVIDIA Operator | Required | Required | Required |
| MIG support | Yes (node selector + DRA v1beta2) | Yes (node selector) | Yes (node selector) |
| Time-slicing | Via NVIDIA device plugin | Via NVIDIA device plugin | Via NVIDIA device plugin |
| RDMA/InfiniBand | Yes (pod annotation) | Yes (pod annotation) | Yes (pod annotation) |
| H100 SXM5 tested | Yes | Yes | Yes |
| Multi-node serving | Via Torchrun sidecar (experimental) | No | Via Ray backend (experimental) |
All three run on bare-metal GPU nodes without vendor-specific hooks or proprietary cloud APIs. The NVIDIA device plugin and CUDA toolkit installation on the node are the only requirements.
Observability: Metrics, Tracing, and Payload Logging
Observability gaps are the most common reason teams switch operators post-deployment. A framework that does not export the right metrics requires instrumenting your own serving code, which defeats the purpose of using an operator.
| Feature | KServe | Seldon Core v2 | BentoML + Yatai |
|---|---|---|---|
| Prometheus endpoint | Yes (per InferenceService) | Yes (MLServer + Seldon metrics) | Yes (BentoML metrics) |
| Token/sec metric | Via vLLM metrics | Via MLServer (limited) | Via OpenLLM metrics |
| Queue depth metric | Yes (Knative queue metrics) | Yes (MLServer queue) | Limited |
| OpenTelemetry traces | Yes (via OTLP sidecar) | Yes (built-in) | Yes (manual instrumentation) |
| Payload capture | Via explainer hooks | Yes (Seldon payload logging) | Via custom middleware |
| Grafana dashboard | Community dashboards available | Seldon-provided dashboard | Manual |
KServe's queue depth metric (via Knative) is the most useful for autoscaling trigger configuration. MLServer in Seldon Core v2 exports a rich metrics surface for multi-model deployments. BentoML's metrics coverage is adequate for single-model endpoints but weaker for fleet-level visibility.
Payload logging (capturing input/output payloads for audit, debugging, or dataset collection) is an area where Seldon Core v2 has the strongest built-in support. Payload logging is compliance-relevant for regulated industries and Seldon handles it natively at the operator level.
For production SLO definitions and latency budgeting on top of these metrics, see the LLM Inference SLO engineering guide.
Step-by-Step: Deploy Llama 3 70B on Spheron H100 Nodes
Llama 3 70B in BF16 needs about 140 GB of GPU VRAM. Two H100 SXM5 80 GB GPUs cover this with room for KV cache. All three deployments below use Spheron H100 instances with 2x GPU per replica. The Helm commands and kubectl sequences below were verified on Kubernetes 1.30 with the NVIDIA device plugin v0.15.
Option A: KServe InferenceService with vLLM Runtime
Install KServe with the cluster-serverless profile (includes Knative):
```bash
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve.yaml
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.13.0/kserve-cluster-resources.yaml
```

Define the ClusterServingRuntime for vLLM:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-vllm-runtime
spec:
  annotations:
    prometheus.kserve.io/port: "8080"
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: vllm
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.5.0
      command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
      args:
        - --port
        - "8080"
        - --model
        - /mnt/models
        - --served-model-name
        - llama-3-70b
      resources:
        requests:
          cpu: "8"
          memory: "64Gi"
        limits:
          nvidia.com/gpu: "2"
```

Create the InferenceService:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-70b
  namespace: inference
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      storageUri: pvc://llama-70b-pvc
      runtime: kserve-vllm-runtime
      args:
        - --tensor-parallel-size
        - "2"
      resources:
        limits:
          nvidia.com/gpu: "2"
```

Apply and verify:
```bash
kubectl apply -f inferenceservice.yaml -n inference
kubectl get inferenceservice llama-3-70b -n inference
# Wait for READY: True
```

Option B: Seldon Core v2 Model and Pipeline CRD
Install Seldon Core v2:
```bash
helm repo add seldon https://storage.googleapis.com/seldon-charts
helm repo update
helm install seldon-core-v2 seldon/seldon-core-v2 \
  --namespace seldon-system \
  --create-namespace \
  --set controller.clusterwide=true
```

Create the Model CRD for Llama 3 70B:
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama-3-70b
  namespace: inference
spec:
  storageUri: "pvc://llama-70b-pvc"
  requirements:
    - llama
  memory: 160000000000
  scaling:
    replicas: 1
    minReplicas: 1
    maxReplicas: 4
```

For a multi-step pipeline adding a prompt sanitizer before the LLM:
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: llama-with-sanitizer
  namespace: inference
spec:
  steps:
    - name: sanitizer
      inputs:
        - pipeline.inputs
    - name: llama-3-70b
      inputs:
        - sanitizer.outputs
  output:
    steps:
      - llama-3-70b
```

Apply and check:
```bash
kubectl apply -f model.yaml -n inference
kubectl apply -f pipeline.yaml -n inference
kubectl get model llama-3-70b -n inference
```

Option C: BentoML Service and Yatai Deployment
Define the service in Python:
```python
# service.py
import bentoml
from openllm import LLM

# Model handle is created at import time and shared with the service below.
llm = LLM("meta-llama/Llama-3-70B-Instruct")


@bentoml.service(
    resources={
        "gpu": 2,
        "gpu_type": "nvidia-h100-80gb",
        "memory": "200Gi",
    },
    traffic={"timeout": 300},  # request timeout in seconds
)
class LlamaService:
    def __init__(self):
        self.llm = llm

    @bentoml.api
    async def generate(self, prompt: str) -> str:
        return await self.llm.generate(prompt)
```

Build and push to Yatai:
```bash
# Build the Bento archive
bentoml build service:LlamaService
# Push to the Yatai registry (requires a prior `bentoml yatai login`)
bentoml push llama3-70b-service:latest
```

Create the BentoDeployment:
```yaml
apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: llama3-70b
  namespace: inference
spec:
  bento: llama3-70b-service:latest
  replicas: 1
  resources:
    requests:
      cpu: "8"
      memory: "64Gi"
    limits:
      nvidia.com/gpu: "2"
  autoscaling:
    minReplicas: 1
    maxReplicas: 4
```

Cost Comparison: Spheron H100 vs Hyperscaler Managed Kubernetes
For a 2x H100 SXM5 inference deployment running 24/7:
| Provider | Config | Hourly rate | Monthly (720 hr) |
|---|---|---|---|
| Spheron | 2x H100 SXM5 on-demand | $6.24/hr | $4,493 |
| Spheron | 2x H100 SXM5 spot | $2.28/hr | $1,642 |
| AWS EKS | p5.48xlarge (8x H100 SXM5, sold only as 8-GPU bundle) | ~$98/hr | ~$70,560 |
| GCP GKE | a3-highgpu-8g (8x H100, sold only as 8-GPU bundle; per-GPU rate x2) | ~$21.50/hr | ~$15,480 |
Spheron H100 SXM5 spot pricing runs at roughly one-third of on-demand ($1.14/hr per GPU vs $3.12/hr per GPU), so spot is the right target for interruption-tolerant batch or async inference. H200 spot is also available at a meaningful discount: $1.40/hr per GPU vs $4.62/hr on-demand, making H200 the better spot target for workloads running large context windows that benefit from the extra VRAM.
AWS does not offer a standalone 2x H100 instance. The closest option is a full p5.48xlarge (8x H100) at approximately $98/hr, which you share across the team or pay for in full. This makes the effective per-GPU cost significantly higher than the per-GPU rate implies when you cannot fill the node. GCP's A3 pricing applies similarly.
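The monthly column is the hourly rate multiplied by 720 hours. A short sketch that reproduces it from the rates quoted in the table (copied from above, not fetched live; the dictionary keys are shorthand labels):

```python
HOURS_PER_MONTH = 720

# Hourly rates in USD, as listed in the table above.
hourly = {
    "Spheron 2x H100 on-demand": 6.24,
    "Spheron 2x H100 spot": 2.28,
    "AWS p5.48xlarge (8x H100)": 98.00,
    "GCP a3-highgpu-8g (2-GPU share)": 21.50,
}

monthly = {name: round(rate * HOURS_PER_MONTH) for name, rate in hourly.items()}

# Per-GPU spot rate as a fraction of the per-GPU on-demand rate.
spot_ratio = 1.14 / 3.12

for name, cost in monthly.items():
    print(f"{name}: ${cost:,}/month")
print(f"spot / on-demand: {spot_ratio:.2f}")
```

The ratio comes out slightly above one-third, consistent with the "roughly one-third" description of Spheron spot pricing.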
Pricing fluctuates based on GPU availability. The prices above are based on 28 May 2026 and may have changed. Check current GPU pricing for live rates.
Decision Matrix: When to Use Which Operator
| Use case | KServe | Seldon Core v2 | BentoML + Yatai |
|---|---|---|---|
| Small team, rapid prototyping | Acceptable | Avoid | Recommended |
| Multi-step inference pipeline (preprocess, predict, explain) | Avoid | Recommended | Avoid |
| LLM-as-a-service endpoint | Recommended | Acceptable | Acceptable |
| Event-driven async inference | Avoid | Recommended | Avoid |
| Strict CNCF governance requirements | Recommended | Avoid | Avoid |
| Python-first ML team | Acceptable | Avoid | Recommended |
| Drift monitoring and explainability | Acceptable (plugin) | Recommended | Avoid |
| Scale-to-zero required | Recommended | Avoid | Avoid |
The single most common wrong choice: teams with a Python-first ML culture selecting KServe because it is the "standard" CNCF operator, then spending weeks debugging InferenceService YAML and Knative routing rules that their data scientists cannot understand or modify. BentoML exists precisely for teams that write Python and want their infrastructure to reflect that.
Conversely, teams building production inference APIs for regulated industries (healthcare, finance) where payload logging, drift detection, and explainability are requirements almost always end up at Seldon Core v2. KServe can do these things through plugin integration, but Seldon builds them in.
Migration Paths and Backend Bridges
KServe to Seldon Core v2: Map InferenceService.spec.predictor to a Seldon Model CRD. The serving runtime container stays the same; only the CRD wrapping changes. The main work is converting traffic splitting logic (canaryTrafficPercent) to Seldon Pipeline routing nodes.
BentoML to KServe: A Bento built by BentoML produces a Docker image. That image can be referenced directly as a custom runtime container in a KServe InferenceService. The migration is primarily moving serving configuration from Python decorators to InferenceService YAML.
KServe to Ray Serve: KServe's InferenceGraph can route to external Ray Serve endpoints. This lets you run stateful multi-actor serving logic in Ray Serve while using KServe for traffic management and observability. See the Ray Serve GPU cloud LLM deployment guide for the Ray Serve setup.
KServe with Triton backend: KServe has a first-class ClusterServingRuntime for NVIDIA Triton. The InferenceService spec stays the same; only the container changes to nvcr.io/nvidia/tritonserver. See the Triton Inference Server deployment guide for model repository setup.
KServe, Seldon Core, and BentoML all run on Spheron's bare-metal GPU nodes without vendor lock-in. Spin up H100 or H200 capacity in under two minutes and run any operator against your own cluster.
Quick Setup Guide
Install the KServe Helm chart with the cluster-serverless profile. Configure the InferenceService CRD with a vLLM runtime container and an H100 GPU resource limit. Point the storage URI to your model bucket or use a PVC with pre-pulled weights.
Install Seldon Core v2 via Helm with the MLServer runtime enabled. Create a Model CRD referencing your Llama 3 70B weights and set the GPU request to 2 x H100 SXM5 for adequate VRAM. Create a Pipeline CRD if you need multi-step inference with pre/post-processing.
Write a Python service class with the @bentoml.service decorator, define the LLM endpoint using OpenLLM, and run bentoml build to create the Bento archive. Push the Bento to the Yatai registry, then create a BentoDeployment CRD with 2 x H100 replicas.
For KServe, install KEDA and configure a ScaledObject targeting the InferenceService queue depth metric. Set minReplicaCount to 0 for scale-to-zero behavior. For Seldon Core v2, create a KEDA ScaledObject against the Prometheus queue-depth metric exported by MLServer.
Frequently Asked Questions
KServe is a CNCF Incubating project that standardizes ML model serving on Kubernetes via the InferenceService CRD. It handles autoscaling (including scale-to-zero via Knative), canary rollouts, and pluggable runtime backends (vLLM, Triton, HuggingFace TGI). Use KServe when you want a spec-driven, CNCF-aligned operator that integrates tightly with Knative for serverless inference and supports the ModelCar container pattern for large model loading.
Seldon Core v2 introduces the Inference Pipeline model: you compose multi-step inference graphs (route, transform, explain, monitor) by wiring together individual inference nodes in a DAG. The core abstraction is the Model and Pipeline CRD instead of InferenceService. It ships with a native multi-model server (MLServer) that runs multiple models on a single GPU process, and full Kafka-based async pipeline support. Use Seldon Core v2 when you need stateful multi-step pipelines, built-in drift detection via Alibi Detect, or async inference over message queues.
BentoML is a Python-native framework for packaging ML models as self-contained Bento archives (model + code + dependencies). Yatai is the Kubernetes operator that deploys Bentos as scalable inference services. The differentiation is developer ergonomics: you define your serving logic in Python with decorators, run it locally identically to how it runs in production, then push the Bento to Yatai. Use BentoML + Yatai when your team writes custom pre/post-processing logic, needs first-class LLM batching via OpenLLM, or wants to iterate quickly without YAML-heavy operator configuration.
KServe supports fractional GPU via Kubernetes resource requests and can use NVIDIA MIG with node selectors or DRA resource claims. Seldon Core v2 with MLServer supports multi-model serving on a single GPU, enabling de facto GPU sharing across models. BentoML exposes GPU resources via standard Kubernetes resource limits; fractional GPU requires NVIDIA MPS or time-slicing configured at the node level, with resource requests passed through. All three operators work with NVIDIA MIG when the cluster has MIG partitions configured.
KServe has the most mature Kubernetes-native autoscaling via KEDA and Knative: it supports request-based, custom-metric (queue depth, GPU utilization), and scale-to-zero with Knative Serving. Seldon Core v2 uses HPA and KEDA but lacks scale-to-zero in the core operator - you need additional configuration. BentoML + Yatai uses HPA with custom metrics. For cold-start-sensitive LLM workloads, KServe with Knative is the most complete solution, but cold start times for 70B models still exceed 2 minutes regardless of operator.
