MLOps Pipeline Orchestration on GPU Cloud: Kubeflow, ZenML, and Metaflow for AI Training and Fine-Tuning (2026)

Running ML pipelines on neocloud GPU providers is a different problem than using SageMaker Pipelines or Vertex AI Pipelines. You own the orchestrator, you control the node pools, and you decide how to handle spot preemption and checkpoint recovery. This post compares Kubeflow Pipelines, ZenML, and Metaflow for self-hosted MLOps on GPU cloud, covering cluster architecture, spot scheduling, checkpoint volumes, a concrete 4-stage LoRA fine-tuning DAG, and the cost math for mixing on-demand and spot GPU nodes across pipeline stages. MLOps is one of seven layers in the broader AI infrastructure landscape, and compute is the foundation everything else runs on.

Why ML Pipelines on Neocloud GPUs Are Different

SageMaker Pipelines and Vertex AI Pipelines are convenient because the orchestrator, container registry, artifact store, IAM, and compute are all wired together by the provider. You pay for that convenience in two ways: per-step orchestration charges on top of compute costs, and lock-in to a specific model registry, storage format, and scaling policy.

On a neocloud GPU provider, you self-host the orchestrator and pay only for GPU compute time your steps actually consume. There is no per-step surcharge. More importantly, you can use spot GPU pricing for training steps, which cuts the cost of long-running fine-tuning jobs substantially: at the rates quoted later in this post, spot is 53% cheaper than on-demand for H200 and 74% cheaper for H100 SXM5, though the discount varies with GPU model and provider availability.

The trade-off is that you own the control plane. The Kubeflow Pipelines server, ZenML tracking server, or Metaflow metadata service needs to stay running between pipeline runs. That is real ops overhead, especially for small teams. For a broader view of what Kubernetes GPU orchestration looks like at the cluster level, the Kubernetes GPU orchestration guide covers DRA, KAI Scheduler, and Grove as the scheduling layer underneath MLOps pipelines.

The practical implication: neocloud MLOps makes sense for teams that run pipelines regularly and where the GPU compute cost dominates (usually true for any training or fine-tuning job over an hour). It is harder to justify for teams running occasional one-off jobs, where the ops overhead per run is high relative to the savings.

Kubeflow Pipelines vs ZenML vs Metaflow: Architecture Trade-Offs

Orchestrator	Control Plane Overhead	GPU Resource Control	Local Dev Experience	Best For
Kubeflow Pipelines	High (full KFP stack in K8s)	Fine-grained per Op	Poor (needs real cluster)	Platform teams building shared MLOps infra
ZenML	Medium (ZenML server + pluggable orchestrator)	Good via step settings	Good (local orchestrator for testing)	Teams migrating from SageMaker/Vertex
Metaflow	Low (metadata service only)	GPU via @kubernetes decorator	Excellent (steps run locally)	Data-scientist-led teams, Netflix-style DAGs

Kubeflow Pipelines is the most Kubernetes-native option. Every pipeline step becomes a Kubernetes pod with explicit resource requests. In KFP SDK v2 you set accelerator limits, node selectors, and tolerations on individual tasks, which means a training step can target spot H100 nodes while the eval step targets on-demand A100 nodes in the same pipeline run. The cost: you need to manage the KFP API server, frontend, MySQL backend, and MinIO artifact store in your cluster. For platform teams managing MLOps infra for multiple ML teams, that overhead pays off through reuse.

ZenML acts as an abstraction layer over orchestrators rather than being one itself. You can point the same ZenML pipeline at a local process, a Kubernetes cluster, or Airflow, switching backends without rewriting step code. The ZenML server handles experiment tracking, artifact lineage, and model versioning. This is the path of least resistance for teams that built pipelines on SageMaker or Vertex and want to move to neocloud GPU compute without rebuilding from scratch.

Metaflow is the simplest operationally. Originally built at Netflix for data science workflows, it uses Python decorators to annotate steps with resource requirements. The @kubernetes decorator routes a step to a GPU node on your cluster. Local development is excellent: steps run locally without a cluster, and you push to Kubernetes only when you're ready to scale. The trade-off is less flexibility in DAG topology and fewer built-in integrations compared to Kubeflow.

Setting Up a Kubernetes Cluster on GPU Cloud for Multi-Stage Training DAGs

The right architecture for a multi-stage training pipeline uses two node pools: an on-demand CPU pool for the MLOps control plane and lightweight steps (data prep, registry push), and a spot GPU pool for training and eval steps where preemption is tolerable.

Label on-demand nodes with workload=control and GPU training nodes with workload=training. For eval steps that need a GPU but where rerunning is cheap (under 1 hour), you can add a separate eval node pool with cheaper GPU models. Add a spot taint to the GPU pool so only steps that explicitly tolerate it get scheduled there:

yaml

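# Example: GPU training step node selector + spot toleration
# Pod-spec fragment: `resources` belongs under the training container,
# while `nodeSelector` and `tolerations` sit at the pod spec level.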
resources:
  limits:
    nvidia.com/gpu: "1"
    memory: "120Gi"
    cpu: "16"
nodeSelector:
  workload: training
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
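To make the taint mechanics concrete, here is a minimal Python sketch of how a pod's tolerations are matched against a node's taints. It is a simplified model of the Kubernetes rule (it only considers NoSchedule taints and ignores PreferNoSchedule and NoExecute timing), and the function names `tolerates` and `can_schedule` are illustrative, not part of any Kubernetes client library.

```python
from typing import Dict, List

Taint = Dict[str, str]
Toleration = Dict[str, str]


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint.get("effect", ""):
        return False
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint["key"]
    # operator == "Equal": key and value must both match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations: List[Toleration], node_taints: List[Taint]) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint.get("effect") == "NoSchedule"
    )


# The spot taint and the training-step toleration from the YAML example.
spot_taint = {"key": "spot", "value": "true", "effect": "NoSchedule"}
training_tolerations = [
    {"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
]

print(can_schedule(training_tolerations, [spot_taint]))  # training step
print(can_schedule([], [spot_taint]))                    # step with no toleration
```

A step without the toleration (registry push, for example) is kept off the spot pool by the taint alone; the node selector is what pulls the training step onto it.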

For setup steps, follow your GPU cloud provider's docs for cluster provisioning and NVIDIA device plugin configuration. The device plugin DaemonSet is required on GPU nodes for the nvidia.com/gpu resource to be visible to the scheduler.

For distributed training stages that need NVLink-connected multi-GPU nodes (70B+ models with FSDP or tensor parallelism), provision NVLink-enabled 8xH100 SXM5 nodes for those steps specifically. The distributed LLM training guide covers multi-node FSDP and NCCL tuning for those configurations.

Pricing context for node pool selection (rates as of 04 May 2026):

H100 SXM5: $3.10/hr on-demand, $0.80/hr spot (best for LoRA fine-tuning, SFT)
A100 80GB: $1.04/hr on-demand (strong for eval steps)
RTX 4090: $0.53/hr on-demand (cost-effective for lightweight eval and inference testing)

Wiring GPU Pods, Spot Nodes, and Checkpoint Volumes into a Reproducible Pipeline

Spot node preemption is the main reliability challenge in GPU pipeline design. When a spot node is reclaimed, the training pod dies mid-epoch. Without checkpoints, the step restarts from scratch. With checkpoints stored on a PVC backed by network storage, the step restarts from the last saved checkpoint.

The critical rule: use PVC-backed network storage for checkpoints, not hostPath volumes. HostPath volumes are local to the node. When the spot node is terminated, the volume disappears with it.
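The resume logic that this rule enables can be sketched in plain Python. This is a toy illustration, not a framework integration: it assumes the checkpoint directory is the PVC mount path (for example `/checkpoints`), the function names `save_checkpoint`, `load_latest_checkpoint`, and `train` are hypothetical, and a small JSON file stands in for real model and optimizer state.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a preemption never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step-{step:08d}.json"
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest complete checkpoint, or None on a fresh start."""
    if not ckpt_dir.exists():
        return None
    # Zero-padded step numbers sort correctly as strings.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> int:
    """Toy training loop; returns the step it resumed from (0 on a fresh start)."""
    ckpt = load_latest_checkpoint(ckpt_dir)
    start = ckpt["step"] if ckpt else 0
    for step in range(start + 1, total_steps + 1):
        # ... one optimizer step would run here ...
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / step})
    return start
```

Calling `train(Path("/checkpoints"), total_steps=2000, save_every=500)` a second time after an interruption picks up from the last saved step instead of step 0. The write-then-rename pattern matters on spot nodes because a pod can be killed in the middle of a write.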

python

# Kubeflow Pipelines v2: checkpoint PVC using kfp-kubernetes
# Install: pip install kfp kfp-kubernetes
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

@dsl.pipeline(name="fine-tuning-pipeline")
def fine_tuning_pipeline():
    # Provision the checkpoint volume from a network-backed StorageClass
    pvc_task = kubernetes.CreatePVC(
        pvc_name="checkpoint-pvc",
        access_modes=["ReadWriteOnce"],
        size="100Gi",
    )
    # train_component is your own training component, defined elsewhere
    train_task = train_component(checkpoint_path="/checkpoints")
    train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    # Mount the PVC, then pin the task to the tainted spot GPU pool
    kubernetes.mount_pvc(train_task, pvc_name=pvc_task.outputs["name"], mount_path="/checkpoints")
    kubernetes.add_node_selector(train_task, label_key="workload", label_value="training")
    kubernetes.add_toleration(train_task, key="spot", operator="Equal", value="true", effect="NoSchedule")

python

# ZenML: step with GPU resources and a node selector (Kubernetes orchestrator)
# Settings keys follow ZenML's Kubernetes pod settings; verify them against your ZenML version
from zenml import step
from zenml.config import ResourceSettings

@step(
    settings={
        "resources": ResourceSettings(gpu_count=1, memory="120GB"),
        "orchestrator": {"pod_settings": {"node_selectors": {"workload": "training"}}},
    }
)
def train_lora(base_model: str, dataset: str) -> str:
    # training logic - save checkpoint to ZenML artifact store
    ...

python

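# Metaflow: GPU step with checkpoint
# @checkpoint comes from the metaflow-checkpoint extension:
#   pip install metaflow metaflow-checkpoint
from metaflow import FlowSpec, step, kubernetes, checkpoint, retry, current

class FineTuningFlow(FlowSpec):
    @retry(times=3)
    @checkpoint
    @kubernetes(
        gpu=1, cpu=16, memory=120000,
        node_selector={"workload": "training"},
        tolerations=[{"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}],
    )
    @step
    def train(self):
        # On a retry, the latest checkpoint is restored into current.checkpoint.directory
        model = None
        if current.checkpoint.is_loaded:
            # load_checkpoint is a placeholder for your framework's loader
            model = load_checkpoint(current.checkpoint.directory)
        # ... training logic; call current.checkpoint.save() every N steps
        self.next(self.evaluate)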

For spot-heavy pipelines, checkpoint every N steps (e.g., every 500 training steps), not just at the end of each epoch. On average, a preemption costs about half a checkpoint interval of work. For a 4-hour training run on a Spheron H100 spot node with 30% preemption probability, checkpointing only at epoch boundaries can mean an average of 1.5 hours of lost work per preemption event. Checkpointing every 500 steps reduces that to minutes. The spot GPU training case study has real numbers on checkpoint overhead and recovery time from a 70B training run.
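The arithmetic behind that comparison can be sketched as follows. It assumes a preemption lands uniformly at random inside a checkpoint interval, so the expected loss is half the interval. The 3-hour epoch and the 2 seconds per training step are hypothetical values chosen for illustration; the 3-hour epoch is simply the interval that reproduces the 1.5-hour figure above.

```python
def expected_lost_minutes(checkpoint_interval_minutes: float) -> float:
    """Expected work lost per preemption, assuming the preemption lands
    uniformly at random inside a checkpoint interval."""
    return checkpoint_interval_minutes / 2


def interval_from_steps(steps_between_checkpoints: int, seconds_per_step: float) -> float:
    """Convert a step-based checkpoint cadence into minutes."""
    return steps_between_checkpoints * seconds_per_step / 60


# Hypothetical numbers for illustration only.
coarse = expected_lost_minutes(180)  # one checkpoint per 3-hour epoch
fine = expected_lost_minutes(interval_from_steps(500, 2.0))  # every 500 steps at 2 s/step

print(f"per-epoch checkpoints: ~{coarse:.0f} min lost per preemption")
print(f"every 500 steps:       ~{fine:.1f} min lost per preemption")
```

The same functions can be rerun with your own measured step time; the checkpoint write time itself is not modeled here and should be weighed against the reduced loss.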

If your pipelines trigger LLM agent steps that need exactly-once semantics, see the AI agent workflow orchestration guide covering Temporal, Inngest, and Restate.

Fine-Tuning Pipeline Walkthrough: Data Prep, LoRA Training, Eval, Registry Push

Here is a concrete 4-stage pipeline for fine-tuning a 7B-13B model using LoRA. Each stage has different compute requirements and spot tolerance:

Stage	Node Type	GPU	Mode	Est. Duration	Est. Cost
Data prep	CPU	None	On-demand	30 min	~$0.05
LoRA training (8B)	GPU	H100 SXM5	Spot	4 hr	~$3.20
Eval (MMLU/HellaSwag)	GPU	A100 80GB	On-demand	45 min	~$0.78
Registry push	CPU	None	On-demand	5 min	~$0.01

Stage 1 (data prep) loads and tokenizes the dataset on a CPU node. No GPU required. Stage 2 (LoRA training) runs PEFT on a spot H100, saving adapter weights and checkpoints to the PVC. Stage 3 (eval) runs the lm-eval harness on MMLU and HellaSwag on an on-demand A100. Stage 4 (registry push) merges the adapter into the base model and pushes to Hugging Face Hub or a private registry on a CPU node.
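The per-stage estimates in the table can be reproduced with a few lines of Python. The GPU rates come from the pricing list earlier in this post; the CPU rate of $0.10/hr is an assumption back-derived from the table's data prep estimate, not a quoted price, and the `Stage` class and `pipeline_cost` helper are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    hours: float
    rate_per_hour: float  # USD per hour for the node the stage runs on

    @property
    def cost(self) -> float:
        return self.hours * self.rate_per_hour


def pipeline_cost(stages: List[Stage]) -> float:
    """Total cost of one pipeline run in USD."""
    return sum(stage.cost for stage in stages)


CPU_RATE = 0.10        # assumed, back-derived from the table (~$0.05 for 30 min)
H100_SPOT = 0.80       # from the pricing list
H100_ON_DEMAND = 3.10  # from the pricing list
A100_ON_DEMAND = 1.04  # from the pricing list


def build_stages(training_rate: float) -> List[Stage]:
    return [
        Stage("data prep", 0.5, CPU_RATE),
        Stage("lora training", 4.0, training_rate),
        Stage("eval", 0.75, A100_ON_DEMAND),
        Stage("registry push", 5 / 60, CPU_RATE),
    ]


mixed = pipeline_cost(build_stages(H100_SPOT))
all_on_demand = pipeline_cost(build_stages(H100_ON_DEMAND))

print(f"spot training:      ${mixed:.2f} per run")
print(f"on-demand training: ${all_on_demand:.2f} per run")
```

Because the training stage dominates the total, moving only that stage to spot captures nearly all of the available savings.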

python

# 4-stage Kubeflow fine-tuning pipeline (KFP SDK v2)
# Install: pip install kfp kfp-kubernetes
# data_prep_component, lora_train_component, eval_component, and
# registry_push_component are your own components, defined elsewhere.
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

@dsl.pipeline(name="lora-fine-tuning")
def lora_pipeline(base_model: str = "meta-llama/Meta-Llama-3-8B", dataset: str = "gs://my-bucket/train.jsonl"):
    # Stage 1: Data prep (CPU only)
    prep_task = data_prep_component(dataset=dataset)
    prep_task.set_cpu_limit("8").set_memory_limit("32G")

    # Stage 2: LoRA training (spot H100)
    train_task = lora_train_component(
        processed_data=prep_task.outputs["output"],
        base_model=base_model
    )
    train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    kubernetes.add_node_selector(train_task, label_key="workload", label_value="training")
    kubernetes.add_toleration(train_task, key="spot", operator="Equal", value="true", effect="NoSchedule")
    train_task.set_retry(num_retries=3)  # rerun the step after a spot preemption
    train_task.after(prep_task)

    # Stage 3: Eval (on-demand, cheaper GPU OK)
    eval_task = eval_component(adapter_path=train_task.outputs["adapter"])
    eval_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    kubernetes.add_node_selector(eval_task, label_key="workload", label_value="eval")
    eval_task.after(train_task)

    # Stage 4: Registry push (CPU)
    push_task = registry_push_component(
        adapter_path=train_task.outputs["adapter"],
        eval_score=eval_task.outputs["score"]
    )
    push_task.after(eval_task)

For multi-GPU training stages (70B+ models, or 8B with aggressive throughput targets), see the NCCL tuning guide for environment variable configuration and topology-aware all-reduce setup.

Cost-Aware Scheduling: Mixing On-Demand and Spot GPUs Across Pipeline Stages

Pipeline stages fall into three categories based on preemption tolerance:

Always spot: Long-running training steps with checkpoint recovery. These are the most expensive steps and get the largest savings from spot pricing. A 4-hour H100 SXM5 training step costs $3.20 on spot versus $12.40 on-demand, a 74% cost reduction.

Sometimes spot: Eval steps where rerunning costs under 1 hour. If the eval step is preempted, you lose 45 minutes of compute. Whether that is acceptable depends on how often your spot pool is preempted and whether your pipeline has a hard time constraint.

Never spot: Registry push, metadata writes, final aggregation steps, and anything that writes to an external system. These steps are short, cheap on on-demand, and have side effects that are hard to safely retry.
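The three categories can be written down as a small decision function. This is one possible encoding, not a rule from any orchestrator: the `StepProfile` fields, the `capacity_for` name, and the 1-hour rerun budget default are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StepProfile:
    name: str
    est_hours: float
    has_checkpoints: bool
    has_external_side_effects: bool


def capacity_for(
    step: StepProfile,
    rerun_budget_hours: float = 1.0,
    hard_deadline: bool = False,
) -> str:
    """Map a step to 'spot' or 'on-demand' following the three categories."""
    # Never spot: side effects on external systems are hard to retry safely.
    if step.has_external_side_effects:
        return "on-demand"
    # Always spot: checkpointed steps lose little work on preemption.
    if step.has_checkpoints:
        return "spot"
    # Sometimes spot: cheap-to-rerun steps, unless the pipeline has a hard deadline.
    if step.est_hours <= rerun_budget_hours and not hard_deadline:
        return "spot"
    return "on-demand"


train_step = StepProfile("lora-train", 4.0, True, False)
eval_step = StepProfile("eval", 0.75, False, False)
push_step = StepProfile("registry-push", 5 / 60, False, True)

for s in (train_step, eval_step, push_step):
    print(s.name, "->", capacity_for(s))
```

In practice the "sometimes spot" branch is where observed preemption rates for your spot pool should feed in; the boolean `hard_deadline` flag is a deliberately coarse stand-in for that judgment.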

Pricing comparison for GPU nodes (rates as of 04 May 2026):

GPU	On-Demand	Spot	Savings	Best For
H100 SXM5	$3.10/hr	$0.80/hr	74%	LoRA fine-tuning, SFT
H200 SXM5	$2.51/hr	$1.19/hr	53%	70B+ FSDP training
A100 80GB PCIe	$1.04/hr	$1.14/hr	n/a*	Eval, inference steps
B200 SXM6	N/A	$2.12/hr	N/A	Latest-gen training
RTX 4090	$0.53/hr	N/A	N/A	Lightweight eval

*A100 80GB PCIe spot ($1.14/hr) is higher than on-demand ($1.04/hr) at current Spheron rates. This is accurate, not a typo. Reservation pressure can make spot more expensive than on-demand for popular GPU models. Verify live pricing before building cost assumptions around A100 spot.

Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing for live rates.
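Since spot is not always the cheaper option, it helps to compute the savings column rather than assume it. The sketch below uses the rates from the table above (dated 04 May 2026); the `RATES` dictionary layout and the helper names are illustrative, and `None` marks a mode the table lists as N/A.

```python
from typing import Dict, Optional, Tuple


def spot_savings_pct(on_demand: Optional[float], spot: Optional[float]) -> Optional[float]:
    """Percent saved by using spot. Negative when spot is pricier, None if a rate is missing."""
    if on_demand is None or spot is None:
        return None
    return (on_demand - spot) / on_demand * 100


def cheaper_mode(on_demand: Optional[float], spot: Optional[float]) -> str:
    """Pick the cheaper available mode for a GPU model."""
    if spot is None:
        return "on-demand"
    if on_demand is None:
        return "spot"
    return "spot" if spot < on_demand else "on-demand"


# (on-demand $/hr, spot $/hr) from the pricing table.
RATES: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "H100 SXM5": (3.10, 0.80),
    "H200 SXM5": (2.51, 1.19),
    "A100 80GB PCIe": (1.04, 1.14),
    "B200 SXM6": (None, 2.12),
    "RTX 4090": (0.53, None),
}

for gpu, (on_demand, spot) in RATES.items():
    pct = spot_savings_pct(on_demand, spot)
    label = "n/a" if pct is None else f"{pct:.0f}%"
    print(f"{gpu}: savings {label}, use {cheaper_mode(on_demand, spot)}")
```

Running a check like this against live prices before each pipeline run guards against cases like the A100 row, where routing to spot would raise the bill.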

In Kubeflow Pipelines, you can implement cost-aware routing by passing a pipeline parameter for estimated step duration and using dsl.If to route short steps to on-demand nodes and long steps to spot. This is more complex to implement than static node selectors, but it lets the pipeline self-select the cheaper path for variable-length training runs.

Integrating with vLLM, Triton, and Ray Serve for Auto-Deploy After Training

A natural extension of the 4-stage pipeline is a conditional deploy step: if the eval score clears a threshold, automatically push the model to a serving endpoint.

python

# Kubeflow: conditional deploy after eval threshold (KFP SDK v2)
# dsl.If replaces the deprecated dsl.Condition in recent KFP SDK v2 releases
# This block goes inside the @dsl.pipeline function, after push_task is defined
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

with dsl.If(eval_task.outputs["score"] > 0.75, name="deploy-condition"):
    # vllm_deploy_component is your own component, defined elsewhere
    deploy_task = vllm_deploy_component(
        model_path=push_task.outputs["registry_path"]
    )
    deploy_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    kubernetes.add_node_selector(deploy_task, label_key="workload", label_value="inference")
    deploy_task.after(push_task)

For vLLM: the deploy step runs vllm serve <merged-model-path> on an on-demand H100 node. Use an on-demand node for inference, not spot, since serving interruptions affect live traffic. H100 on Spheron is available in both PCIe and SXM5 variants; choose based on your memory bandwidth requirements.

For Triton: the deploy step converts the model to TensorRT-LLM engine format, pushes the engine to the Triton model repository, and triggers a model reload via the Triton management API. The conversion step is GPU-intensive and should run on a separate on-demand node before the serving step.

For Ray Serve: use serve.run() with a vLLM deployment class to update a running endpoint in-place. Ray Serve handles the rolling update to avoid dropping requests during the model swap. The pipeline step just calls the Serve HTTP API to trigger the update.

The key constraint: keep the deploy step on an on-demand node and treat it as a "never spot" step. A preemption during model upload or Triton reload leaves the serving endpoint in an inconsistent state.

Once your model is trained, KServe, Seldon Core, and BentoML are the three Kubernetes-native operators worth evaluating for production serving.

When to Pick Each: Decision Matrix by Team Size, Stack, and Workload

Team Type	Orchestrator	Why
Solo researcher / small data science team	Metaflow	Minimal ops, excellent local dev, GPU via decorator
Platform/MLOps team building shared infra for multiple ML teams	Kubeflow Pipelines	Per-Op GPU control, shared cluster, UI for experiment tracking
Team currently on SageMaker Pipelines migrating to neocloud	ZenML	Code reuse, pluggable backends, fast migration path
Team wanting to avoid Kubernetes knowledge requirement	ZenML or Metaflow	Both hide most K8s complexity behind Python APIs
Team needing complex DAG branching and custom UI	Kubeflow	`dsl.If`, `dsl.ParallelFor`, full KFP UI
Team needing fastest iteration cycle (dev to cloud in minutes)	Metaflow	Local execution + @kubernetes = minimal friction

Metaflow works best for data-science-led teams where the ML engineers write training code and the platform team is small or non-existent. The decorator model fits naturally into notebooks and scripts. The main limitation is DAG flexibility: Metaflow's branching and joining is less expressive than KFP's DSL.

Kubeflow Pipelines is the right call when you have a platform team maintaining shared MLOps infrastructure for multiple ML teams. The investment in KFP setup pays off when 5-10 teams are running pipelines on the same cluster. GPU resource control per Op is the strongest of the three tools, which matters when you have heterogeneous node pools with different GPU models.

ZenML makes the most sense for teams with existing pipelines on SageMaker or Vertex that need to move to neocloud without a rewrite. ZenML's orchestrator abstraction means you can migrate step-by-step, running some steps locally and some on Kubernetes while you port the rest. For budget-sensitive teams doing evaluation work, rent A100 on Spheron for eval steps while keeping training on H100 SXM5 spot.

Summary

Kubeflow, ZenML, and Metaflow each solve a different version of the MLOps orchestration problem. Kubeflow is the most powerful and most demanding to operate. Metaflow is the simplest with the best developer experience. ZenML sits in the middle with the best migration story from managed services. For GPU cloud deployments specifically, all three can target spot nodes for training steps and on-demand nodes for eval and deploy, which is the main cost lever.

ML pipeline orchestration on GPU cloud is a real alternative to managed services - you trade some ops overhead for full control over scheduling, pricing, and tooling. Spheron provides the underlying GPU compute layer: spot H100 and H200 nodes for training stages, on-demand A100 nodes for eval, and NVLink multi-GPU nodes for distributed training DAGs.


Quick Setup Guide

Provision a GPU Kubernetes cluster for MLOps
Deploy a Kubernetes cluster on your GPU cloud provider with at least one on-demand node pool for the control plane and MLOps services, and one spot node pool for training steps. Label on-demand nodes with workload=control and GPU nodes with workload=training. Configure the NVIDIA device plugin DaemonSet and a CSI storage driver for PVC-backed checkpoint volumes.
Install and configure your chosen MLOps orchestrator
For Kubeflow Pipelines: apply the standalone KFP manifest and wait for all pods in the kubeflow namespace to reach Running state. For ZenML: pip install zenml, run zenml init, and register a KubernetesOrchestrator pointing at your cluster context. For Metaflow: install the Metaflow service (metadata + artifact store) and configure your compute backend with GPU resource annotations.
Define pipeline steps with GPU resource requests
In Kubeflow Pipelines v2, call set_accelerator_type('nvidia.com/gpu').set_accelerator_limit(1) on each task that needs a GPU, then use kubernetes.add_node_selector from the kfp-kubernetes package to target GPU node pools. In ZenML, pass ResourceSettings(gpu_count=1) under the 'resources' settings key and set the node selector through the Kubernetes orchestrator's pod_settings on training steps. In Metaflow, apply @kubernetes(gpu=1, cpu=16, memory=120000) to training steps.
Wire checkpoint volumes into training steps
Create a PersistentVolumeClaim for checkpoint storage. In Kubeflow Pipelines v2, create the PVC with kubernetes.CreatePVC and attach it with kubernetes.mount_pvc (both from the kfp-kubernetes package), then pass the mount path to the training component. In ZenML, use the ArtifactStore integration to save model checkpoints as tracked artifacts. In Metaflow, use @checkpoint or write directly to S3 artifact paths at the end of each step.
Configure spot node scheduling and retry policies
For spot training steps: add a toleration for the spot taint and call set_retry(num_retries=3) on the task in Kubeflow Pipelines v2. In ZenML, set the step retry count in the step or pipeline run configuration. In Metaflow, apply @retry(times=3). Ensure training steps resume from the latest checkpoint on retry.

Frequently Asked Questions

What is the difference between Kubeflow Pipelines, ZenML, and Metaflow for GPU cloud MLOps?

Kubeflow Pipelines is the most Kubernetes-native option and requires a full KFP control plane. ZenML is an abstraction layer that can target multiple orchestrators (Kubernetes, Airflow, Vertex) and has the fastest local-to-cloud migration path. Metaflow is the simplest for data scientists: it uses decorators and handles resource requests transparently, but has less fine-grained control over DAG topology. For GPU cloud deployments where you control the cluster, Kubeflow gives the most control; ZenML is best for teams migrating from managed services; Metaflow is best for data-science-led teams that want minimal ops overhead.

How do I run Kubeflow Pipelines on a GPU cloud cluster?

Deploy a Kubernetes cluster on your GPU cloud provider, install Kubeflow Pipelines using the standalone deployment manifest (not the full Kubeflow distribution), configure a default StorageClass for PVC-based artifact storage, label your GPU nodes (this post uses workload=training), and set node selectors plus resource limits (nvidia.com/gpu: 1) on your pipeline tasks. For spot GPU nodes, add tolerations to the training tasks so they schedule onto preemptible nodes while eval and registry-push steps stay on on-demand nodes.

Can I use spot GPU instances for MLOps training pipelines without losing progress?

Yes, with checkpoint-based fault tolerance. Store intermediate model checkpoints to a persistent volume or object storage (S3-compatible) every N training steps or at the end of each epoch. If a spot node is preempted, the pipeline step retries from the last checkpoint rather than from scratch. In Kubeflow Pipelines v2, call set_retry on the task. In ZenML, use the @step decorator with retry settings. In Metaflow, use @retry together with the @kubernetes decorator. The checkpoint volume must survive the pod termination, so use a PVC backed by network storage, not the node's local disk.

How does MLOps on GPU cloud neoclouds differ from SageMaker Pipelines or Vertex AI Pipelines?

Managed services like SageMaker and Vertex package the orchestrator, container registry, artifact store, and compute into a single billing unit, so you pay for convenience and accept vendor lock-in. On a neocloud GPU provider, you self-host the orchestrator (Kubeflow, ZenML, or Metaflow) and pay only for the GPU compute time your pipeline steps actually run. The trade-off is operational overhead for the control plane, but the per-GPU hourly cost is often 40-60% lower on neoclouds for the same NVIDIA silicon. You also get full control over spot scheduling, node affinity, and checkpoint storage.

What GPU should I use for multi-stage fine-tuning pipelines?

It depends on the model size. For LoRA fine-tuning of 7B-13B models, a single A100 80GB or H100 SXM5 per pipeline step is sufficient. For 70B+ models with full FSDP, use multi-node H100 SXM5 or H200 SXM5 steps. For eval-only steps (forward pass, MMLU scoring), A100 80GB or even RTX 4090 nodes are cost-effective since eval is memory-read-heavy, not compute-bound. Use spot pricing for long training steps and on-demand for short eval and registry-push steps to keep costs predictable.