MLOps Pipeline Orchestration on GPU Cloud: Kubeflow, ZenML, and Metaflow for AI Training and Fine-Tuning (2026)

Written by Mitrasish, Co-founder | May 4, 2026

Running ML pipelines on neocloud GPU providers is a different problem than using SageMaker Pipelines or Vertex AI Pipelines. You own the orchestrator, you control the node pools, and you decide how to handle spot preemption and checkpoint recovery. This post compares Kubeflow Pipelines, ZenML, and Metaflow for self-hosted MLOps on GPU cloud, covering cluster architecture, spot scheduling, checkpoint volumes, a concrete 4-stage LoRA fine-tuning DAG, and the cost math for mixing on-demand and spot GPU nodes across pipeline stages.

Why ML Pipelines on Neocloud GPUs Are Different

SageMaker Pipelines and Vertex AI Pipelines are convenient because the orchestrator, container registry, artifact store, IAM, and compute are all wired together by the provider. You pay for that convenience in two ways: per-step orchestration charges on top of compute costs, and lock-in to a specific model registry, storage format, and scaling policy.

On a neocloud GPU provider, you self-host the orchestrator and pay only for the GPU compute time your steps actually consume. There is no per-step surcharge. More importantly, you can use spot GPU pricing for training steps. At the rates listed later in this post, that cuts the hourly GPU cost of long-running fine-tuning jobs by roughly 50-75%, depending on GPU model and provider availability.

The trade-off is that you own the control plane. The Kubeflow Pipelines server, ZenML tracking server, or Metaflow metadata service needs to stay running between pipeline runs. That is real ops overhead, especially for small teams. For a broader view of what Kubernetes GPU orchestration looks like at the cluster level, the Kubernetes GPU orchestration guide covers DRA, KAI Scheduler, and Grove as the scheduling layer underneath MLOps pipelines.

The practical implication: neocloud MLOps makes sense for teams that run pipelines regularly and where the GPU compute cost dominates (usually true for any training or fine-tuning job over an hour). It is harder to justify for teams running only occasional one-off jobs, where the ops overhead per run is high relative to the savings.

Kubeflow Pipelines vs ZenML vs Metaflow: Architecture Trade-Offs

| Orchestrator | Control Plane Overhead | GPU Resource Control | Local Dev Experience | Best For |
| --- | --- | --- | --- | --- |
| Kubeflow Pipelines | High (full KFP stack in K8s) | Fine-grained per Op | Poor (needs real cluster) | Platform teams building shared MLOps infra |
| ZenML | Medium (ZenML server + pluggable orchestrator) | Good via step settings | Good (local orchestrator for testing) | Teams migrating from SageMaker/Vertex |
| Metaflow | Low (metadata service only) | GPU via @kubernetes decorator | Excellent (steps run locally) | Data-scientist-led teams, Netflix-style DAGs |

Kubeflow Pipelines is the most Kubernetes-native option. Every pipeline step becomes a Kubernetes pod with explicit resource requests. You can set an accelerator limit, a node selector, and tolerations on individual tasks (in KFP SDK v2, through the task methods and the kfp-kubernetes extension), which means a training step can target spot H100 nodes while the eval step targets on-demand A100 nodes in the same pipeline run. The cost: you need to manage the KFP API server, frontend, MySQL backend, and MinIO artifact store in your cluster. For platform teams managing MLOps infra for multiple ML teams, that overhead pays off through reuse.

ZenML acts as an abstraction layer over orchestrators rather than being one itself. You can point the same ZenML pipeline at a local process, a Kubernetes cluster, or Airflow, switching backends without rewriting step code. The ZenML server handles experiment tracking, artifact lineage, and model versioning. This is the path of least resistance for teams that built pipelines on SageMaker or Vertex and want to move to neocloud GPU compute without rebuilding from scratch.

Metaflow is the simplest operationally. Originally built at Netflix for data science workflows, it uses Python decorators to annotate steps with resource requirements. The @kubernetes decorator routes a step to a GPU node on your cluster. Local development is excellent: steps run locally without a cluster, and you push to Kubernetes only when you're ready to scale. The trade-off is less flexibility in DAG topology and fewer built-in integrations compared to Kubeflow.

Setting Up a Kubernetes Cluster on GPU Cloud for Multi-Stage Training DAGs

The right architecture for a multi-stage training pipeline uses two node pools: an on-demand CPU pool for the MLOps control plane and lightweight steps (data prep, registry push), and a spot GPU pool for training and eval steps where preemption is tolerable.

Label on-demand nodes with workload=control and GPU training nodes with workload=training. For eval steps that need a GPU but where rerunning is cheap (under 1 hour), you can add a separate eval node pool with cheaper GPU models. Add a spot taint to the GPU pool so only steps that explicitly tolerate it get scheduled there:

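```yaml
# Example: GPU training step node selector + spot toleration
# (pod spec fragment: resources sits under a container entry,
# nodeSelector and tolerations sit under the pod spec)
resources:
  limits:
    nvidia.com/gpu: "1"
    memory: "120Gi"
    cpu: "16"
nodeSelector:
  workload: training
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```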

For setup steps, follow your GPU cloud provider's docs for cluster provisioning and NVIDIA device plugin configuration. The device plugin DaemonSet is required on GPU nodes for the nvidia.com/gpu resource to be visible to the scheduler.
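To make the spot taint rule concrete, here is a minimal Python sketch of how a `NoSchedule` taint is matched against a pod's tolerations. It follows the documented Kubernetes matching rules in simplified form. The function names `tolerates` and `schedulable` and the dict layout are illustrative assumptions, not a Kubernetes client API.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint (simplified)."""
    # An empty effect on the toleration matches any taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return key == "" or key == taint["key"]
    # Equal requires both key and value to match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits a node only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


spot_taint = {"key": "spot", "value": "true", "effect": "NoSchedule"}
training_tolerations = [
    {"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
]

print(schedulable(training_tolerations, [spot_taint]))  # training step with the toleration
print(schedulable([], [spot_taint]))                    # step without the toleration
```

A step without the toleration never lands on the spot pool, which is what keeps control-plane and registry steps off preemptible nodes.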

For distributed training stages that need NVLink-connected multi-GPU nodes (70B+ models with FSDP or tensor parallelism), provision NVLink-enabled 8xH100 SXM5 nodes for those steps specifically. The distributed LLM training guide covers multi-node FSDP and NCCL tuning for those configurations.

Pricing context for node pool selection (rates as of 04 May 2026):

  • H100 SXM5: $3.10/hr on-demand, $0.80/hr spot (best for LoRA fine-tuning, SFT)
  • A100 80GB: $1.04/hr on-demand (strong for eval steps)
  • RTX 4090: $0.53/hr on-demand (cost-effective for lightweight eval and inference testing)

Wiring GPU Pods, Spot Nodes, and Checkpoint Volumes into a Reproducible Pipeline

Spot node preemption is the main reliability challenge in GPU pipeline design. When a spot node is reclaimed, the training pod dies mid-epoch. Without checkpoints, the step restarts from scratch. With checkpoints stored on a PVC backed by network storage, the step restarts from the last saved checkpoint.

The critical rule: use PVC-backed network storage for checkpoints, not hostPath volumes. HostPath volumes are local to the node. When the spot node is terminated, the volume disappears with it.
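A minimal sketch of the resume pattern this rule enables, assuming the PVC is mounted at a directory such as `/checkpoints`. Checkpoints here are small JSON files purely for illustration; a real job would write model and optimizer state through its training framework. The helper names `save_checkpoint`, `load_latest`, and `train` are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a preemption mid-write never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step-{step:08d}.json"
    # Write to a temp file in the same directory, then rename into place.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, final_path)
    return final_path


def load_latest(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest complete checkpoint, or None on a fresh start."""
    if not ckpt_dir.exists():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step-*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())


def train(ckpt_dir: Path, total_steps: int, every: int) -> int:
    """Toy training loop; returns the step it resumed from (0 on a fresh start)."""
    latest = load_latest(ckpt_dir)
    start = latest["step"] if latest else 0
    for step in range(start + 1, total_steps + 1):
        # ... one optimizer step would run here ...
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, {"loss": 1.0 / step})
    return start
```

Because the directory lives on network storage, a retried pod on a different spot node sees the same files and `train` picks up from the last completed save.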

```python
# Kubeflow Pipelines v2: checkpoint PVC using kfp-kubernetes
# Install: pip install kfp kfp-kubernetes
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

@dsl.pipeline(name="fine-tuning-pipeline")
def fine_tuning_pipeline():
    # train_component is assumed to be a @dsl.component defined elsewhere.
    pvc_task = kubernetes.CreatePVC(
        pvc_name="checkpoint-pvc",
        access_modes=["ReadWriteOnce"],
        size="100Gi",
    )
    train_task = train_component(checkpoint_path="/checkpoints")
    train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    # Mount the network-backed PVC so checkpoints survive node termination.
    kubernetes.mount_pvc(train_task, pvc_name=pvc_task.outputs["name"], mount_path="/checkpoints")
    # Pin the step to the tainted spot GPU pool.
    kubernetes.add_node_selector(train_task, label_key="workload", label_value="training")
    kubernetes.add_toleration(train_task, key="spot", operator="Equal", value="true", effect="NoSchedule")
```
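```python
# ZenML: step with GPU node selector and spot toleration (Kubernetes orchestrator)
from zenml import step
from zenml.config import ResourceSettings
from zenml.integrations.kubernetes.flavors.kubernetes_orchestrator_flavor import KubernetesOrchestratorSettings

# Pod-level scheduling goes in pod_settings; verify field names against your ZenML version.
k8s_settings = KubernetesOrchestratorSettings(pod_settings={
    "node_selectors": {"workload": "training"},
    "tolerations": [{"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}],
})

# "resources" is a top-level settings key, separate from the orchestrator settings.
@step(settings={"orchestrator": k8s_settings, "resources": ResourceSettings(gpu_count=1, memory="120GB")})
def train_lora(base_model: str, dataset: str) -> str:
    # training logic - save checkpoint to ZenML artifact store
    ...
```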
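```python
# Metaflow: GPU step with checkpoint
# @checkpoint comes from the metaflow-checkpoint extension: pip install metaflow-checkpoint
from metaflow import FlowSpec, step, kubernetes, checkpoint, retry, current

class FineTuningFlow(FlowSpec):
    @retry(times=3)
    @checkpoint
    @kubernetes(gpu=1, cpu=16, memory=120000, node_selector={"workload": "training"},
                tolerations=[{"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}])
    @step
    def train(self):
        # Resume from the last saved checkpoint if a previous attempt was preempted.
        # load_checkpoint and init_model are placeholders for your own code;
        # attribute names follow the metaflow-checkpoint docs, so verify them for your version.
        if current.checkpoint.is_loaded:
            model = load_checkpoint(current.checkpoint.directory)
        else:
            model = init_model()
        # ... training logic; call current.checkpoint.save() periodically
        self.next(self.evaluate)
```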

For spot-heavy pipelines, checkpoint every N steps (e.g., every 500 training steps), not just at the end of each epoch. A preemption loses, on average, about half of one checkpoint interval. For a 4-hour training run on a Spheron H100 spot node with a 30% preemption probability, checkpointing only at epoch boundaries means an average of 1.5 hours of lost work per preemption event (a figure that corresponds to epochs of roughly 3 hours). Checkpointing every 500 steps reduces that to minutes. The spot GPU training case study has real numbers on checkpoint overhead and recovery time from a 70B training run.
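The arithmetic behind that recommendation, as a rough sketch: if a preemption is equally likely at any point between two checkpoints, the expected lost work is half the checkpoint interval. The 3-hour epoch and the throughput of 2 seconds per training step below are illustrative assumptions, not measurements from the case study.

```python
def expected_loss_hours(checkpoint_interval_hours: float) -> float:
    """Expected work lost per preemption, assuming it strikes uniformly within an interval."""
    return checkpoint_interval_hours / 2


# Assumption: one epoch takes about 3 hours of the 4-hour run.
per_epoch = expected_loss_hours(3.0)

# Assumption: 2 seconds per training step, checkpoint every 500 steps.
interval_hours = 500 * 2 / 3600
per_500_steps = expected_loss_hours(interval_hours)

print(f"epoch checkpoints:    {per_epoch:.2f} h lost per preemption")
print(f"500-step checkpoints: {per_500_steps * 60:.1f} min lost per preemption")
```

Shorter intervals cost more checkpoint writes, so the right N depends on how long a save takes relative to a training step.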

Fine-Tuning Pipeline Walkthrough: Data Prep, LoRA Training, Eval, Registry Push

Here is a concrete 4-stage pipeline for fine-tuning a 7B-13B model using LoRA. Each stage has different compute requirements and spot tolerance:

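| Stage | Node Type | GPU | Mode | Est. Duration | Est. Cost |
| --- | --- | --- | --- | --- | --- |
| Data prep | CPU | None | On-demand | 30 min | ~$0.05 |
| LoRA training (8B) | GPU | H100 SXM5 | Spot | 4 hr | ~$3.20 |
| Eval (MMLU/HellaSwag) | GPU | A100 80GB | On-demand | 45 min | ~$0.78 |
| Registry push | CPU | None | On-demand | 5 min | ~$0.01 |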

Stage 1 (data prep) loads and tokenizes the dataset on a CPU node. No GPU required. Stage 2 (LoRA training) runs PEFT on a spot H100 or A100, saving adapter weights and checkpoints to the PVC. Stage 3 (eval) runs the lm-eval harness on MMLU and HellaSwag on an on-demand A100. Stage 4 (registry push) merges the adapter into the base model and pushes to Hugging Face Hub or a private registry on a CPU node.
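The per-stage estimates can be reproduced with a few lines of Python. The GPU rates are the ones quoted in this post. The CPU rate of $0.10/hr is an assumption back-derived from the ~$0.05 data-prep estimate, and the `Stage` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass

# GPU rates quoted in this post (USD per hour, 04 May 2026).
H100_SPOT = 0.80
H100_ON_DEMAND = 3.10
A100_ON_DEMAND = 1.04
# Assumption: CPU node rate back-derived from the ~$0.05 data-prep estimate.
CPU_ON_DEMAND = 0.10


@dataclass
class Stage:
    name: str
    hours: float
    rate: float

    @property
    def cost(self) -> float:
        return self.hours * self.rate


stages = [
    Stage("data_prep", 0.5, CPU_ON_DEMAND),
    Stage("lora_train", 4.0, H100_SPOT),
    Stage("eval", 0.75, A100_ON_DEMAND),
    Stage("registry_push", 5 / 60, CPU_ON_DEMAND),
]

for s in stages:
    print(f"{s.name:<14} {s.hours:>5.2f} h  ${s.cost:.2f}")

total = sum(s.cost for s in stages)
# Same pipeline with the training stage moved to on-demand H100.
all_on_demand = total - stages[1].cost + 4.0 * H100_ON_DEMAND
print(f"total with spot training: ${total:.2f}")
print(f"total all on-demand:      ${all_on_demand:.2f}")
```

Training dominates the bill in both cases, which is why it is the stage worth moving to spot.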

```python
# 4-stage Kubeflow fine-tuning pipeline (KFP SDK v2)
# Install: pip install kfp kfp-kubernetes
# The *_component functions are assumed to be @dsl.component definitions elsewhere.
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

@dsl.pipeline(name="lora-fine-tuning")
def lora_pipeline(base_model: str = "meta-llama/Meta-Llama-3-8B", dataset: str = "gs://my-bucket/train.jsonl"):
    # Stage 1: Data prep (CPU only)
    prep_task = data_prep_component(dataset=dataset)
    prep_task.set_cpu_limit("8").set_memory_limit("32G")

    # Stage 2: LoRA training (spot H100)
    train_task = lora_train_component(
        processed_data=prep_task.outputs["output"],
        base_model=base_model
    )
    train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    kubernetes.add_node_selector(train_task, label_key="workload", label_value="training")
    kubernetes.add_toleration(train_task, key="spot", operator="Equal", value="true", effect="NoSchedule")
    train_task.after(prep_task)

    # Stage 3: Eval (on-demand, cheaper GPU OK; no spot toleration)
    eval_task = eval_component(adapter_path=train_task.outputs["adapter"])
    eval_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    kubernetes.add_node_selector(eval_task, label_key="workload", label_value="eval")
    eval_task.after(train_task)

    # Stage 4: Registry push (CPU)
    push_task = registry_push_component(
        adapter_path=train_task.outputs["adapter"],
        eval_score=eval_task.outputs["score"]
    )
    push_task.after(eval_task)
```

For multi-GPU training stages (70B+ models, or 8B with aggressive throughput targets), see the NCCL tuning guide for environment variable configuration and topology-aware all-reduce setup.

Cost-Aware Scheduling: Mixing On-Demand and Spot GPUs Across Pipeline Stages

Pipeline stages fall into three categories based on preemption tolerance:

Always spot: Long-running training steps with checkpoint recovery. These are the most expensive steps and get the largest savings from spot pricing. A 4-hour H100 SXM5 training step costs $3.20 on spot vs $12.40 on-demand, a 74% cost reduction.

Sometimes spot: Eval steps where a rerun takes under 1 hour. If the 45-minute eval step is preempted, you lose up to 45 minutes of compute. Whether that is acceptable depends on how often your spot pool is preempted and whether your pipeline has a hard time constraint.

Never spot: Registry push, metadata writes, final aggregation steps, and anything that writes to an external system. These steps are short, cheap on on-demand, and have side effects that are hard to safely retry.
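The three categories can be written down as a small policy function. This is a sketch under stated assumptions: the `SpotPolicy` enum, the `classify` helper, and its three inputs are hypothetical names, and the final fallback (long steps without checkpoint recovery stay on-demand) is an assumption that goes slightly beyond the text above.

```python
from enum import Enum


class SpotPolicy(Enum):
    ALWAYS = "always spot"
    SOMETIMES = "sometimes spot"
    NEVER = "never spot"


def classify(has_external_side_effects: bool, resumes_from_checkpoint: bool,
             rerun_hours: float) -> SpotPolicy:
    """Map a stage's properties onto the three preemption-tolerance categories."""
    if has_external_side_effects:
        # Registry pushes and metadata writes are hard to retry safely.
        return SpotPolicy.NEVER
    if resumes_from_checkpoint:
        return SpotPolicy.ALWAYS
    if rerun_hours < 1.0:
        return SpotPolicy.SOMETIMES
    # Assumption: a long step with no checkpoint recovery stays on-demand.
    return SpotPolicy.NEVER


print(classify(False, True, 4.0))    # LoRA training
print(classify(False, False, 0.75))  # eval
print(classify(True, False, 0.1))    # registry push
```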

Pricing comparison for GPU nodes (rates as of 04 May 2026):

| GPU | On-Demand | Spot | Savings | Best For |
| --- | --- | --- | --- | --- |
| H100 SXM5 | $3.10/hr | $0.80/hr | 74% | LoRA fine-tuning, SFT |
| H200 SXM5 | $2.51/hr | $1.19/hr | 53% | 70B+ FSDP training |
| A100 80GB PCIe | $1.04/hr | $1.14/hr | N/A* | Eval, inference steps |
| B200 SXM6 | N/A | $2.12/hr | N/A | Latest-gen training |
| RTX 4090 | $0.53/hr | N/A | N/A | Lightweight eval |

*A100 80GB PCIe spot ($1.14/hr) is higher than on-demand ($1.04/hr) at current Spheron rates. This is accurate, not a typo. Reservation pressure can make spot more expensive than on-demand for popular GPU models. Verify live pricing before building cost assumptions around A100 spot.

Pricing fluctuates based on GPU availability. The prices above are based on 04 May 2026 and may have changed. Check current GPU pricing for live rates.
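Since rates move, it can help to recompute the savings column rather than hard-code it. The sketch below uses the 04 May 2026 rates from the table above; the `spot_savings` helper is a hypothetical name, and a negative result flags cases like the A100 where spot currently costs more than on-demand.

```python
from typing import Optional

# (on-demand, spot) in USD per hour from the pricing table; None means not listed.
rates = {
    "H100 SXM5": (3.10, 0.80),
    "H200 SXM5": (2.51, 1.19),
    "A100 80GB PCIe": (1.04, 1.14),
    "B200 SXM6": (None, 2.12),
    "RTX 4090": (0.53, None),
}


def spot_savings(on_demand: Optional[float], spot: Optional[float]) -> Optional[float]:
    """Fractional saving of spot over on-demand; negative when spot costs more."""
    if on_demand is None or spot is None:
        return None
    return 1 - spot / on_demand


for gpu, (on_demand, spot) in rates.items():
    saving = spot_savings(on_demand, spot)
    print(f"{gpu:<16} {'N/A' if saving is None else f'{saving:.0%}'}")
```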

In Kubeflow Pipelines, you can implement cost-aware routing by passing a pipeline parameter for estimated step duration and using dsl.If to route short steps to on-demand nodes and long steps to spot. This is more complex to implement than static node selectors, but it lets the pipeline self-select the cheaper path for variable-length training runs.
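The routing decision itself is simple enough to express in plain Python, for example in a lightweight step that runs before the branch. This is a sketch, not a KFP API: `choose_mode` is a hypothetical helper, and the 1-hour threshold and 10% rework allowance for preemptions are assumptions to tune against your own preemption history.

```python
def choose_mode(estimated_hours: float, on_demand_rate: float, spot_rate: float,
                min_spot_hours: float = 1.0, rework_fraction: float = 0.1) -> str:
    """Pick 'spot' or 'on-demand' for one step.

    rework_fraction is the assumed share of the step recomputed after
    preemptions; steps shorter than min_spot_hours stay on on-demand nodes.
    """
    if estimated_hours < min_spot_hours:
        return "on-demand"
    spot_cost = estimated_hours * (1 + rework_fraction) * spot_rate
    on_demand_cost = estimated_hours * on_demand_rate
    return "spot" if spot_cost < on_demand_cost else "on-demand"


print(choose_mode(4.0, 3.10, 0.80))   # long H100 training step
print(choose_mode(0.75, 3.10, 0.80))  # short step, below the threshold
print(choose_mode(4.0, 1.04, 1.14))   # A100, where spot is currently pricier
```

The returned string can then drive which branch applies the spot node selector and toleration.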

Integrating with vLLM, Triton, and Ray Serve for Auto-Deploy After Training

A natural extension of the 4-stage pipeline is a conditional deploy step: if the eval score clears a threshold, automatically push the model to a serving endpoint.

```python
# Kubeflow: conditional deploy after eval threshold (KFP SDK v2)
# dsl.Condition is replaced by dsl.If in KFP SDK v2
# This block belongs inside the lora_pipeline function body, after push_task is defined.
# vllm_deploy_component is assumed to be a @dsl.component defined elsewhere.
from kfp import dsl
from kfp import kubernetes  # provided by the kfp-kubernetes package

with dsl.If(eval_task.outputs["score"] > 0.75, name="deploy-condition"):
    deploy_task = vllm_deploy_component(
        model_path=push_task.outputs["registry_path"]
    )
    deploy_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    # On-demand inference pool: no spot toleration on the deploy step.
    kubernetes.add_node_selector(deploy_task, label_key="workload", label_value="inference")
    deploy_task.after(push_task)
```

For vLLM: the deploy step runs vllm serve <merged-model-path> on an on-demand H100 node. Use an on-demand node for inference, not spot, since serving interruptions affect live traffic. The H100 is available on Spheron in both PCIe and SXM5 variants; choose between them based on your memory bandwidth requirements.

For Triton: the deploy step converts the model to TensorRT-LLM engine format, pushes the engine to the Triton model repository, and triggers a model reload via the Triton management API. The conversion step is GPU-intensive and should run on a separate on-demand node before the serving step.

For Ray Serve: use serve.run() with a vLLM deployment class to update a running endpoint in-place. Ray Serve handles the rolling update to avoid dropping requests during the model swap. The pipeline step just calls the Serve HTTP API to trigger the update.

The key constraint: keep the deploy step on an on-demand node and treat it as a "never spot" step. A preemption during model upload or Triton reload leaves the serving endpoint in an inconsistent state.

When to Pick Each: Decision Matrix by Team Size, Stack, and Workload

| Team Type | Orchestrator | Why |
| --- | --- | --- |
| Solo researcher / small data science team | Metaflow | Minimal ops, excellent local dev, GPU via decorator |
| Platform/MLOps team building shared infra for multiple ML teams | Kubeflow Pipelines | Per-Op GPU control, shared cluster, UI for experiment tracking |
| Team currently on SageMaker Pipelines migrating to neocloud | ZenML | Code reuse, pluggable backends, fast migration path |
| Team wanting to avoid Kubernetes knowledge requirement | ZenML or Metaflow | Both hide most K8s complexity behind Python APIs |
| Team needing complex DAG branching and custom UI | Kubeflow | dsl.If, dsl.ParallelFor, full KFP UI |
| Team needing fastest iteration cycle (dev to cloud in minutes) | Metaflow | Local execution + @kubernetes = minimal friction |

Metaflow works best for data-science-led teams where the ML engineers write training code and the platform team is small or non-existent. The decorator model fits naturally into notebooks and scripts. The main limitation is DAG flexibility: Metaflow's branching and joining is less expressive than KFP's DSL.

Kubeflow Pipelines is the right call when you have a platform team maintaining shared MLOps infrastructure for multiple ML teams. The investment in KFP setup pays off when 5-10 teams are running pipelines on the same cluster. GPU resource control per Op is the strongest of the three tools, which matters when you have heterogeneous node pools with different GPU models.

ZenML makes the most sense for teams with existing pipelines on SageMaker or Vertex that need to move to neocloud without a rewrite. ZenML's orchestrator abstraction means you can migrate step-by-step, running some steps locally and some on Kubernetes while you port the rest. For budget-sensitive teams doing evaluation work, rent A100 on Spheron for eval steps while keeping training on H100 SXM5 spot.

Summary

Kubeflow, ZenML, and Metaflow each solve a different version of the MLOps orchestration problem. Kubeflow is the most powerful and most demanding to operate. Metaflow is the simplest with the best developer experience. ZenML sits in the middle with the best migration story from managed services. For GPU cloud deployments specifically, all three can target spot nodes for training steps and on-demand nodes for eval and deploy, which is the main cost lever.


ML pipeline orchestration on GPU cloud is a real alternative to managed services: you trade some ops overhead for full control over scheduling, pricing, and tooling. Spheron provides the underlying GPU compute layer: spot H100 and H200 nodes for training stages, on-demand A100 nodes for eval, and NVLink multi-GPU nodes for distributed training DAGs.
