Engineering

Federated Learning on GPU Cloud: Deploy Flower, NVIDIA FLARE, and OpenFL for Privacy-Preserving AI Training (2026 Guide)

Written by Mitrasish, Co-founder · May 16, 2026
Federated Learning GPU Cloud · Flower Framework Deployment · NVIDIA FLARE Tutorial · OpenFL Federated Learning · Privacy Preserving LLM Training · Federated LoRA Fine-Tuning · Differential Privacy ML · EU AI Act Data Governance

Regulators are forcing the data residency question, and "just send us the data" is no longer a viable answer for healthcare, finance, and public sector AI teams. GDPR enforcement actions, HIPAA guidance on clinical AI, and the EU AI Act's Article 10 data governance requirements have all converged on the same constraint: training data cannot freely cross jurisdictional boundaries. Federated learning is the practical response to that constraint. This post covers everything you need to deploy a production FL pipeline in 2026: framework selection, hands-on Flower 1.x configuration for federated LoRA fine-tuning of a 7B model, secure aggregation with differential privacy, GPU sizing, network design, and cost analysis across Spheron's multi-region pod infrastructure.

This is not a survey of federated learning theory. It is an engineering guide for teams that have hit the data residency wall and need to actually run federated fine-tuning across distributed GPU pods. For teams using confidential GPU computing for inference on regulated data, FL is the complementary technique for the training side: CC mode protects inference inputs; FL eliminates the need to centralize training data in the first place.

Why Federated Learning Is Back

The first wave of federated learning hype (2017-2020) was about training on-device models across millions of phones without uploading raw data to Google's servers. The technology worked but the use cases were narrow: next-word prediction, keyboard autocomplete, wake-word detection on constrained hardware. Researchers published benchmark results on MNIST and CIFAR-10. Most practitioners tuned it out.

The 2026 wave is different. The use cases are real and the legal pressure is specific:

EU AI Act Article 10 requires that training data for high-risk AI systems be subject to data governance practices ensuring data quality and relevance. For cross-border healthcare AI, this creates a direct tension: pooling patient data across EU member states for centralized training requires data transfer agreements, legal reviews, and data processing addendums with every participating hospital. That process takes months. FL sidesteps the centralized transfer problem by keeping raw data at each site.

HIPAA AI guidance from HHS (2024-2025) clarified that covered entities sharing de-identified patient data for AI training still carry residual re-identification risk obligations. Clinical teams in the US have responded by demanding training architectures that do not require data to leave their institutional boundary. FL is the answer most procurement officers will accept. For the full regulatory compliance infrastructure, including audit logging and risk classification, see the EU AI Act compliance guide.

China's DSL (Data Security Law) and India's DPDP Act both impose significant restrictions on cross-border transfer of "important data," which regulators are interpreting broadly enough to include training datasets for AI models in regulated verticals. Multi-national teams training on data generated in China or India need FL or they need a local cluster per jurisdiction.

The LLM fine-tuning shift. The technical reason FL is viable in 2026 when it was not in 2019 is that the dominant fine-tuning method is now LoRA, not full parameter updates. A 7B model has 7 billion parameters, but a LoRA adapter with rank 16 has roughly 4-20 million trainable parameters. Per-round communication drops from 14 GB (full BF16 model) to 8-40 MB (LoRA delta). That makes FL practical over standard internet connections between cloud pods.
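
A quick back-of-envelope check of those numbers (a sketch with assumed dimensions - hidden size 4096, 32 layers, LoRA on the four attention projections; real architectures, especially GQA models, will differ):

python
# Rough LoRA payload estimate for a 7B-class model. All dimensions are assumptions
# for illustration; check your model's config for the real values.
hidden = 4096
layers = 32
rank = 16
target_modules = 4                          # q_proj, k_proj, v_proj, o_proj

# Each adapted projection adds two low-rank matrices: A (rank x hidden) and B (hidden x rank).
lora_params = layers * target_modules * (2 * rank * hidden)
print(f"Trainable LoRA params: {lora_params / 1e6:.1f}M")        # ~16.8M

payload_mb = lora_params * 2 / 1e6                               # BF16 = 2 bytes per param
full_model_gb = 7e9 * 2 / 1e9
print(f"Per-round LoRA payload: ~{payload_mb:.0f} MB")           # ~34 MB
print(f"Full BF16 model: ~{full_model_gb:.0f} GB")               # ~14 GB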

Two framework developments made production FL deployments more stable through 2025 and into 2026: Flower 1.10 (July 2024) introduced the ClientApp interface, which decoupled transport from training logic and fixed gRPC channel reconnection after client dropout, and FLARE 2.6 added streaming-based model transfer via native tensor transfer and object container streaming, reducing memory overhead for large model updates. Both are covered in the framework comparison below.

FL Architecture Decision: Horizontal, Vertical, or Federated Fine-Tuning

Horizontal Federated Learning

Horizontal FL is the most common pattern. All clients share the same feature space and model architecture but have different data samples. Four hospitals each have patient records with the same schema (age, vitals, diagnoses, lab results), but different patients. Each trains a local model on its private dataset, then sends model updates (not data) to the aggregator. The aggregator merges the updates using FedAvg or a variant, then broadcasts the updated global model back to clients for the next round.

This is the pattern most FL deployments use. It maps naturally onto any scenario where multiple organizations each own a slice of a shared data distribution: hospital networks for clinical AI, retail chains for demand forecasting, regional banks for fraud detection.
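
The aggregation step itself is simple arithmetic. A minimal sketch of example-count-weighted FedAvg over NumPy arrays (an illustrative helper, not Flower's internal implementation):

python
import numpy as np

def fedavg(client_updates):
    """client_updates: list of (num_examples, [np.ndarray, ...]) tuples, one per client."""
    total_examples = sum(n for n, _ in client_updates)
    num_tensors = len(client_updates[0][1])
    merged = []
    for i in range(num_tensors):
        # Weight each client's tensor by its share of the total training examples.
        weighted_sum = sum(n * weights[i] for n, weights in client_updates)
        merged.append(weighted_sum / total_examples)
    return merged

# Two clients with 1,000 and 3,000 local examples: the second carries 3x the weight.
merged = fedavg([(1000, [np.ones(4)]), (3000, [np.zeros(4)])])
print(merged[0])   # [0.25 0.25 0.25 0.25]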

Vertical Federated Learning

Vertical FL applies when different organizations hold different features for the same entities. A bank has transaction history; a telecom has call patterns; a retailer has purchase behavior. They want to jointly train a model using all three feature sets without any party revealing its raw features to the others.

Vertical FL requires Private Set Intersection (PSI) protocols to identify common entities across parties without exposing the full ID space. PSI is cryptographically complex, and implementing it incorrectly creates a real security failure: a poorly implemented PSI can leak entity membership information. OpenFL's PSI module provides a vetted implementation, but it adds significant operational overhead. Vertical FL is rarely the right architecture for LLM fine-tuning, where the model input structure is homogeneous across sites. It appears in feature-rich tabular prediction tasks, not text generation.

Federated Fine-Tuning of LLMs

The practical pattern for 2026 is federated fine-tuning: all clients share a frozen base model (Llama 4, Qwen 3, Mistral), each site trains LoRA adapters (or another PEFT module) on its local private data, and the aggregator merges the adapter weights using FedAvg. The frozen base model weights never move; only the LoRA delta updates do.

This pattern works because LoRA adapters are tiny relative to the base model, and all sites can agree on the same base model checkpoint from a public registry (HuggingFace Hub). No proprietary base model transfer is needed; only the site-specific fine-tuning knowledge (encoded in the LoRA adapters) is shared.

| Architecture | Data Structure | Typical Use Case | Recommended Framework |
|---|---|---|---|
| Horizontal FL | Same schema, different samples | Hospital networks, multi-branch retail, multi-bank fraud | Flower 1.x, NVIDIA FLARE |
| Vertical FL | Different features, same entities | Cross-industry tabular prediction | OpenFL (with PSI module) |
| Federated fine-tuning | Shared base model, private text data | Clinical LLM, legal AI, multi-jurisdiction NLP | Flower 1.x (LoRA-native), NVIDIA FLARE |

Framework Comparison: Flower, NVIDIA FLARE, and OpenFL

| Framework | Aggregation Strategies | Secure Aggregation Built-in | Admin Interface | Target Workload | License |
|---|---|---|---|---|---|
| Flower 1.x | FedAvg, FedProx, FedAdam, FedMedian, custom | No (bring Opacus/PySyft) | CLI + custom callbacks | Any PyTorch/JAX model | Apache 2.0 |
| NVIDIA FLARE 2.6 | FedAvg, FedProx, SCAFFOLD, cyclic | Yes (SA module built-in) | FLARE admin console + REST | Enterprise, HIPAA workloads | Apache 2.0 |
| OpenFL | FedAvg, FedProx, custom | Partial (workspace-based) | Workspace CLI | Healthcare, TensorFlow, Intel | Apache 2.0 |

Flower 1.x

Flower (Fast Library for Federated Learning) is the Python-native choice. You implement a NumPyClient subclass on each client, define a Strategy on the server (FedAvg is one line), and the framework handles gRPC transport, round orchestration, and metric aggregation.

Flower 1.10 (July 2024) introduced the ClientApp interface, which decouples transport from training logic and fixed gRPC channel reconnection after client dropout. Subsequent releases through 1.29 (April 2026) refined backpressure handling for slow clients and improved metrics aggregation. The older start_numpy_client() call still works but is deprecated; new deployments should use ClientApp and the flwr CLI runner.

Flower does not bundle secure aggregation or differential privacy. You add Opacus for DP noise injection at the client and optionally PySyft's SecAgg protocol for cryptographic secure aggregation. This gives flexibility but means more integration work compared to FLARE.

NVIDIA FLARE 2.6

FLARE (Federated Learning Application Runtime Environment) is the enterprise-grade choice. It ships with a built-in Secure Aggregation module, an admin console for round monitoring and job management, a PKI infrastructure for mTLS between clients and server, and HIPAA-grade audit logging out of the box.

FLARE 2.6 added streaming-based model transfer via native tensor transfer (PyTorch tensors transmitted without serialization overhead) and object container streaming (incremental model transmission), reducing bandwidth and memory overhead by 30-60% for full model weight sharing scenarios. For LoRA-only FL, the bandwidth savings are less meaningful since LoRA deltas are already small, but the streaming transfer helps when clients send dense gradients.

The cost of FLARE's enterprise features is operational overhead. The FLARE admin console requires a separate provisioning step, PKI setup takes time on first deployment, and the NVFLARE job runner has a steeper learning curve than Flower's script-based interface. For teams that need audit trails for a compliance review, FLARE's built-in logging is worth the overhead. For a research team running FL experiments, Flower is faster to iterate with.

OpenFL (Intel/Linux Foundation)

OpenFL is Intel's federated learning framework, now under the Linux Foundation. It targets healthcare and supports both PyTorch and TensorFlow, which makes it relevant for teams still running TF-based clinical models.

OpenFL's workspace-based configuration model organizes collaborators (clients) and directors (servers) around an experiment workspace. Intel OpenVINO integration enables efficient inference on Intel Xeon CPUs and Gaudi accelerators after federated training. The community is smaller than Flower's, and the documentation assumes familiarity with Intel's toolchain.

Use OpenFL if you are already running on Intel Xeon + Gaudi nodes, need TensorFlow FL support, or require the PSI module for vertical FL. For pure PyTorch federated fine-tuning on GPU cloud, Flower 1.x is the faster path.

Hands-On: Federated LoRA Fine-Tuning of Llama 4 Across 4 GPU Pods

Provisioning the Pods

For this setup: 4 FL client pods and 1 aggregator pod. The aggregator handles weight averaging (CPU-bound), so it does not need a GPU. Each client trains LoRA adapters on its private dataset.

On Spheron, provision each pod independently using the region filter to place clients in different geographic locations. This is the key infrastructure property for data residency: each pod is isolated, and data on a pod never leaves that pod's storage. Only the LoRA adapter weights leave each client and go to the aggregator.

Pod sizing:

  • Aggregator: 8 vCPU, 32 GB RAM, no GPU required (FedAvg aggregation for 4 clients and 7B LoRA adapters takes under 10 seconds per round)
  • Each FL client: 1x A100 80G or H100 SXM5, depending on model size (see GPU sizing section below)

After provisioning, note the aggregator pod's public IP address. Every client will connect to this address on port 8080 (or your chosen gRPC port). Ensure the aggregator's firewall allows inbound gRPC on that port from all client IPs.
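
Before launching clients, it is worth confirming they can actually reach the aggregator. A plain TCP reachability check (the IP below is a placeholder; this does not validate TLS or the Flower protocol, only that the port is open):

python
import socket

AGGREGATOR_IP = "203.0.113.10"   # placeholder - use your aggregator pod's public IP
GRPC_PORT = 8080

try:
    with socket.create_connection((AGGREGATOR_IP, GRPC_PORT), timeout=5):
        print(f"TCP connection to {AGGREGATOR_IP}:{GRPC_PORT} succeeded")
except OSError as exc:
    print(f"Cannot reach aggregator: {exc} - check firewall rules on the aggregator pod")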

Flower Server Configuration

The aggregator runs a Flower server with a FedAvg strategy. min_fit_clients=3 means training proceeds if at least 3 of the 4 clients are available. This handles client dropout without stalling the entire training run.

python
import flwr as fl
from flwr.server.strategy import FedAvg

def weighted_average(metrics):
    """Aggregate training loss across clients (fit metrics)."""
    losses = [num_examples * m["loss"] for num_examples, m in metrics]
    total = sum(num_examples for num_examples, _ in metrics)
    return {"loss": sum(losses) / total}

def evaluate_weighted_average(metrics):
    """Aggregate evaluation loss across clients (evaluate returns only loss)."""
    losses = [num_examples * m["loss"] for num_examples, m in metrics]
    total = sum(num_examples for num_examples, _ in metrics)
    return {"loss": sum(losses) / total}

strategy = FedAvg(
    min_fit_clients=3,          # minimum clients needed to run a round
    min_available_clients=4,    # wait until 4 clients are registered
    min_evaluate_clients=3,
    fit_metrics_aggregation_fn=weighted_average,
    evaluate_metrics_aggregation_fn=evaluate_weighted_average,
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=strategy,
)

Flower Client (each pod)

Each client runs on its GPU pod. The client loads the frozen Llama 4 base model, wraps it with LoRA adapters via peft, and only transmits the LoRA weights each round. The frozen base model stays on the pod.

python
import flwr as fl
import torch
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import numpy as np

MODEL_NAME = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
AGGREGATOR_ADDRESS = "AGGREGATOR_PUBLIC_IP:8080"

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

class FederatedLoRAClient(fl.client.NumPyClient):
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        base_model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self.model = get_peft_model(base_model, lora_config)
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        # Load local private dataset - never leaves this pod
        self.train_dataset = load_dataset("json", data_files="/data/local_train.jsonl")["train"]

    def get_parameters(self, config):
        """Return only LoRA adapter weights, not the frozen base model."""
        lora_weights = []
        for name, param in self.model.named_parameters():
            if param.requires_grad:  # LoRA params only
                # Cast to float32 first: numpy has no bfloat16 dtype
                lora_weights.append(param.data.cpu().float().numpy())
        return lora_weights

    def set_parameters(self, parameters):
        """Load aggregated LoRA weights from the server."""
        lora_params = [
            (name, param)
            for name, param in self.model.named_parameters()
            if param.requires_grad
        ]
        for (name, param), new_weights in zip(lora_params, parameters):
            param.data = torch.tensor(new_weights).to(dtype=param.dtype, device=self.device)

    def fit(self, parameters, config):
        self.set_parameters(parameters)

        trainer = SFTTrainer(
            model=self.model,
            args=SFTConfig(
                output_dir="/tmp/fl_round",
                num_train_epochs=1,
                per_device_train_batch_size=4,
                gradient_accumulation_steps=4,
                learning_rate=2e-4,
                fp16=False,
                bf16=True,
                logging_steps=10,
            ),
            train_dataset=self.train_dataset,
        )
        trainer.train()

        return self.get_parameters(config={}), len(self.train_dataset), {
            "loss": trainer.state.log_history[-1].get("loss", 0.0) if trainer.state.log_history else 0.0,
        }

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        # Simple perplexity evaluation on a small held-out local set
        loss = 0.5  # placeholder; implement your eval loop here
        return float(loss), len(self.train_dataset), {"loss": float(loss)}

# Flower 1.x: use ClientApp rather than start_numpy_client (deprecated)
def client_fn(context):
    return FederatedLoRAClient().to_client()

app = fl.client.ClientApp(client_fn=client_fn)

if __name__ == "__main__":
    # Launch with the Flower CLI (recommended for Flower 1.x):
    #   flwr run . --run-config server-address=AGGREGATOR_PUBLIC_IP:8080
    # The ClientApp above (app) is the entry point; the flwr runner invokes client_fn automatically.
    # fl.client.start_client() is deprecated in Flower 1.x - do not use it.
    pass

Running a Round

Per-round bandwidth numbers for LoRA (rank 16, targeting q/v/k/o projections on a 7B model):

| Transfer Type | Size per Round | Notes |
|---|---|---|
| Full BF16 model weights (7B) | ~14 GB | Not used in LoRA FL |
| LoRA adapter (r=16, 4 target modules, 7B) | ~8-40 MB | Actual transmitted payload per client |
| LoRA adapter (r=16, 70B model) | ~80-200 MB | Still well within 10 GbE capacity |

With 4 clients each sending 40 MB per round, total aggregator ingress is 160 MB per round. At 1 Gbps (125 MB/s), that is under 2 seconds of network time. Standard 10 GbE is sufficient. You do not need InfiniBand for FL.
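
The same numbers as arithmetic, using the upper end of the r=16 estimate (assumed link speed; actual pod-to-pod bandwidth varies by region):

python
clients = 4
payload_mb = 40                # LoRA delta per client, upper end of the r=16 estimate
link_mbps = 1000               # 1 Gbps aggregator link

ingress_mb = clients * payload_mb                     # 160 MB of ingress per round
transfer_s = ingress_mb * 8 / link_mbps               # ~1.3 s of network time
print(f"Aggregator ingress: {ingress_mb} MB/round, ~{transfer_s:.1f} s at {link_mbps} Mbps")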

Secure Aggregation: DP, HE, and TEE-Backed FL

FL is not inherently private. Gradient inversion attacks (Geiping et al., 2020) demonstrated that training gradients can be used to reconstruct the input data, especially for small batch sizes. Sending LoRA deltas instead of raw gradients reduces but does not eliminate this risk. Production FL for healthcare and finance data needs additional privacy mechanisms on top of the basic architecture.

Differential Privacy

Differential privacy adds calibrated noise to client updates before they leave the client, providing a mathematical bound on how much information about any individual training sample can be inferred from the transmitted update. The privacy budget is expressed as epsilon: lower epsilon means stronger privacy but noisier updates and slower convergence.

For healthcare FL under HIPAA, target epsilon=3 to epsilon=5 with delta=1e-5. For financial or commercial data with lighter sensitivity requirements, epsilon=8 is a reasonable starting point. The relationship is not linear: moving from epsilon=8 to epsilon=3 typically costs 15-25% model quality at convergence, depending on dataset size.

python
from opacus import PrivacyEngine

# Per-client DP setup - runs on each FL client pod
privacy_engine = PrivacyEngine()
model, optimizer, train_dataloader = privacy_engine.make_private_with_epsilon(
    module=self.model,
    optimizer=optimizer,
    data_loader=train_dataloader,
    epochs=1,
    target_epsilon=5.0,
    target_delta=1e-5,
    max_grad_norm=1.0,  # clip norm; Flower's DifferentialPrivacyClientSideAdaptiveClipping
                         # can tune this automatically across rounds
)

In Flower, DifferentialPrivacyClientSideAdaptiveClipping wraps the server strategy and automatically adjusts clip norm across rounds based on the fraction of clients whose gradients were clipped. This is preferable to a fixed clip norm, which requires upfront tuning.
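
A sketch of wrapping the FedAvg strategy from the server example above (argument names follow the Flower 1.x DP wrappers, but the noise multiplier here is purely illustrative - derive it from your epsilon/delta target, and check the signature against your installed Flower version):

python
import flwr as fl
from flwr.server.strategy import DifferentialPrivacyClientSideAdaptiveClipping

# Wraps the plain FedAvg strategy defined earlier. With the client-side variant,
# each ClientApp also needs Flower's adaptiveclipping_mod added to its mods list.
dp_strategy = DifferentialPrivacyClientSideAdaptiveClipping(
    strategy,                  # the FedAvg instance from the server config above
    noise_multiplier=1.0,      # illustrative value, not a recommendation
    num_sampled_clients=4,     # clients expected to participate each round
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=dp_strategy,
)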

Homomorphic Encryption Tradeoffs

Homomorphic encryption (HE) allows the aggregator to compute the sum of encrypted client updates without decrypting them. The aggregation result, when decrypted by a shared key, matches what plaintext FedAvg would have produced. No party other than the clients can read individual updates.

The problem is compute overhead. HE on modern CPUs (Microsoft SEAL, OpenFHE) runs 100-1000x slower than plaintext arithmetic for the same operation. Aggregating LoRA updates from 4 clients takes microseconds in plaintext; it takes seconds to minutes under HE depending on the scheme and key size. HE is only viable today for very small models (sub-1B parameters) or when aggregating gradient summaries rather than full model weights. For 7B+ LoRA FL, HE is a research direction, not a production option in 2026.

TEE-Backed FL on Hopper Confidential Computing

A middle ground: run the aggregator inside a Hopper CC mode enclave. The aggregator decrypts and averages updates inside the TEE; even the cloud provider cannot read the client updates or the aggregated result. This provides cryptographic guarantees on the aggregation step without the compute overhead of HE.

The confidential GPU computing guide covers the full CC mode setup, attestation workflow, and KMS integration. One important caveat from that guide applies here: CC mode is not available on Spheron's on-demand GPU marketplace. Running a TEE-backed aggregator on Spheron requires a reserved commitment, where the data center partner enables CC mode at the BIOS/VBIOS level before handing over the instance. For standard (non-CC) aggregation, the on-demand marketplace works fine.

Network Design for Federated Learning

FL topology is fundamentally different from distributed training topology. In FSDP or DeepSpeed ZeRO-3, every GPU communicates with every other GPU via all-reduce collectives. The all-reduce pattern is the reason you need InfiniBand or RoCE for distributed training: 400 Gbps of bidirectional bandwidth, all moving simultaneously, all coordinated, all latency-sensitive.

FL is a star topology. Clients talk to the aggregator; they do not talk to each other. There are no all-reduce collectives. There is no collective synchronization barrier. Each client sends its LoRA delta to the aggregator independently, and the aggregator averages them after receiving all (or min_fit_clients) responses. The peak network event is the aggregator broadcasting the merged weights back to all clients simultaneously.

This means InfiniBand is the wrong tool for FL. The GPU networking guide covers the cases where InfiniBand or Spectrum-X is necessary; federated learning is explicitly not one of them. Standard 10 GbE between cloud pods is sufficient for LoRA FL at any model scale.

Bandwidth sizing table:

| Model Size | LoRA Rank | LoRA Delta per Round | Rounds/Hour | Client Egress |
|---|---|---|---|---|
| 7B | r=8 | ~8 MB | 12 | ~1.6 Mbps |
| 7B | r=16 | ~40 MB | 12 | ~8 Mbps |
| 13B | r=16 | ~80 MB | 6 | ~8 Mbps |
| 70B | r=8 | ~100 MB | 2 | ~4 Mbps |
| 70B | r=16 | ~200 MB | 2 | ~9 Mbps |

Even a 70B LoRA adapter at r=16 with 2 rounds per hour requires under 10 Mbps of sustained egress per client. A standard 1 Gbps pod network has 100x headroom. Full model weight sync (non-LoRA FL) changes this calculation drastically: a 7B model in BF16 is 14 GB, and syncing every 5 minutes requires 373 Mbps sustained. That is why LoRA is the only practical option for LLM-scale FL.

Network security: All FL traffic should run over mTLS. NVIDIA FLARE includes a built-in PKI infrastructure that generates client certificates automatically during workspace provisioning. For Flower, you need to configure gRPC TLS manually:

python
from pathlib import Path

# Flower server with TLS
fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=20),
    strategy=strategy,
    certificates=(
        Path("ca.crt").read_bytes(),
        Path("server.pem").read_bytes(),
        Path("server.key").read_bytes(),
    ),
)

# Flower client with TLS (Flower 1.x ClientApp approach)
# Configure SSL in pyproject.toml under [tool.flwr.federations.<name>]:
#   address = "AGGREGATOR_IP:8080"
#   root-certificates = "ca.crt"
# Then run: flwr run . <federation-name>

GPU Sizing for Federated Learning

Client-Side VRAM for Federated LoRA

Each FL client trains LoRA adapters on the frozen base model. VRAM usage at the client includes:

  • Frozen base model weights (BF16): no gradients or optimizer states are stored for frozen parameters, so memory usage is roughly model_params x 2 bytes
  • LoRA adapter weights (trainable): 2 x r x hidden_dim parameters per adapted projection (the A and B matrices), at 2 bytes each in BF16, multiplied by the number of layers and target modules
  • LoRA gradients: same size as LoRA weights
  • Optimizer states for LoRA only (AdamW requires 2x the trainable parameter count in FP32)
  • Activation memory for forward/backward passes

| Model Size | LoRA Rank | Recommended GPU | VRAM Usage |
|---|---|---|---|
| 7B | r=16 | RTX 4090 (24 GB) or A100 40G | ~18-22 GB |
| 13B | r=16 | A100 40G (40 GB) | ~30-36 GB |
| 13B | r=32 | A100 80G (80 GB) | ~42-50 GB |
| 70B | r=8 | A100 80G SXM4 or H100 SXM5 | ~68-76 GB |
| 70B | r=16 | H100 SXM5 (80 GB) | ~76-80 GB with gradient checkpointing |

Gradient checkpointing on the LoRA parameters (not the frozen layers) can reduce activation memory by 30-40% at the cost of a second forward pass per backward pass. For 70B models, this is necessary to fit within 80 GB even with only LoRA parameters as trainable.
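
A rough estimator for the components listed above (a sketch with assumed dimensions; it ignores sequence length, batch size, and framework overhead, so treat the result as a lower bound rather than a sizing guarantee):

python
def lora_client_vram_gb(model_params_b, hidden=4096, layers=32, rank=16,
                        target_modules=4, activation_gb=4.0):
    """Very rough VRAM lower bound (GB) for a federated LoRA client."""
    base_gb = model_params_b * 1e9 * 2 / 1e9                   # frozen BF16 weights
    lora_params = layers * target_modules * 2 * rank * hidden  # A and B matrices
    lora_gb = lora_params * 2 / 1e9                            # BF16 adapter weights
    grad_gb = lora_gb                                          # gradients, same size
    optim_gb = lora_params * 8 / 1e9                           # AdamW: 2 FP32 states per param
    return base_gb + lora_gb + grad_gb + optim_gb + activation_gb

print(f"7B, r=16: ~{lora_client_vram_gb(7):.1f} GB")           # ~18 GB, assuming ~4 GB of activations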

Server-Side Aggregation Throughput

The aggregator runs FedAvg: compute the weighted average of client LoRA weight tensors. This is pure CPU arithmetic on small arrays. A 16-vCPU aggregator handles FedAvg for 8 clients and a 7B LoRA adapter (r=16, ~40 MB per client) in well under 10 seconds per round. No GPU is required for aggregation unless you are running homomorphic encryption (which requires specialized CPU or GPU acceleration) or running a TEE-backed aggregator in CC mode.

For NVIDIA FLARE with streaming-based model transfer enabled, add one GPU to the aggregator to run the decompression and aggregation pipeline faster. This matters if you have many clients (16+) sending large updates simultaneously.

Full Fine-Tuning vs LoRA in FL Context

Full fine-tuning in an FL context transmits the entire model every round. For a 7B BF16 model, that is 14 GB per client per round. With 4 clients and 2 rounds per hour, the aggregator handles 112 GB/hr of inbound traffic. This is technically possible on a 100 GbE connection but is wasteful and slow. Full fine-tuning also requires clients to have enough VRAM for the full optimizer state of all 7B parameters, which pushes the hardware requirement to H100 or A100 80G even for a 7B model.

LoRA is not a compromise in the FL context. It is the right architecture choice: smaller gradient updates, less VRAM per client, faster rounds, and practically no quality gap for domain adaptation tasks.

Pricing for FL client GPUs (Spheron live rates as of 16 May 2026):

| GPU | On-Demand $/hr | Spot $/hr | Cost per FL Round (5 min) | Typical FL Use |
|---|---|---|---|---|
| A100 80G SXM4 | $1.71 | $0.45 | ~$0.143 on-demand / ~$0.038 spot | 13B-70B FL client |
| H100 SXM5 | $3.90 | $1.66 | ~$0.325 on-demand / ~$0.138 spot | 70B FL client, large aggregator |
| H200 SXM5 | $4.62 | $1.92 | ~$0.385 on-demand / ~$0.160 spot | 70B+ FL client |

For a 4-client FL setup training a 70B model with 2 rounds per hour over 10 hours (20 rounds), using A100 GPU rental at spot pricing of $0.45/hr: 4 clients x 10 hours x $0.45 = $18.00 total compute cost. At on-demand rates ($1.71/hr), the same run is 4 x 10 x $1.71 = $68.40. The aggregator (CPU pod) adds roughly $2-4 either way.

Pricing fluctuates based on GPU availability. The prices above are based on 16 May 2026 and may have changed. Check current GPU pricing → for live rates.

For workloads that fit on smaller clients, L40S for cost-efficient FL clients is a good option for 7B and 13B models.

Production Patterns: Dropout, Stale Gradients, Byzantine Robustness

Handling Client Dropout

In a centralized training run, a failed GPU is a hard failure. In FL, client dropout is expected. Networks go down, pods restart, maintenance windows hit at inconvenient times.

The primary mitigation is setting min_fit_clients below the total client count. In the Flower server config above, min_fit_clients=3 means 3 of 4 clients must respond before the round proceeds. The 4th client, if it comes back online, rejoins the next round.

For NVIDIA FLARE, the client staleness tolerance config allows clients that missed a round to contribute updates from a previous round, weighted by a staleness decay factor. This is appropriate for asynchronous FL where client training times vary significantly. Synchronous FL (all clients finish before aggregation) is simpler to reason about but sensitive to the slowest client. Async FL is more complex to tune but tolerates high variance in client training times.

Stale Gradient Mitigation

In asynchronous FL, a client that finishes training on an old global model version and submits its update creates a stale gradient problem. Its update represents a gradient computed at round t-k but submitted at round t. Applying it directly as if it were a round-t gradient introduces noise.

FedAsync-style asynchronous strategies implement time-decayed weighting: a client update from k rounds ago is weighted by alpha^k where alpha < 1. Typical values are alpha=0.9 to alpha=0.5. The decay rate is a hyperparameter: too aggressive and you throw away useful client updates; too gentle and stale gradients accumulate and slow convergence.
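
A sketch of the time-decayed weighting idea (a hypothetical helper to illustrate the math, not a specific framework API):

python
def staleness_weight(round_submitted, current_round, alpha=0.9):
    """Down-weight an update computed k rounds ago by alpha**k."""
    k = max(current_round - round_submitted, 0)
    return alpha ** k

# An update trained against the round-17 global model but applied at round 20:
print(staleness_weight(17, 20))         # 0.9**3 = 0.729
print(staleness_weight(17, 20, 0.5))    # 0.5**3 = 0.125 - much more aggressive decay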

Byzantine-Robust Aggregation

FedAvg is not robust to adversarial clients. A single client submitting poisoned gradient updates (scaled adversarially to push the global model toward a backdoor) can compromise the entire federated model. This is a realistic attack vector when FL spans organizations that do not fully trust each other.

Byzantine-robust aggregation strategies replace FedAvg's mean with a robust estimator:

  • FedMedian: take the coordinate-wise median across client updates instead of the mean. More robust than mean but loses the weighted averaging property.
  • Krum: select the client update closest to its k-nearest neighbors in gradient space. Assumes fewer than f = (n-2)/2 malicious clients out of n total.
  • Multi-Krum: a generalization of Krum that selects the top-m closest clients, balancing robustness and convergence speed.

Both Krum and Multi-Krum are available as Flower strategy subclasses. For FL deployments where clients represent independent organizations with adversarial incentives (competitive industry consortia, open FL networks), at minimum use FedMedian. For healthcare FL where all clients are vetted institutions, FedAvg with DP is sufficient.
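
Switching to a robust aggregator in Flower is a strategy change on the server. A sketch (constructor arguments beyond the ones shown vary by Flower version, so verify against your installed release):

python
from flwr.server.strategy import FedMedian, Krum

# Coordinate-wise median instead of the FedAvg mean
median_strategy = FedMedian(
    min_fit_clients=3,
    min_available_clients=4,
)

# Krum: keep the update(s) closest to their neighbors in parameter space
krum_strategy = Krum(
    min_fit_clients=3,
    min_available_clients=4,
    num_malicious_clients=1,    # assumed upper bound on adversarial clients
)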

Compliance Walkthrough: EU AI Act Article 10 Mapping

Article 10 of the EU AI Act imposes data governance requirements on training data for high-risk AI systems. FL satisfies the cross-border transfer prohibition but does not eliminate the governance obligation at each participating site. Every FL client is still subject to Article 10 at its location.

| EU AI Act Requirement | FL Mechanism | Implementation Detail |
|---|---|---|
| Training data does not cross borders | FL keeps raw data local | Client data never leaves the pod; only LoRA weights are transmitted |
| Data quality and relevance checks | Local validation at each client | Add a data validation step before each training round (schema checks, outlier detection; see the sketch after this table) |
| Training data documentation | Per-client data cards | Use structured data cards at each site: source, license, filtering applied, date range |
| Model audit trail | Aggregation round logs | NVIDIA FLARE's built-in audit log or a custom Flower FitMetricsAggregationFn that logs round metadata |
| Human oversight for high-risk systems | Round approval workflows | For medical AI, add a round approval gate where a human reviews aggregation metrics before broadcasting the new model |
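
A minimal example of the per-site validation gate from the table above, run on each client before local training starts (the field names and thresholds are hypothetical and site-specific; this assumes a HuggingFace datasets object):

python
def validate_local_dataset(dataset, required_fields=("text",), min_examples=100):
    """Reject the round locally if this site's data fails basic quality checks."""
    if len(dataset) < min_examples:
        raise ValueError(f"Only {len(dataset)} examples; need at least {min_examples}")
    missing = [f for f in required_fields if f not in dataset.column_names]
    if missing:
        raise ValueError(f"Dataset is missing required fields: {missing}")
    empty = sum(1 for row in dataset if not row["text"].strip())
    if empty / len(dataset) > 0.01:
        raise ValueError(f"{empty} empty records exceed the 1% tolerance")
    return True

# Call inside the client's fit() before training, e.g.:
#   validate_local_dataset(self.train_dataset)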

The governance obligation at each client site means FL does not eliminate the compliance workload. It redistributes it: instead of one team governing a central dataset, each participating site governs its local data. For healthcare AI across 10 hospital sites, that means 10 separate data governance reviews. FL makes the legal review simpler (no data transfer agreement needed), but it does not make the data governance obligation disappear.

For the full regulatory compliance stack beyond data governance, including risk classification, technical documentation requirements, and audit infrastructure, see the EU AI Act compliance guide.

Cost Comparison: Federated vs Centralized Training

Scenario: 4 healthcare organizations, each with 50,000 clinical records, want to jointly fine-tune a 13B medical language model.

Centralized option: One organization pools all data from the other three. This requires data transfer agreements between all four parties, IRB amendments at each institution, legal review of cross-border transfer requirements (if the organizations span EU member states), and de-identification or pseudonymization of records before transfer. Conservative estimate: 3-6 months of legal and compliance work before training starts, plus egress costs for ~200 GB of data across institutional boundaries.

On the compute side, centralized training on a single A100 80G runs a 13B LoRA fine-tune on 200K records in roughly 8-12 hours. At $1.71/hr for the A100 GPU rental on Spheron, the raw compute cost is $14-$21. The legal overhead dwarfs the compute cost.

Federated option: 4 client pods (A100 80G, $0.45/hr each spot) training locally for 20 rounds, with a CPU aggregator. Each round trains for ~15 minutes; total training wall time is ~5 hours. Total compute: 4 x 5 hours x $0.45 = $9.00 spot pricing (or $34.20 at on-demand rates of $1.71/hr). No data transfer agreements needed. Each institution starts training immediately after provisioning their pod.

Pricing fluctuates based on GPU availability. The prices above are based on 16 May 2026 and may have changed. Check current GPU pricing → for live rates.

Egress cost comparison: Hyperscalers charge $0.08-$0.12/GB for data egress. Centralizing 200 GB from 3 organizations costs $16-$24 in egress alone, and that is before the legal work. FL clients transmit LoRA deltas: ~80 MB per client per round, 20 rounds, 3 sending clients = ~4.8 GB total egress. At $0.08/GB that is $0.38 in egress. Spheron charges no additional egress fee on training data that stays local to the pod; you pay only for the pod-to-pod LoRA weight transfers.

Break-even analysis: FL's additional compute overhead (4 pods instead of 1) costs $7-10 extra. The legal overhead for data centralization is typically 10-100x higher in calendar time and staff cost. FL becomes cost-advantageous the moment any legal review is required. For regulated healthcare and financial data, that threshold is crossed in nearly every cross-institutional scenario.

Spheron's distributed marketplace model is structurally well-suited to FL: many small pods in different regions at transparent per-hour rates, no minimum commitment for most workloads, and the ability to spin down client pods between training rounds (you are not paying for idle reserved instances).

Summary

Federated learning is a production-grade technique in 2026 for LLM fine-tuning under data residency constraints. The use cases are real: HIPAA-constrained clinical AI, EU AI Act Article 10 compliance for cross-border healthcare, and jurisdictional data sovereignty requirements in APAC and MENA markets. LoRA adapters make FL practical at LLM scale by reducing per-round bandwidth from 14 GB to under 200 MB even for 70B models. Flower 1.x is the right framework for most teams; NVIDIA FLARE 2.6 is the right choice when audit trails, built-in secure aggregation, and an admin console are required. Spheron's multi-region pod model provides the infrastructure fit: independent pods in different geographic regions, each acting as a data residency boundary, at per-hour rates that make idle-time between rounds affordable.

Federated learning needs GPU pods spread across regions, not a single large cluster. Spheron's distributed marketplace provisions bare-metal pods by region, so each site stays within its data residency boundary while sharing only model updates.

Rent H100 → | Rent A100 → | View all GPU pricing →


Quick Setup Guide

  1. Choose your FL topology and framework

    Decide between horizontal FL (same schema, different data owners), vertical FL (different feature sets, same entities), or federated fine-tuning (shared base model, private data per site). Then pick Flower for flexible Python-native setups, NVIDIA FLARE for enterprise with built-in secure aggregation, or OpenFL for healthcare/Intel workloads.

  2. Provision GPU pods on Spheron for aggregator and clients

    Create one aggregator pod (CPU-heavy, minimal GPU) and one pod per FL client (GPU-sized for LoRA training). Use Spheron's region filter to place client pods in different geographic locations matching your data residency requirements. Note the IP of each pod for the FL network config.

  3. Configure the Flower server and client strategy

    On the aggregator, run flwr.server.start_server() with a FedAvg or FedProx strategy and set the server address to 0.0.0.0:8080. On each client pod, implement the fl.client.NumPyClient interface, load your local dataset, and connect to the aggregator using the aggregator's public IP.

  4. Apply LoRA adapters for federated fine-tuning

    Use peft.get_peft_model() with LoraConfig (r=16, lora_alpha=32) to wrap your base model before passing it to the FL client. Only LoRA adapter weights are transmitted each round - not the frozen base model weights. This reduces per-round bandwidth from 14 GB (full 7B model) to 8-40 MB.

  5. Add differential privacy for aggregation

    Enable Opacus-backed DP noise injection at each client with a target epsilon of 3-8 (epsilon=3 for healthcare, epsilon=8 for less sensitive data) and delta=1e-5. In Flower, wrap the strategy with DifferentialPrivacyClientSideAdaptiveClipping to handle clip norm automatically.

  6. Monitor training and handle client dropout

    Set min_fit_clients below your total client count so training continues if a client disconnects. Log per-round metrics (loss, accuracy, bandwidth) at the aggregator. NVIDIA FLARE's admin console provides built-in round monitoring; for Flower, add a custom FitMetricsAggregationFn.


Frequently Asked Questions

What is the difference between federated learning and standard distributed training?

Standard distributed training (FSDP, DeepSpeed ZeRO) moves all data to a central cluster and shards the model across GPUs. Federated learning keeps data at each site - hospital, bank, regional office - and only shares model updates (gradients or weights). The aggregation server never sees raw training data. This is required when data cannot legally leave its jurisdiction, such as under GDPR, HIPAA, or China's DSL.

Which framework should I choose: Flower, NVIDIA FLARE, or OpenFL?

Flower 1.x is the default choice for Python-native teams: flexible, minimal boilerplate, and works with PyTorch or JAX without requiring a custom client SDK. NVIDIA FLARE 2.6 is the right pick for enterprise deployments that need built-in secure aggregation, an admin console, and HIPAA-grade audit trails - it adds significant operational overhead in exchange. OpenFL (Intel/Linux Foundation) targets healthcare specifically and integrates with Intel hardware optimizations; use it if you are already on Intel Xeon + Gaudi nodes or need TensorFlow FL support.

How much network bandwidth does federated learning need between sites?

Bandwidth requirements depend on model size and update frequency. A LoRA adapter for a 7B model with rank 16 generates roughly 8-40 MB of gradient updates per round. With 4 clients sending one round per minute, total ingress at the aggregator is well under 1 Gbps. Full model weight transmission (not LoRA) for a 7B model is 14 GB per round - that requires 100 Gbps or careful round scheduling. InfiniBand is not needed for FL: standard 10 GbE or 25 GbE between sites is sufficient for LoRA-based federated fine-tuning.

Does federated learning satisfy EU AI Act data residency requirements?

Federated learning addresses the training data residency concern: raw data never crosses regional boundaries, only model updates do. However, the EU AI Act (Article 10) also requires that data used for training be subject to appropriate quality and governance controls, which applies to the local data at each FL client site as well. FL reduces the attack surface but does not eliminate the governance obligation. Pair FL with differential privacy and audit logging at each client to address Article 10 holistically.

Can I run a multi-region federated learning setup on Spheron?

Yes. Spheron's distributed marketplace model provisions GPU pods independently across data center partners in multiple regions. You can run an FL aggregator on a dedicated pod and deploy FL clients on separate pods in different geographic regions. Each region's pod acts as a data residency boundary. Spheron's per-hour pricing makes this practical - you are not paying for reserved instances that sit idle between training rounds.
