Decentralized LLM Training on GPU Cloud: Pluralis, Prime Intellect, and Nous Psyche (2026 Guide)

Three decentralized training protocols hit production-viable scale in 2026: Pluralis with its OpenDiLoCo-based protocol, Prime Intellect with INTELLECT-2, and Nous Research with Psyche, which is coordinated via the Solana blockchain. If you are new to multi-node training fundamentals, read the multi-node FSDP training guide first. This post assumes you are already comfortable with single-datacenter distributed training and focuses on what changes when your nodes are separated by the public internet.

What Decentralized Training Solves

Large model pretraining requires compute that most organizations cannot afford at hyperscaler prices. Training a 70B model from scratch on AWS p5 instances runs $4-8M for a decent token budget. Even with reserved pricing, the cost keeps most research teams out of the pretraining game entirely.

Traditional distributed training adds another constraint: all GPUs must sit in the same datacenter on the same InfiniBand fabric. The all-reduce collective synchronizes gradients every step, and with 400Gb/s InfiniBand that's fast. But put those same GPUs in different datacenters and you have 10-100Gb/s internet links with 30-150ms latency. Standard data parallelism breaks down entirely.

DiLoCo (Distributed Low-Communication) solves this. Instead of synchronizing every gradient step, DiLoCo runs two optimization loops: an inner loop with H=500 local AdamW steps, and an outer loop that exchanges pseudo-gradients (the difference between current and starting weights after H steps) across nodes. The outer step uses SGD with Nesterov momentum. This reduces inter-node communication by 500x versus standard data parallelism while matching collocated training loss curves within 1-2%.
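The two-loop structure can be sketched in a few lines of plain Python. This is a toy illustration rather than any protocol's actual code: plain SGD stands in for the inner AdamW optimizer, weights are flat lists of floats, and the names `inner_loop`, `outer_step`, and `train` are hypothetical. The outer update uses the common PyTorch-style Nesterov formulation, which is an assumption about the exact variant.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def inner_loop(start: Vector, grad_fn: Callable[[Vector], Vector],
               steps: int, lr: float) -> Vector:
    """Run `steps` local updates with no communication.

    Plain SGD is a stand-in for AdamW to keep the sketch short.
    """
    w = list(start)
    for _ in range(steps):
        grads = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, grads)]
    return w


def outer_step(global_w: Vector, local_ws: List[Vector], momentum: Vector,
               outer_lr: float = 0.7, mu: float = 0.9) -> Tuple[Vector, Vector]:
    """Average the workers' pseudo-gradients and apply SGD with Nesterov momentum."""
    n = len(local_ws)
    # Pseudo-gradient: starting weights minus the weights after H local steps.
    pseudo_grad = [
        sum(global_w[j] - lw[j] for lw in local_ws) / n
        for j in range(len(global_w))
    ]
    new_momentum = [mu * m + g for m, g in zip(momentum, pseudo_grad)]
    new_w = [
        w - outer_lr * (g + mu * m)
        for w, g, m in zip(global_w, pseudo_grad, new_momentum)
    ]
    return new_w, new_momentum


def train(targets: List[float], rounds: int, inner_steps: int) -> Vector:
    """Toy run: worker i minimizes 0.5 * (w - targets[i])**2 on its own shard."""
    global_w, momentum = [0.0], [0.0]
    for _ in range(rounds):
        local_ws = [
            inner_loop(global_w, lambda w, t=t: [w[0] - t], inner_steps, lr=0.1)
            for t in targets
        ]
        global_w, momentum = outer_step(global_w, local_ws, momentum)
    return global_w
```

Calling `train([1.0, 3.0], rounds=40, inner_steps=50)` drives the shared weight toward 2.0, the average of the two workers' optima, even though the workers exchange information only once per round.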

OpenDiLoCo extended this to asynchronous outer steps, letting stragglers catch up without blocking the whole cluster. The table below compares synchronization frequency across methods:

Method	Sync frequency	Bandwidth needed	Latency tolerance
Standard data parallelism	Every step	Very high (400Gb/s+)	Sub-millisecond
DiLoCo	Every 500 steps	Moderate (10-100Gb/s)	Up to 150ms
OpenDiLoCo (async)	Per-node cadence	Low (1-10Gb/s)	500ms+
Federated averaging	Every epoch	Very low	Seconds

The 500-step cadence is what makes cloud GPU nodes across different datacenters viable training participants.

Protocol Comparison: Pluralis vs Prime Intellect vs Nous Psyche

Each protocol took a different approach to the same core problem.

Feature	Pluralis	Prime Intellect	Nous Psyche
Model size supported	1B-70B	1B-200B+	1B-30B
Communication protocol	OpenDiLoCo + gossip	DiLoCo + hivemind DHT	DisTrO via Solana
Gradient exchange	Signed pseudo-gradients	Aggregated outer steps	Compressed gradient shards
Fault tolerance	Byzantine-resistant (signed gradients)	DHT-coordinated node recovery	Solana-based consensus
Solana coordination	No	No	Yes (native)
Node entry requirements	40GB+ VRAM, 10Gb/s	80GB VRAM, 25Gb/s	40GB+ VRAM, 10Gb/s
Training objective	Pretraining	Pretraining + fine-tuning	Pretraining

The architectural differences matter for choosing the right protocol for your use case.

Pluralis uses signed gradients for Byzantine fault tolerance. Every pseudo-gradient update is cryptographically signed with the contributing node's private key. The aggregator verifies each signature and rejects submissions that fail verification or fall statistically outside the expected gradient distribution. This makes Pluralis the hardest to game and the most suitable for open, permissionless contributor pools where you don't know or trust all participants.

Prime Intellect uses hivemind DHT for peer coordination. Hivemind is a battle-tested peer-to-peer library developed at Yandex Research specifically for internet-distributed training. Each node registers itself with a distributed hash table, and the DHT coordinates outer step scheduling, shard assignment, and recovery when nodes drop. INTELLECT-2 is a 32B reinforcement learning run (GRPO on a reasoning model), building on the earlier INTELLECT-1 pretraining work that demonstrated coordinated training across multiple datacenters.

Nous Psyche coordinates training through the Solana blockchain. Each contributing node is identified by a Solana wallet keypair. Psyche's consensus mechanism scores each node's gradient contributions and rewards participating nodes. This makes Psyche the only protocol that provides direct economic incentives for participation, but it also adds the requirement to hold a Solana wallet and meet the run's authorization requirements.

Hardware Requirements Per Node

The bandwidth requirement is the most important constraint for decentralized training, more so than raw GPU compute.

Protocol	Min VRAM	Rec VRAM	Min Bandwidth	Latency Tolerance
Pluralis	40GB	80GB	10 Gb/s	150ms
Prime Intellect	80GB	80-160GB	25 Gb/s	100ms
Nous Psyche	40GB	80GB	10 Gb/s	200ms

The bandwidth requirement comes from the outer-step exchange. After H local steps, each node broadcasts its pseudo-gradient vector. For a 7B model in BF16, the pseudo-gradient is ~14GB. Across 4 nodes with OpenDiLoCo's compressed exchange, you need ~3-5GB transmitted per outer step. At 25Gb/s that's about 1-1.6 seconds per outer step at line rate. At 10Gb/s it stretches to 2.4-4 seconds, which adds up to roughly 40-67 minutes of transfer time per 1,000 outer steps, before any protocol overhead or congestion.
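To sanity-check outer-step transfer times for your own model size and link, the arithmetic is plain line-rate division. The helper names below are hypothetical, and the results are lower bounds: they ignore protocol overhead, latency, and congestion.

```python
def pseudo_gradient_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Uncompressed pseudo-gradient size in GB (BF16 uses 2 bytes per parameter)."""
    return params_billions * bytes_per_param


def transfer_seconds(payload_gb: float, link_gbps: float) -> float:
    """Line-rate transfer time: convert gigabytes to gigabits, divide by link speed."""
    return payload_gb * 8 / link_gbps


def transfer_minutes_per_1000_steps(payload_gb: float, link_gbps: float) -> float:
    """Cumulative transfer time over 1,000 outer steps, in minutes."""
    return transfer_seconds(payload_gb, link_gbps) * 1000 / 60


if __name__ == "__main__":
    for gbps in (10, 25):
        for payload in (3, 5):
            secs = transfer_seconds(payload, gbps)
            mins = transfer_minutes_per_1000_steps(payload, gbps)
            print(f"{payload}GB at {gbps}Gb/s: {secs:.2f}s per step, {mins:.0f} min per 1,000 steps")
```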

Within a node, NVLink still handles intra-node communication at full speed. An 8xH100 SXM5 node uses NVLink at 900GB/s for all-reduce operations across its 8 GPUs. Only the cross-node outer-step exchange uses the internet connection. So your intra-node parallelism strategy (TP=8 within the node) is unchanged. The decentralized layer sits above normal FSDP or tensor parallel training.

Deploy a Prime Intellect Training Node on Spheron

This covers joining an active Prime Intellect pretraining run (INTELLECT-1 style, using the OpenDiLoCo pretraining stack). Note: INTELLECT-2 is a reinforcement learning run on a 32B reasoning model, not pretraining. The setup below targets the pretraining workflow. Provision an H100 SXM5 instance on Spheron with 8x H100 80GB GPUs, at least 25Gb/s egress bandwidth, and 512GB system RAM (for optimizer state offloading during outer steps).

Install dependencies:

bash

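# Python 3.11+, CUDA 12.1+
# The +cu121 torch wheel is served from the PyTorch index rather than PyPI, so add it as an extra index
pip install prime-intellect hivemind torch==2.3.0+cu121 transformers accelerate bitsandbytes \
  --extra-index-url https://download.pytorch.org/whl/cu121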

Create your training config (config.yaml):

yaml

model:
  name: "meta-llama/Llama-3-8B"
  max_length: 2048

optimizer:
  inner:
    type: "adamw"
    lr: 1.0e-4
    weight_decay: 0.1
    betas: [0.9, 0.95]
  outer:
    type: "nesterov_sgd"
    lr: 0.7
    momentum: 0.9

diloco:
  inner_steps: 500
  sync_timeout_seconds: 120
  max_stale_steps: 2

data:
  dataset: "HuggingFaceFW/fineweb-edu"
  tokens_per_step: 524288
  num_workers: 8

distributed:
  dht_bootstrap_peers:
    - "/ip4/34.172.12.109/tcp/12345/p2p/QmBootstrap1"
    - "/ip4/35.226.44.18/tcp/12345/p2p/QmBootstrap2"
  listen_port: 12346
  announce_ip: "YOUR_PUBLIC_IP"

checkpoint:
  dir: "/mnt/storage/checkpoints"
  save_every_outer_steps: 10
  keep_last: 5

Launch the training node:

bash

# Set your public IP (critical for peer discovery)
export PUBLIC_IP=$(curl -s ifconfig.me)
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=2

# Start the training process (-m tells torchrun to launch a Python module rather than a script path)
torchrun --nproc_per_node=8 \
  --master_addr=localhost \
  --master_port=29500 \
  -m prime_intellect.train \
  --config config.yaml \
  --node_id "$(hostname)" \
  --public_ip "$PUBLIC_IP"

# Monitor outer step progress
prime_intellect monitor --dht_peer /ip4/34.172.12.109/tcp/12345/p2p/QmBootstrap1

The node registers with the DHT, receives a data shard assignment, and starts its first inner loop. You will see logs like [outer_step=142] pseudo_grad_norm=0.0234, participants=18, sync_latency=143ms. The sync latency reflects your cross-region network round-trip. For Prime Intellect, aim to keep it under 100ms; values above 150ms start to degrade outer step convergence. For comparison, Pluralis tolerates up to 150ms and Nous Psyche up to 200ms.

Deploy a Pluralis Training Participant

Pluralis uses a gossip protocol instead of a DHT. Peer discovery is faster but shard assignment is done by a lightweight coordinator service rather than a fully decentralized DHT. The signed-gradient protocol adds ~2ms overhead per outer step on modern CPUs (Ed25519 signing).

Clone and configure:

bash

# Pluralis Research publishes its training stack at https://github.com/PluralisResearch
# Check that org for the current training-participant client (e.g. node0) and clone it.
# The CLI commands below reflect the published protocol design; consult the client's README for current command names.

# Generate your node keypair (keep the private key secure)
pluralis keygen --output ~/.pluralis/keypair.json

Node configuration (pluralis-node.toml):

toml

[node]
keypair = "~/.pluralis/keypair.json"
vram_allocation_gb = 76  # leave 4GB for OS/NCCL buffers on 80GB H100
bandwidth_limit_gbps = 20

[training]
model_size = "8b"
inner_optimizer = "adamw"
inner_lr = 1e-4
outer_optimizer = "sgd_nesterov"
outer_lr = 0.7
inner_steps = 500
gradient_compression = "quantized_bf16"

[coordinator]
url = "<your-coordinator-url>"
heartbeat_interval_seconds = 30

[checkpoint]
backend = "s3"
bucket = "your-checkpoint-bucket"
prefix = "pluralis-runs/"
save_every_outer_steps = 10

Start the participant:

bash

# Command names below are illustrative. Check the client README for the current CLI surface.
pluralis run \
  --config pluralis-node.toml \
  --gpu_ids 0,1,2,3,4,5,6,7 \
  --run_id <your-run-id>

Each outer step, Pluralis signs the pseudo-gradient with your private key, broadcasts it to the gossip network, and the coordinator verifies signatures from all participants before computing the aggregated outer step. If your gradient fails verification (due to hardware error, incorrect shard, or intentional manipulation), it's excluded from the aggregation and your node is flagged. After 3 consecutive failures, you're temporarily excluded from the run.

On spot nodes, async checkpoint offload, optimizer state preservation, and self-healing job controllers all work well alongside Pluralis' run recovery.

Deploy a Nous Psyche Node

Psyche coordinates training through the Solana blockchain. Unlike Pluralis and Prime Intellect, participation can earn rewards based on the quality of your gradient contributions, making it the only protocol with direct economic incentive alignment.

Psyche uses DisTrO (Distributed Training Over-the-Internet), a compressed gradient scheme that reduces outer-step bandwidth by 10-20x versus standard pseudo-gradient exchange. DisTrO applies top-k sparsification to the pseudo-gradient: only the k largest-magnitude components are transmitted, and the residual error is accumulated locally for the next outer step.
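The top-k-with-residual idea described above can be sketched as follows. This illustrates only the mechanism as stated in this guide, not Psyche's actual implementation; the function name `topk_with_error_feedback` and the dict-of-indices wire format are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def topk_with_error_feedback(
    pseudo_grad: List[float], residual: List[float], k: int
) -> Tuple[Dict[int, float], List[float]]:
    """Keep the k largest-magnitude components and carry the rest forward.

    Returns (sparse update to transmit, new local residual).
    """
    # Add back whatever was withheld in earlier outer steps.
    corrected = [g + r for g, r in zip(pseudo_grad, residual)]
    # Indices of the k components with the largest absolute value.
    top = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]), reverse=True)[:k]
    sparse = {i: corrected[i] for i in top}
    # Everything that was not sent stays in the residual for the next round.
    new_residual = [0.0 if i in sparse else v for i, v in enumerate(corrected)]
    return sparse, new_residual


if __name__ == "__main__":
    residual = [0.0, 0.0, 0.0, 0.0]
    for grad in ([0.1, -2.0, 0.5, 3.0], [0.2, 0.1, 0.4, -0.1]):
        sent, residual = topk_with_error_feedback(grad, residual, k=2)
        print("sent:", sent, "residual:", residual)
```

Because unsent components accumulate in the residual, small but persistent signals are eventually transmitted instead of being dropped permanently.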

Prerequisites: Linux OS, NVIDIA GPU with drivers, Docker, and NVIDIA Container Toolkit.

Create a Solana wallet:

bash

# Install Solana CLI tools
sh -c "$(curl -sSfL https://release.solana.com/stable/install)"

# Generate a new wallet keypair
solana-keygen new --outfile ~/.config/solana/psyche-keypair.json

# Get the public key for the keypair you just created (needed for run authorization)
solana-keygen pubkey ~/.config/solana/psyche-keypair.json

Configure your environment (.env):

bash

# Wallet and RPC configuration
WALLET_KEYPAIR_PATH=/root/.config/solana/psyche-keypair.json
SOLANA_RPC_URL=https://api.mainnet-beta.solana.com
RUN_ID=<run-id-from-psyche-dashboard>

Verify authorization and join a run:

bash

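# Load RUN_ID and WALLET_KEYPAIR_PATH from your .env file into this shell
set -a; source /path/to/your/.env; set +a

# Check if your wallet is authorized for the current run
./run-manager can-join \
  --run-id "$RUN_ID" \
  --authorizer <AUTHORIZER_PUBKEY> \
  --address "$(solana-keygen pubkey "$WALLET_KEYPAIR_PATH")"

# Start the training node (downloads correct Docker image automatically)
./run-manager --env-file /path/to/your/.env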

Psyche's Solana-based consensus scores your contributions against those of other nodes. Nodes that score below the threshold may be excluded from reward distribution, so keep your node online consistently to maintain contribution quality. For the largest model tasks (around 30B), an H200 GPU rental provides 141GB of HBM3e per GPU, which reduces how much sharding you need and helps protect inner step throughput.

Cost Comparison: Decentralized vs Single-Datacenter Training

The cost advantage of decentralized training comes from spot pricing flexibility across multiple providers. Because DiLoCo's 500-step sync cadence tolerates spot preemptions (each outer step is independently recoverable), you can use aggressive spot pricing that would break standard FSDP runs.

Live prices from Spheron as of 2026-06-04:

H100 SXM5 on-demand: $5.01/hr per GPU, spot: $1.49/hr per GPU
H200 SXM5 on-demand: $5.92/hr per GPU, spot: $3.31/hr per GPU
B200 SXM6 on-demand: $9.36/hr per GPU, spot: $5.34/hr per GPU

Config	7B / 100B tokens	13B / 150B tokens	70B / 500B tokens
Decentralized spot H100 (4 nodes x 8 GPUs)	~$1,190	~$2,980	~$21,400
Spheron H100 on-demand single-DC	~$4,000	~$10,000	~$72,000
Single-DC FSDP reserved cluster (est.)	~$2,100	~$5,200	~$38,000
AWS p5.48xlarge (8x H100, $98.32/hr)	~$7,870	~$19,700	~$141,000

For Spheron specifically, the H100 SXM5 spot rate ($1.49/hr per GPU) is roughly 3.4x cheaper than the on-demand rate ($5.01/hr per GPU). Running a decentralized job on spot instances instead of single-DC on-demand roughly triples your token budget for the same dollar spend. The spot GPU training guide covers how to make spot-based decentralized runs resilient to preemptions. Based on the table above, Spheron spot costs roughly 85% less than AWS p5.48xlarge on raw compute, and spot decentralized training costs 40-45% less than a self-managed reserved cluster.
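The ratios in this section are easy to recompute when prices change. The sketch below uses the per-GPU rates and table figures quoted above; the helper names and the `slowdown` parameter (a way to fold in the wall-clock penalty of decentralized runs) are hypothetical.

```python
H100_ON_DEMAND = 5.01        # $/GPU-hr, from the price list above
H100_SPOT = 1.49             # $/GPU-hr, from the price list above
AWS_P5_PER_GPU = 98.32 / 8   # p5.48xlarge hourly rate divided across 8 GPUs


def savings_pct(cheaper: float, baseline: float) -> float:
    """Percentage saved by paying `cheaper` instead of `baseline`."""
    return (1 - cheaper / baseline) * 100


def budget_multiplier(baseline_rate: float, cheaper_rate: float,
                      slowdown: float = 0.0) -> float:
    """How much more GPU work the same dollars buy, after a fractional slowdown."""
    return baseline_rate / cheaper_rate / (1 + slowdown)


if __name__ == "__main__":
    print(f"spot vs on-demand: {budget_multiplier(H100_ON_DEMAND, H100_SPOT):.2f}x")
    print(f"with 20% slowdown: {budget_multiplier(H100_ON_DEMAND, H100_SPOT, 0.20):.2f}x")
    print(f"7B run, spot vs reserved: {savings_pct(1190, 2100):.1f}% saved")
    print(f"7B run, spot vs AWS p5:   {savings_pct(1190, 7870):.1f}% saved")
```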

When factoring in InfiniBand reservation costs for the single-DC reserved option (typically $0.20-0.40/hr per GPU additional), decentralized training on internet-connected nodes avoids that overhead entirely.

For optimizing multi-provider spot purchasing across the decentralized training setup, see spot GPU arbitrage across providers for strategies that work well with DiLoCo's fault-tolerant outer step cadence.

Pricing fluctuates based on GPU availability. The prices above are based on 04 Jun 2026 and may have changed. Check current GPU pricing for live rates.

When Decentralized Training Fits (and When It Doesn't)

Decentralized training is the right call when:

Budget is under $50K for a pretraining run. At this budget, hyperscaler reserved clusters are out of reach, but 4-8 cloud GPU nodes running DiLoCo can train a quality 7B-13B model on meaningful token budgets.
You want to contribute to a public training run. Prime Intellect and Pluralis both operate open contributor pools. You add compute and receive access to the resulting model. No model ownership, but no infrastructure ownership cost either.
Your compute is sourced across multiple providers. If your team already has credits or commitments spread across multiple GPU clouds, DiLoCo lets you combine them into a single training run rather than choosing one provider.
You're comfortable with 15-20% slower wall-clock training. The sync overhead is real. Budget for it.

Single-datacenter FSDP is better when:

Wall-clock speed is critical. A production training deadline with a hard date needs the full InfiniBand bandwidth. DiLoCo's 500-step cadence adds 10-15% wall-clock time even under ideal conditions.
The model is 70B+ and requires tight pipeline parallelism. Gradient staleness in the DiLoCo outer step scales with model size. For 70B+ with long context, the accumulated error between outer steps starts to noticeably affect convergence. Single-DC with Megatron-Core 3D parallelism is more reliable.
Your training requires synchronized batch statistics. Some architectures (e.g., those using batch norm with cross-replica statistics, such as SyncBatchNorm) need cross-node batch statistics every step. DiLoCo's infrequent synchronization is incompatible with this requirement.

For the intermediate case where you want multi-datacenter training without full internet latency constraints, see the guide on training across datacenter boundaries which covers RoCE, EFA, and high-speed Ethernet configurations.

Common Failure Modes and Mitigations

Five failure modes dominate decentralized training deployments in practice:

Straggler nodes are the most common. If one node is running on slower hardware or has a congested internet link, it holds up the outer step for the whole cluster. OpenDiLoCo mitigates this with a configurable timeout: the outer step proceeds once 80% of expected participants have submitted, and slow nodes are skipped for that round. Set sync_timeout_seconds: 120 in Prime Intellect's config. Pluralis' coordinator implements a similar majority-threshold policy.
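A quorum-plus-timeout rule of this kind might look like the sketch below. It is an assumption about how the two settings interact, not Prime Intellect's or Pluralis' actual logic: aggregation starts as soon as the quorum is met, and the timeout only matters when the quorum never arrives. The function name and return values are hypothetical.

```python
def outer_step_decision(submitted: int, expected: int, elapsed_s: float,
                        quorum_pct: int = 80, timeout_s: float = 120.0) -> str:
    """Decide what the aggregator should do for the current outer step.

    Returns "aggregate" once enough nodes have submitted, "timeout" if the
    deadline passed without a quorum, and "wait" otherwise.
    """
    # Integer ceiling of expected * quorum_pct / 100, avoiding float rounding.
    needed = -(-expected * quorum_pct // 100)
    if submitted >= needed:
        return "aggregate"  # stragglers are skipped for this round
    if elapsed_s >= timeout_s:
        return "timeout"    # caller decides whether to retry or abort the round
    return "wait"
```

With 18 expected participants the quorum is 15 nodes, so `outer_step_decision(15, 18, 10.0)` aggregates immediately while `outer_step_decision(14, 18, 10.0)` keeps waiting.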

Gradient staleness occurs when some nodes run more outer steps than others before synchronizing. This happens in async OpenDiLoCo runs when fast nodes proceed without waiting for slow ones. The staleness accumulates across outer steps. Mitigation: use hivemind's step-count tracking to limit staleness to at most max_stale_steps: 2 outer steps. Beyond that, force a synchronization barrier.
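One way to read the staleness cap is as a guard evaluated before a node begins another outer step. The sketch below is an interpretation under that assumption; it does not use hivemind's API, and the function name is hypothetical.

```python
def may_start_next_outer_step(own_step: int, slowest_peer_step: int,
                              max_stale_steps: int = 2) -> bool:
    """True if finishing one more outer step keeps this node within the cap.

    When this returns False, the node should block at a synchronization
    barrier until slower peers catch up.
    """
    lead_after_next_step = (own_step + 1) - slowest_peer_step
    return lead_after_next_step <= max_stale_steps
```

A node at outer step 10 with the slowest peer at step 9 may continue, because it would end up 2 steps ahead; at step 11 it has to wait.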

Malicious participants matter more for open contributor pools than private runs. Pluralis' signed-gradient verification and statistical outlier detection catch most adversarial submissions. Prime Intellect's aggregation filters out gradients whose norm deviates more than 3 standard deviations from the median across participants. Nous Psyche relies on Solana-based consensus to score and exclude poor contributors.
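The norm-based filter described above can be sketched with the standard library. This is a minimal illustration of the rule as stated (distance from the median, measured in standard deviations), not Prime Intellect's implementation; the function names and the dict-of-norms input are assumptions.

```python
import math
import statistics
from typing import Dict, List


def l2_norm(vec: List[float]) -> float:
    """Euclidean norm of a flat pseudo-gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


def filter_by_norm(norms: Dict[str, float], max_sigma: float = 3.0) -> List[str]:
    """Return the node IDs whose norm lies within max_sigma std devs of the median."""
    values = list(norms.values())
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0.0:
        return list(norms)  # all norms identical, nothing to reject
    return [node for node, v in norms.items() if abs(v - median) <= max_sigma * spread]
```

One caveat with any rule of this shape: with only a handful of participants, a single large outlier inflates the standard deviation enough to hide itself, so the filter is most useful in larger swarms.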

Outer-step divergence is subtle. If two nodes get out of sync (e.g., one restarted from an earlier checkpoint), their outer step counts diverge, and the pseudo-gradients they exchange become semantically inconsistent. Always save and restore both the model weights AND the outer step counter. Prime Intellect's checkpoint format includes the outer step count by default; Pluralis requires explicitly saving the run_state.json alongside the model weights.
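A minimal sketch of keeping the outer step counter and the weights in one atomically written file is shown below. The JSON layout and function names are hypothetical and do not reflect either protocol's checkpoint format; real checkpoints would store tensors and optimizer state rather than a list of floats.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List, Tuple


def save_checkpoint(ckpt_dir: str, weights: List[float], outer_step: int) -> Path:
    """Write weights and the outer step counter together, atomically."""
    directory = Path(ckpt_dir)
    directory.mkdir(parents=True, exist_ok=True)
    payload = {"outer_step": outer_step, "weights": weights}
    # Write to a temp file first so a preemption mid-write cannot corrupt
    # the last good checkpoint, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    final_path = directory / "checkpoint.json"
    os.replace(tmp_path, final_path)
    return final_path


def load_checkpoint(ckpt_dir: str) -> Tuple[List[float], int]:
    """Restore both the weights and the outer step they belong to."""
    with open(Path(ckpt_dir) / "checkpoint.json") as f:
        payload = json.load(f)
    return payload["weights"], payload["outer_step"]
```

Storing both values in a single file means a restart can never pair new weights with an old step count, or the reverse.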

Spot preemption is specific to runs on cloud spot instances. When a provider reclaims a spot GPU, the node disappears mid-run with little or no warning. All three protocols handle this the same way: checkpoint at every outer DiLoCo step, then auto-restart the node from the latest checkpoint. Since outer steps happen every 500 local steps, a preemption then costs at most one outer step of work. The example configs above save every 10 outer steps, so lower save_every_outer_steps to 1 if you want that bound. Prime Intellect's hivemind DHT coordinates automatic re-entry of a recovered node; Pluralis and Nous Psyche require restarting the client with the same run ID to rejoin the swarm.

Failure mode	Protocol	Mitigation
Straggler nodes	All	Majority-threshold sync timeout
Gradient staleness	Prime Intellect (async)	max_stale_steps cap in hivemind
Malicious participants	All	Signed gradients (Pluralis), norm-outlier filtering (Prime Intellect), Solana-based scoring (Psyche)
Outer-step divergence	All	Checkpoint outer step counter alongside weights
Spot preemption	All	Checkpoint every outer step, auto-restart

Decentralized training protocols like Prime Intellect and Pluralis are designed for geographically distributed nodes. Spheron's on-demand and spot H100, H200, and B200 instances make it practical to join a training run or run your own swarm without a datacenter contract.

Quick Setup Guide

Set up a Prime Intellect pretraining node on H100 (INTELLECT-1 style)
Install the prime-intellect client, configure your optimizer and batch size for DiLoCo outer steps, point your node at the Prime Intellect DHT bootstrap peers, and start the training process. The DHT assigns your node a data shard and outer step schedule automatically.
Join a Pluralis swarm as a signed-gradient participant
Clone the Pluralis repo, configure your node's keypair for gradient signing, set your bandwidth and VRAM allocation, and run the swarm coordinator client. Pluralis handles shard assignment and peer discovery via its gossip protocol.
Deploy a Nous Psyche node with DisTrO optimizer
Set up a Solana wallet with solana-keygen, configure your .env file with your wallet path and RPC endpoints, and start the training daemon with run-manager. Psyche's Solana-based consensus scores your node's contributions and gates rewards.

Frequently Asked Questions

What is decentralized LLM training and how does it differ from traditional distributed training?

Traditional distributed training runs inside a single datacenter over fast InfiniBand or NVLink fabric (400Gb/s+). Decentralized training runs across nodes in different datacenters or even different countries, using the public internet as the communication medium. Protocols like DiLoCo and OpenDiLoCo let nodes run hundreds of local gradient steps (H=500 in this guide) between global synchronization rounds, reducing cross-node communication by 500x compared to standard data parallelism. This makes internet-bandwidth connections (10-100Gb/s) viable for meaningful pretraining contributions.

What hardware do I need to join a Pluralis or Prime Intellect training run?

Requirements differ by protocol: Pluralis and Nous Psyche accept 40GB+ GPUs, while Prime Intellect requires 80GB (H100 or A100). Pluralis and Nous Psyche need at least 10Gb/s of network bandwidth, and Prime Intellect needs 25Gb/s. Latency tolerance also varies: Prime Intellect requires under 100ms, while Pluralis and Nous Psyche tolerate up to 150ms and 200ms respectively. For Prime Intellect, a single H100 SXM5 node (8x GPUs) can run a full DiLoCo outer step in under 2 minutes for 8B-class models. Nous Psyche has the lowest entry barrier: even nodes with 40GB VRAM can participate in smaller tasks.

How does DiLoCo reduce communication overhead in decentralized training?

DiLoCo (Distributed Low-Communication) replaces per-step gradient all-reduce with a two-level optimization loop. Each worker runs H=500 local steps with an inner optimizer (AdamW). After H steps, workers exchange pseudo-gradients (the difference between the current weights and the weights at the start of those H steps), and an outer optimizer (SGD with Nesterov momentum) applies the aggregated update. This reduces synchronization frequency by 500x versus standard data parallelism while matching single-datacenter training loss curves within 1-2%.

What is the cost difference between decentralized training and single-datacenter FSDP?

For a 7B model trained on 100B tokens, a 4-node internet-distributed setup using spot H100 GPUs costs roughly 40-55% less than an equivalent single-datacenter FSDP job on reserved clusters. The savings come from using spot pricing across multiple providers, eliminating InfiniBand reservation costs, and tolerating geographic distribution. The tradeoff is longer wall-clock time: with 500 local steps per sync round and ~150ms cross-region latency, end-to-end throughput is 15-20% lower than collocated training.

Can I use Spheron's spot GPU instances for Prime Intellect or Pluralis nodes?

Yes, with checkpointing. Both Prime Intellect and Pluralis support node recovery: if a participant drops mid-run, the global model state is preserved and the remaining nodes continue. For Spheron spot instances, set up automatic checkpoint saving every outer DiLoCo step (roughly every 500 local steps). If the spot node is preempted, launch a replacement node from the same checkpoint. Prime Intellect's hivemind DHT coordinates recovery automatically. Nous Psyche uses Solana-based consensus for the same purpose.