Training a 70B base model to fluently reason within a proprietary domain is different from fine-tuning it to follow instructions. SFT changes how the model responds. Continuous pretraining (CPT) changes what the model knows at the weight level, by exposing it to domain text using the same next-token prediction objective used in original pretraining. This makes CPT the right tool for enterprises injecting legal contracts, clinical notes, financial filings, or scientific papers into frontier models, where vocabulary internalization and syntactic fluency matter, not just instruction compliance.
The cost is real: CPT runs are measured in GPU-days to GPU-weeks, and they carry a meaningful risk of catastrophic forgetting if done naively. This guide covers the full picture: data composition, learning-rate scheduling, forgetting prevention, hardware sizing, tooling, and a worked multi-node B200 recipe. For the underlying multi-node infrastructure setup, see our distributed LLM training guide. For the upstream data preparation step, see the AI pretraining data curation guide covering NeMo Curator, Datatrove, and FineWeb-Edu pipelines on GPU cloud.
CPT vs SFT vs DPO vs Distillation: Which Technique for Which Problem
Before spending GPU-hours on CPT, confirm it is actually what you need. Each technique changes a different thing:
| Technique | What It Changes | When to Use | Typical Token Budget | Memory Overhead vs Base |
|---|---|---|---|---|
| CPT | Vocabulary and syntactic internalization of domain corpus | Model regularly reasons over domain text (legal, clinical, scientific) | 1B-500B tokens | Same as pretraining |
| SFT | Instruction following and output format | Base model knows the domain but responds wrong | 1M-100M tokens | Same as pretraining |
| DPO | Preference alignment and response style | After SFT, to align preferences | 50K-500K pairs | Same as pretraining |
| Distillation | Compress larger teacher into smaller student | Deployment cost matters more than raw capability | Varies | Lower (smaller model) |
| RAG | Retrieval-augmented generation | Corpus changes frequently, staleness is acceptable | N/A | Embedding + vector DB |
The diagnostic question: run your base model on a sample of domain text and measure perplexity. If perplexity is high (the model is frequently surprised by the domain vocabulary), CPT is warranted. If perplexity is low but the model gives wrong answers, start with SFT.
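A minimal sketch of that diagnostic with Hugging Face transformers (the checkpoint ID and the domain_sample.txt path are placeholders for your own base model and held-out domain text):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: substitute your base checkpoint and a sample of raw domain text.
model_id = "meta-llama/Llama-3.1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

text = open("domain_sample.txt").read()
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)

with torch.no_grad():
    # labels=input_ids gives the mean next-token cross-entropy over the sample
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"domain perplexity: {math.exp(loss.item()):.1f}")
```

A held-out sample of a few hundred documents is enough for this check; you are comparing orders of magnitude, not chasing decimal points.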
DPO fine-tuning handles preference alignment without any online generation and is the right choice after SFT when you have human preference data. RLHF pipelines add the full actor-critic loop and are appropriate when you need a general-purpose scalar reward signal. Neither addresses domain vocabulary internalization - only CPT does.
One important clarification: CPT is not a substitute for RAG when documents change frequently. Legal regulations update. Clinical guidelines get revised. CPT produces a model fluent in a domain as it existed at training time. For dynamic knowledge, RAG plus a CPT-adapted base is often better than CPT alone.
Data Composition for CPT
Dataset construction choices have more impact on final quality than hyperparameter tuning. Getting the domain-to-replay ratio wrong by 20 percentage points often costs more than a suboptimal learning rate.
Target-Domain Ratio
The domain-specific data should be 70-90% of total tokens during CPT, not 50/50. Below 60% domain ratio, the perplexity improvement on domain text is negligible after a standard CPT run. The model needs sustained exposure to internalize domain vocabulary co-occurrence statistics. At 50% domain, general data keeps "refreshing" what the domain data just taught, and the model never fully shifts its priors.
Practical targets by domain breadth:
- Narrow specialty (rare disease clinical notes, jurisdiction-specific case law): 85-90% domain
- Moderate breadth (general clinical notes, SEC filings across industries): 75-85% domain
- Broad domain (all financial documents, all biomedical literature): 70-80% domain
Replay Buffer
15-30% of tokens should come from the original pretraining distribution (The Pile, FineWeb, or similar) to prevent catastrophic forgetting of general capabilities. This is non-negotiable for production CPT runs.
Without replay, held-out general benchmarks degrade significantly after a 50B-token CPT run:
| Domain Ratio | Replay % | MMLU Regression (pp) | HellaSwag Regression (pp) |
|---|---|---|---|
| 100% domain | 0% | 12-18 | 10-15 |
| 90% domain | 10% | 6-10 | 5-9 |
| 80% domain | 20% | 2-4 | 2-3 |
| 70% domain | 30% | 1-2 | 1-2 |
These estimates are based on published Llama-2 CPT research. The 80/20 split is the standard starting point.
Tokenizer Extension
When the domain corpus has a high out-of-vocabulary rate (more than 5% of corpus tokens are common domain terms that the base tokenizer fragments into multi-token splits), consider extending the base tokenizer. The steps:
- Train a BPE tokenizer on the domain corpus alone.
- Merge its vocabulary with the base tokenizer using SentencePiece merge or HuggingFace `tokenizer.add_tokens()`.
- Initialize new embedding rows by averaging the sub-token embeddings that previously represented each new token. Do not use random initialization - random init causes gradient instability in the first few hundred steps.
One critical detail: some Llama variants (e.g. Llama 3.2 1B/3B) tie the input and output embedding matrices; Llama 3.1 70B does not (`tie_word_embeddings: false` in its HF config). Always check `config.tie_word_embeddings` before resizing: if true, `model.embed_tokens.weight` and `model.lm_head.weight` reference the same tensor and you must update both consistently. Miss this and you get either a shape mismatch at the logit layer or, worse, silently corrupted logits.
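A hedged sketch of the merge and mean-initialization steps with Hugging Face transformers; the new-token list is hypothetical, and `resize_token_embeddings` already keeps tied weights consistent, but the explicit `tie_word_embeddings` check mirrors the warning above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Hypothetical domain terms surfaced by the domain-only BPE training step
new_tokens = ["tachyarrhythmia", "hepatosplenomegaly"]
# Record how each new token splits under the OLD tokenizer before extending it
old_splits = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
print("tied embeddings:", model.config.tie_word_embeddings)  # check before resizing
model.resize_token_embeddings(len(tokenizer))

# Mean-initialize each new row from the sub-token embeddings it replaces
emb = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        emb[new_id] = emb[old_splits[tok]].mean(dim=0)
        if not model.config.tie_word_embeddings:
            lm_head[new_id] = lm_head[old_splits[tok]].mean(dim=0)
```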
Tokenizer extension is only worth doing if the domain corpus is over 10B tokens and the OOV rate exceeds 5%. Below that threshold, the training overhead of new embedding initialization outweighs the per-token efficiency gain.
Learning-Rate Scheduling for CPT
The Linear Warmup Trap
Many practitioners copy a standard fine-tuning warmup schedule (1-5% of total steps at linear ramp) into CPT runs. This is wrong for CPT at scale.
Fine-tuning warmup percentages made sense when runs were 1,000-5,000 steps. CPT runs for 10,000-100,000+ steps, and percentage-based warmup scales with run length even though what the model needs does not. A pretrained model starts with well-formed weights but a freshly initialized optimizer state: it needs a few hundred steps at reduced LR for AdamW's moment estimates to stabilize before the full gradient signal starts reshaping learned attention patterns, and that number does not grow with the token budget. At 1-5% warmup on a 50,000-step run you ramp for 500-2,500 steps, burning tokens at suboptimal learning rates without buying extra stability, and the effective warmup length silently changes whenever the step count does. Cut the ramp too short at full peak LR and the model overshoots early and spends the next few thousand steps recovering.
The fix is simple: use an absolute step count for warmup (100-300 steps), not a percentage.
Recommended Schedule
- Peak LR: 10-20% of original pretraining peak LR. For Llama 3.1 70B (original lr=3e-4): use 3e-5 to 6e-5.
- Warmup: 200 steps linear warmup.
- Decay: cosine to a non-zero floor of 10% of peak LR. Do not decay to zero. A zero floor pushes the model into a local minimum that overfits to domain data. Set `eta_min` explicitly.
- Optional: cosine annealing with warm restarts (T=5,000-10,000 steps). Each restart briefly spikes the LR, helping escape local minima and improving diversity on downstream tasks.
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - warmup_steps,
    eta_min=peak_lr * 0.1,  # non-zero floor
)
```

For the warmup phase, apply linear scaling manually or use a combined scheduler that handles warmup separately from cosine decay.
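One way to package both phases, as suggested above, is torch's `SequentialLR`; a sketch with the recommended settings (the Linear module is a stand-in for the sharded 70B model, and the step counts are illustrative):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

peak_lr = 3e-5             # ~10% of Llama 3.1 70B's original 3e-4 peak
warmup_steps = 200         # absolute count, not a percentage
num_training_steps = 25_000

model = torch.nn.Linear(8, 8)  # stand-in for the sharded 70B module
optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(
    optimizer, T_max=num_training_steps - warmup_steps, eta_min=peak_lr * 0.1  # non-zero floor
)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Call scheduler.step() once per optimizer step. For the warm-restarts variant,
# swap CosineAnnealingLR for CosineAnnealingWarmRestarts(optimizer, T_0=5_000,
# eta_min=peak_lr * 0.1).
```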
Catastrophic Forgetting: Measurement and Prevention
Measuring Forgetting
Establish baselines on a regression suite before the first training step:
- MMLU (general knowledge, 57 subjects)
- HellaSwag (commonsense reasoning)
- ARC-Challenge (science reasoning)
- GSM8K (math word problems)
- HumanEval (code completion)
Log all five every 5,000 training steps. A drop above 3 percentage points on any benchmark is a warning signal. Catching this at step 5,000 costs a checkpoint restart. Catching it at step 40,000 costs weeks of retraining.
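One way to automate that cadence is the lm-evaluation-harness Python API; a sketch, assuming an HF-format checkpoint on disk (the checkpoint path and output file are placeholders, and HumanEval is omitted here because it needs generation with code execution enabled):

```python
import json
import lm_eval  # pip install lm-eval

# Placeholder path for the step-5,000 checkpoint snapshot
checkpoint = "/checkpoints/step_5000"

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={checkpoint},dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge", "gsm8k"],
    batch_size=8,
)

# Persist per-task scores so regressions can be diffed against the step-0 baseline
with open("regression_step_5000.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```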
Replay Buffer
The primary prevention mechanism. Keep a 20% replay stream of general-domain tokens mixed into every training batch. This is more effective than any regularization technique for 70B models, and it adds no computational overhead beyond the additional I/O for the replay data.
Elastic Weight Consolidation (EWC)
EWC adds a regularization term penalizing weight changes proportional to each weight's importance during original pretraining. The penalty:
L_EWC = λ * sum(F_i * (θ_i - θ*_i)^2)
Where F_i is the diagonal Fisher information estimate, θ_i is the current weight, and θ*_i is the pretrained checkpoint weight.
The catch: computing the Fisher matrix for a 70B model requires a full forward pass on thousands of samples and storing 70B float32 importance weights, approximately 280 GB in FP32. This is computationally expensive and memory-infeasible on most multi-node clusters. EWC is practical for 7B-13B models where the Fisher computation is tractable. For 70B, use replay buffers instead.
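For the 7B-13B regime where the Fisher estimate is tractable, the penalty is a few lines on top of the language-modeling loss; a sketch, assuming fisher_diag and pretrained_params are precomputed dicts keyed by parameter name:

```python
import torch

def ewc_penalty(model, fisher_diag, pretrained_params, lam=0.1):
    """lam * sum_i F_i * (theta_i - theta*_i)^2 over parameters with a Fisher estimate."""
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, param in model.named_parameters():
        if name in fisher_diag:
            delta = param - pretrained_params[name].to(device)
            penalty = penalty + (fisher_diag[name].to(device) * delta.pow(2)).sum()
    return lam * penalty

# During CPT on a model small enough for EWC:
# loss = lm_loss + ewc_penalty(model, fisher_diag, pretrained_params)
```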
LoRA-CPT
An alternative to full-parameter CPT for smaller domain corpora (under 10B tokens). Apply LoRA adapters during CPT: only the adapter parameters update, so base model weights are preserved. Catastrophic forgetting cannot happen by construction because the base weights never change.
The tradeoff is fundamental: LoRA-CPT cannot fully internalize new vocabulary or deep syntactic patterns into the model's core representations. It is more accurately described as "CPT-flavored SFT" since it changes the model's output behavior conditioned on domain text, not its internal knowledge representations.
Use LoRA-CPT when:
- Domain corpus is 1B-10B tokens
- Catastrophic forgetting prevention is the primary concern
- You need to swap domain adapters at inference time
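If you do go this route, the adapter setup is a standard peft config; a sketch with illustrative hyperparameters (the rank, alpha, and target modules are common defaults, not a tuned recipe):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,              # higher rank than typical SFT, since CPT targets broad fluency
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # base weights stay frozen; only adapters update
```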
For fine-tuning framework support covering LoRA-CPT across Axolotl, Unsloth, and torchtune, see our framework comparison guide.
Hardware Planning: B200 Count for 70B CPT at 100B Tokens
This section works through the memory math explicitly so you can size your cluster before starting.
Memory breakdown for 70B full-parameter CPT in BF16:
| Component | Size (GB) |
|---|---|
| Model weights (BF16) | 140 |
| Optimizer states (AdamW FP32 m+v) | 560 |
| Gradients (BF16) | 140 |
| Activations (activation checkpointing, saves ~80%) | 30-50 |
| Total unsharded | ~870-890 GB |
With 8x B200 SXM (192 GB each = 1,536 GB total), FSDP with `reshard_after_forward=True` shards the ~840 GB of weights, gradients, and optimizer state across 8 GPUs, roughly 105 GB per GPU. Adding the 30-50 GB of per-GPU activation memory keeps peak usage well under the 192 GB budget, with meaningful headroom for NCCL buffers and fragmentation.
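The same arithmetic as a quick sizing script, using the assumptions from the table (AdamW moments in FP32, 30-50 GB of checkpointed activations per GPU):

```python
# Back-of-envelope sizing for full-parameter BF16 CPT of a 70B model on one 8x B200 node
PARAMS = 70e9
GB = 1e9  # the table above uses decimal gigabytes

weights_bf16 = PARAMS * 2 / GB        # 140 GB
optimizer_fp32 = PARAMS * 4 * 2 / GB  # 560 GB: AdamW m and v, both FP32
grads_bf16 = PARAMS * 2 / GB          # 140 GB
activations_per_gpu = 40              # GB, midpoint of the 30-50 GB checkpointed estimate

sharded_per_gpu = (weights_bf16 + optimizer_fp32 + grads_bf16) / 8  # FSDP over 8 GPUs
peak_per_gpu = sharded_per_gpu + activations_per_gpu

print(f"sharded states per GPU: {sharded_per_gpu:.0f} GB")       # ~105 GB
print(f"peak per GPU with activations: {peak_per_gpu:.0f} GB")   # ~145 GB vs 192 GB on B200
```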
Token throughput and wall-clock estimate:
A single 8x B200 node with flash attention, gradient checkpointing, BF16 mixed precision, and batch size tuned for sequence length 4,096 achieves roughly 8,000-12,000 tokens/sec aggregate for a 70B model. At 10,000 tokens/sec and a 62.5B total token run (50B domain, 12.5B replay):

62.5B tokens / 10,000 tokens/sec = 6,250,000 seconds ≈ 72 days (single node)

That is too slow for production use. Scale the cluster:
| Token Budget | Domain % | Replay % | Recommended Setup | Wall-Clock (est.) |
|---|---|---|---|---|
| 10B tokens | 80% | 20% | 8x B200 (1 node) | ~11-12 days |
| 50B tokens | 80% | 20% | 4 nodes (32x B200) | ~14-15 days |
| 100B tokens | 80% | 20% | 8 nodes (64x B200) | ~14-15 days |
Throughput estimates are for sequence length 4,096 with activation checkpointing enabled. At sequence length 8,192, throughput drops approximately 30-40%, so budget accordingly for long-context domain corpora.
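The wall-clock column follows from straightforward throughput arithmetic; a small helper to re-derive it for other budgets (the 10,000 tok/s per-node figure is the single-node estimate above, and scaling_efficiency can be lowered to model multi-node losses):

```python
def wall_clock_days(token_budget: float, nodes: int,
                    tokens_per_sec_per_node: float = 10_000,
                    scaling_efficiency: float = 1.0) -> float:
    """Rough wall-clock estimate in days for a CPT run at a given cluster size."""
    aggregate_tps = nodes * tokens_per_sec_per_node * scaling_efficiency
    return token_budget / aggregate_tps / 86_400  # seconds per day

print(f"{wall_clock_days(10e9, nodes=1):.1f} days")    # ~11.6 days on 1 node
print(f"{wall_clock_days(50e9, nodes=4):.1f} days")    # ~14.5 days on 4 nodes
print(f"{wall_clock_days(100e9, nodes=8):.1f} days")   # ~14.5 days on 8 nodes
```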
See Spheron B200 rental for InfiniBand-enabled clusters, or H200 on Spheron as the next-tier option with 141 GB HBM3e per GPU (enough to hold 70B BF16 weights for single-GPU inference, though practical serving at that size usually needs quantized weights to leave KV-cache headroom; training still requires FSDP).
Tooling Walkthrough
torchtune
torchtune's full_finetune_distributed recipe supports CPT with minimal changes from its SFT configuration. The key difference is using TextCompletionDataset instead of an instruction-format dataset, and configuring a blended corpus with dataset weights.
```yaml
# torchtune CPT config (llama3_1/70B_full.yaml base, CPT variant)
model:
  _component_: torchtune.models.llama3_1.llama3_1_70b

tokenizer:
  _component_: torchtune.models.llama3_1.llama3_tokenizer
  path: /models/llama3-1-70b/original/tokenizer.model

dataset:
  _component_: torchtune.datasets.ConcatDataset
  datasets:
    - _component_: torchtune.datasets.TextCompletionDataset
      source: /data/domain/train.jsonl
      column: text
      max_seq_len: 4096
    - _component_: torchtune.datasets.TextCompletionDataset
      source: /data/replay/general.jsonl
      column: text
      max_seq_len: 4096
  weights: [0.8, 0.2]  # domain 80%, replay 20%

optimizer:
  _component_: torch.optim.AdamW
  lr: 3e-5  # 10% of original 3e-4 pretraining LR

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 200
  num_training_steps: ${training.max_steps}
  min_lr: 3e-6  # non-zero cosine floor

training:
  gradient_checkpointing: True
  batch_size: 2
  gradient_accumulation_steps: 8
  max_steps: 25000
```

torchtune handles FSDP2 sharding automatically when launched with torchrun. For multi-node, add `--nnodes` and `--master_addr` to the launcher command.
Megatron-Core
Megatron-Core is the right choice for CPT runs on models larger than 70B or multi-node clusters requiring 3D parallelism (tensor, pipeline, and data parallelism together). Use the pretrain_gpt.py script with the --finetune flag and a custom data blend:
```bash
python pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --seq-length 4096 \
    --max-position-embeddings 131072 \
    --train-iters 25000 \
    --lr 3e-5 \
    --min-lr 3e-6 \
    --lr-warmup-iters 200 \
    --lr-decay-style cosine \
    --data-path 0.8 /data/domain/processed 0.2 /data/replay/general \
    --finetune \
    --load /checkpoints/llama3-70b-megatron/
```

The `--finetune` flag loads the pretrained checkpoint but does not restore the optimizer state, which is the correct behavior for CPT: you want fresh optimizer momentum rather than momentum accumulated during the original pretraining run.
Hugging Face Transformers
For teams already on the HF stack, Trainer supports CPT with minimal changes. Use DataCollatorForLanguageModeling(tokenizer, mlm=False) as the data collator to enable causal LM training. Build a blended corpus using IterableDataset with interleaved sampling from domain and replay sources.
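A sketch of that blend with the datasets library, assuming each JSONL record has a text field (paths and the tokenizer ID are placeholders; the 80/20 probabilities match the replay guidance earlier):

```python
from datasets import interleave_datasets, load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")  # placeholder

# Stream both corpora so the blend never has to fit in RAM
domain = load_dataset("json", data_files="/data/domain/train.jsonl", split="train", streaming=True)
replay = load_dataset("json", data_files="/data/replay/general.jsonl", split="train", streaming=True)

blended = interleave_datasets(
    [domain, replay],
    probabilities=[0.8, 0.2],  # 80% domain, 20% replay
    seed=42,
    stopping_strategy="all_exhausted",
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_stream = blended.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM labels
# Pass train_stream as train_dataset and collator as data_collator to Trainer
```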
For throughput at 70B scale, HF Trainer with FSDP Accelerate is usable but significantly slower than Megatron or torchtune. Plan for 20-30% lower tokens/sec compared to a well-tuned Megatron setup. If the run is short (under 10B tokens) and the team is already on HF tooling, the convenience often outweighs the throughput gap. For 50B+ token runs, the throughput difference compounds into days of additional wall-clock time.
For multi-node launcher patterns, see our distributed LLM training guide.
Evaluation: Domain Benchmarks + General Regression
CPT evaluation requires two parallel tracks, and you must run both. Skipping general regression is how teams end up with a model that aces domain evals and fails basic arithmetic.
Domain Benchmark Track
Build or use existing benchmarks for your target domain:
- Legal: LegalBench, LexGLUE (legal classification and question-answering tasks built on ECtHR, EUR-LEX, SCOTUS, and other legal corpora)
- Biomedical: MedQA (USMLE format), PubMedQA (yes/no/maybe questions from PubMed abstracts)
- Finance: FinanceBench (question answering over 10-K filings), ConvFinQA (numerical reasoning)
- Code: HumanEval, SWE-bench Verified
Measure at every major checkpoint. Perplexity on held-out domain text is a fast proxy metric to monitor every 1,000 steps.
General Regression Track
Run MMLU, HellaSwag, ARC-Challenge, GSM8K, and HumanEval on every fifth checkpoint. A properly tuned CPT run with 20% replay should show no more than 1-2 percentage point absolute drop on MMLU and HellaSwag. More than 3 points means the replay ratio is too low.
Evaluation cadence:
| Steps | Action |
|---|---|
| Baseline (step 0) | Both tracks, save all scores |
| Every 1,000 steps | Training loss and domain perplexity |
| Every 5,000 steps | Full domain benchmark eval |
| Every 5,000 steps (offset by 2,500) | Full general regression eval |
| Final checkpoint | Both tracks + human eval on 50 domain prompts |
For automated judge-model evaluation of open-ended domain responses (beyond multiple-choice benchmarks), a self-hosted LLM judge running on spot instances is cost-effective at this evaluation frequency.
Cost Economics: CPT vs API Fine-Tuning vs RAG
A complete cost comparison for a 50M-token enterprise corpus (realistic for a medium-sized legal or biomedical use case):
| Approach | One-Time Setup | Recurring Cost | Inference Latency | Staleness Risk |
|---|---|---|---|---|
| CPT (50B token run, 8x B200) | GPU compute for run | Inference only post-training | Low (~20ms) | Retrain on corpus updates |
| API fine-tuning (GPT-4o, Claude) | Low | Per-token or per-call | Varies | Model updates may shift behavior |
| RAG (self-hosted) | Embedding + vector DB setup | Query compute per call | +50-200ms (retrieval) | Near-real-time updates |
Worked cost example for a 70B CPT run on Spheron B200:
Live B200 pricing as of today: $1.71/hr per GPU (lowest on-demand).
Configuration: 8 nodes x 8x B200 = 64 GPUs
Wall-clock: ~14-15 days for 100B tokens
Hours: 14.5 days x 24 hours = 348 hours
Cost: 64 GPUs x $1.71/hr x 348 hours = $38,085.12

Compare to a hyperscaler equivalent: GCP A3 Mega (8x H100) at approximately $32/hr per node, 8 nodes:
8 nodes x $32/hr/node x 348 hours = $89,088

The CPT run on Spheron costs roughly half the hyperscaler rate, with no reservation commitment required. The Spheron model is per-hour, per-GPU, so you pay only for the hours the cluster is running. A hyperscaler reservation that delivers similar guaranteed availability would require a 1- or 3-year commitment at contract pricing.
For teams running multiple CPT experiments (different domain corpus compositions, replay ratios, LR schedules), the spot pricing model is even more compelling. See our spot GPU training case study for a worked example of a team training a 70B model using spot instances for interruptible CPT checkpoint restarts.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 10 May 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron Deployment: Multi-Node B200 CPT Recipe
Cluster Setup
- Provision 8 nodes of 8x B200 SXM on Spheron with InfiniBand networking enabled.
- Configure SSH trust between nodes. Set `MASTER_ADDR` to node 0's private IP and `MASTER_PORT=29500`.
- Mount shared storage (NFS or Spheron volume) at `/data` for the dataset and `/checkpoints` for checkpoint writes.
- Verify InfiniBand with `ibstat` and NVLink with `nvidia-smi nvlink --status` before starting training.
NVLink + InfiniBand for FSDP/Megatron Sharding
Two distinct interconnects handle different communication patterns:
- Within each node: Tensor parallelism (TP=8) uses NVLink 5, which provides 1.8 TB/s bidirectional per-GPU bandwidth on B200 SXM (900 GB/s per direction). All-to-all within a node is fast enough to not be the bottleneck.
- Across nodes: Pipeline parallelism (PP=8, one pipeline stage per node) and FSDP collectives (all-gather for parameters, reduce-scatter for gradients) cross InfiniBand. B200 nodes typically come with 400 Gb/s NDR InfiniBand per port.
NCCL configuration for InfiniBand:
```bash
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TIMEOUT=23
export NCCL_NET_GDR_READ=1
export NCCL_SOCKET_IFNAME=eth0
```

For NCCL socket buffer sizes, tree vs ring algorithm selection, and IB congestion control, see our full NCCL tuning guide.
Reproducible CPT Recipe
```bash
#!/bin/bash
# 8-node B200 CPT launch for 70B model
NNODES=8
NPROC_PER_NODE=8
MASTER_ADDR=$1   # pass node 0's IP as the first argument
NODE_RANK=$2     # pass this node's rank (0-7) as the second argument
MASTER_PORT=29500

torchrun \
    --nnodes=$NNODES \
    --nproc_per_node=$NPROC_PER_NODE \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    pretrain.py \
    --config configs/llama3_70b_cpt.yaml \
    --resume_from_checkpoint /checkpoints/latest/
```

Checkpoint strategy: Save every 1,000 steps in safetensors format. Safetensors is memory-mapped, so loading a checkpoint for inspection or resumption does not require loading the full model into CPU RAM first. Keep the last 3 checkpoints on disk and archive earlier ones to cold storage. CPT runs on cloud hardware can be preempted (especially if using spot instances for non-training nodes), so a 1,000-step save cadence means the worst-case restart cost is under 2 hours of compute.
Monitoring during the run: Log training loss, domain perplexity (on held-out domain text), learning rate, and gradient norm every 50 steps. A sudden spike in gradient norm that does not recover within 200 steps usually indicates the LR is too high or the replay buffer is not adequately shuffled.
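A sketch of that gradient-norm check as a small helper (the rolling-baseline window, 3x spike threshold, and 200-step patience are illustrative):

```python
from collections import deque

class GradNormMonitor:
    """Flags a gradient-norm spike that stays elevated for `patience` consecutive steps."""

    def __init__(self, baseline_window=500, spike_factor=3.0, patience=200):
        self.history = deque(maxlen=baseline_window)
        self.spike_factor = spike_factor
        self.patience = patience
        self.elevated_steps = 0

    def update(self, grad_norm: float) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else grad_norm
        self.history.append(grad_norm)
        if grad_norm > self.spike_factor * baseline:
            self.elevated_steps += 1
        else:
            self.elevated_steps = 0
        # True means the spike has not recovered within `patience` steps:
        # check the LR and the replay-stream shuffling before continuing.
        return self.elevated_steps >= self.patience

# In the training loop, after gradient clipping:
# norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# if monitor.update(float(norm)):
#     logger.warning("gradient norm spike has not recovered within 200 steps")
```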
Continuous pretraining runs on 70B models take days to weeks of wall-clock time on multi-node clusters. Spheron's B200 and H200 reserved clusters provide InfiniBand fabric and transparent per-hour pricing without hyperscaler lock-in.
Rent B200 on Spheron → | Rent H200 on Spheron → | View pricing →
