Training a 70B base model to fluently reason within a proprietary domain is different from fine-tuning it to follow instructions. SFT changes how the model responds. Continuous pretraining (CPT) changes what the model knows at the weight level, by exposing it to domain text using the same next-token prediction objective used in original pretraining. This makes CPT the right tool for enterprises injecting legal contracts, clinical notes, financial filings, or scientific papers into frontier models, where vocabulary internalization and syntactic fluency matter, not just instruction compliance.
The cost is real: CPT runs are measured in GPU-days to GPU-weeks, and they carry a meaningful risk of catastrophic forgetting if done naively. This guide covers the full picture: data composition, learning-rate scheduling, forgetting prevention, hardware sizing, tooling, and a worked multi-node B200 recipe. For the underlying multi-node infrastructure setup, see our distributed LLM training guide. For the upstream data preparation step, see the AI pretraining data curation guide covering NeMo Curator, Datatrove, and FineWeb-Edu pipelines on GPU cloud.
CPT vs SFT vs DPO vs Distillation: Which Technique for Which Problem
Before spending GPU-hours on CPT, confirm it is actually what you need. Each technique changes a different thing:
| Technique | What It Changes | When to Use | Typical Token Budget | Memory Overhead vs Base |
|---|---|---|---|---|
| CPT | Vocabulary and syntactic internalization of domain corpus | Model regularly reasons over domain text (legal, clinical, scientific) | 1B-500B tokens | Same as pretraining |
| SFT | Instruction following and output format | Base model knows the domain but responds wrong | 1M-100M tokens | Same as pretraining |
| DPO | Preference alignment and response style | After SFT, to align preferences | 50K-500K pairs | Same as pretraining |
| Distillation | Compress larger teacher into smaller student | Deployment cost matters more than raw capability | Varies | Lower (smaller model) |
| RAG | Retrieval-augmented generation | Corpus changes frequently, staleness is acceptable | N/A | Embedding + vector DB |
The diagnostic question: run your base model on a sample of domain text and measure perplexity. If perplexity is high (the model is frequently surprised by the domain vocabulary), CPT is warranted. If perplexity is low but the model gives wrong answers, start with SFT.
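A minimal sketch of that diagnostic with Hugging Face transformers (the checkpoint ID and the domain_sample.txt path are placeholders for your own base model and held-out domain text):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: substitute your base checkpoint and a sample of raw domain text.
model_id = "meta-llama/Llama-3.1-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

text = open("domain_sample.txt").read()
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)

with torch.no_grad():
    # labels=input_ids gives the mean next-token cross-entropy over the sample
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"domain perplexity: {math.exp(loss.item()):.1f}")
```

A held-out sample of a few hundred documents is enough for this check; you are comparing orders of magnitude, not chasing decimal points.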
DPO fine-tuning handles preference alignment without any online generation and is the right choice after SFT when you have human preference data. RLHF pipelines add the full actor-critic loop and are appropriate when you need a general-purpose scalar reward signal. Neither addresses domain vocabulary internalization - only CPT does.
One important clarification: CPT is not a substitute for RAG when documents change frequently. Legal regulations update. Clinical guidelines get revised. CPT produces a model fluent in a domain as it existed at training time. For dynamic knowledge, RAG plus a CPT-adapted base is often better than CPT alone.
Data Composition for CPT
Dataset construction choices have more impact on final quality than hyperparameter tuning. Getting the domain-to-replay ratio wrong by 20 percentage points often costs more than a suboptimal learning rate.
Target-Domain Ratio
The domain-specific data should be 70-90% of total tokens during CPT, not 50/50. Below 60% domain ratio, the perplexity improvement on domain text is negligible after a standard CPT run. The model needs sustained exposure to internalize domain vocabulary co-occurrence statistics. At 50% domain, general data keeps "refreshing" what the domain data just taught, and the model never fully shifts its priors.
Practical targets by domain breadth:
- Narrow specialty (rare disease clinical notes, jurisdiction-specific case law): 85-90% domain
- Moderate breadth (general clinical notes, SEC filings across industries): 75-85% domain
- Broad domain (all financial documents, all biomedical literature): 70-80% domain
Replay Buffer
15-30% of tokens should come from the original pretraining distribution (The Pile, FineWeb, or similar) to prevent catastrophic forgetting of general capabilities. This is non-negotiable for production CPT runs.
Without replay, held-out general benchmarks degrade significantly after a 50B-token CPT run:
| Domain Ratio | Replay % | MMLU Regression (pp) | HellaSwag Regression (pp) |
|---|---|---|---|
| 100% domain | 0% | 12-18 | 10-15 |
| 90% domain | 10% | 6-10 | 5-9 |
| 80% domain | 20% | 2-4 | 2-3 |
| 70% domain | 30% | 1-2 | 1-2 |
These estimates are based on published Llama-2 CPT research. The 80/20 split is the standard starting point.
Tokenizer Extension
When the domain corpus has a high out-of-vocabulary rate (more than 5% of corpus tokens are common domain terms that the base tokenizer fragments into multi-token splits), consider extending the base tokenizer. The steps:
- Train a BPE tokenizer on the domain corpus alone.
- Merge its vocabulary with the base tokenizer using SentencePiece merge or HuggingFace `tokenizer.add_tokens()`.
- Initialize new embedding rows by averaging the sub-token embeddings that previously represented each new token. Do not use random initialization - random init causes gradient instability in the first few hundred steps.
One critical detail: some Llama variants (e.g. Llama 3.2 1B/3B) tie the input and output embedding matrices; Llama 3.1 70B does not (`tie_word_embeddings: false` in its HF config). Always check `config.tie_word_embeddings` before resizing: if true, `model.embed_tokens.weight` and `model.lm_head.weight` reference the same tensor and you must update both consistently. Miss this and you get either a shape mismatch at the logit layer or, worse, silently corrupted logits.
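A hedged sketch of the merge and mean-initialization steps with Hugging Face transformers; the new-token list is hypothetical, and `resize_token_embeddings` already keeps tied weights consistent, but the explicit `tie_word_embeddings` check mirrors the warning above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Hypothetical domain terms surfaced by the domain-only BPE training step
new_tokens = ["tachyarrhythmia", "hepatosplenomegaly"]
# Record how each new token splits under the OLD tokenizer before extending it
old_splits = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
print("tied embeddings:", model.config.tie_word_embeddings)  # check before resizing
model.resize_token_embeddings(len(tokenizer))

# Mean-initialize each new row from the sub-token embeddings it replaces
emb = model.get_input_embeddings().weight
lm_head = model.get_output_embeddings().weight
with torch.no_grad():
    for tok in new_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        emb[new_id] = emb[old_splits[tok]].mean(dim=0)
        if not model.config.tie_word_embeddings:
            lm_head[new_id] = lm_head[old_splits[tok]].mean(dim=0)
```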
Tokenizer extension is only worth doing if the domain corpus is over 10B tokens and the OOV rate exceeds 5%. Below that threshold, the training overhead of new embedding initialization outweighs the per-token efficiency gain.
Learning-Rate Scheduling for CPT
The Linear Warmup Trap
Many practitioners copy a standard fine-tuning warmup schedule (1-5% of total steps at linear ramp) into CPT runs. This is wrong for CPT at scale.
Fine-tuning warmup percentages made sense when runs were 1,000-5,000 steps. CPT runs for 10,000-100,000+ steps, and percentage-based warmup scales with run length even though what the model needs does not. A pretrained model starts with well-formed weights but a freshly initialized optimizer state: it needs a few hundred steps at reduced LR for AdamW's moment estimates to stabilize before the full gradient signal starts reshaping learned attention patterns, and that number does not grow with the token budget. At 1-5% warmup on a 50,000-step run you ramp for 500-2,500 steps, burning tokens at suboptimal learning rates without buying extra stability, and the effective warmup length silently changes whenever the step count does. Cut the ramp too short at full peak LR and the model overshoots early and spends the next few thousand steps recovering.
The fix is simple: use an absolute step count for warmup (100-300 steps), not a percentage.
Recommended Schedule
- Peak LR: 10-20% of original pretraining peak LR. For Llama 3.1 70B (original lr=3e-4): use 3e-5 to 6e-5.
- Warmup: 200 steps linear warmup.
- Decay: cosine to a non-zero floor of 10% of peak LR. Do not decay to zero. A zero floor pushes the model into a local minimum that overfits to domain data. Set `eta_min` explicitly.
- Optional: cosine annealing with warm restarts (T=5,000-10,000 steps). Each restart briefly spikes the LR, helping escape local minima and improving diversity on downstream tasks.
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - warmup_steps,
    eta_min=peak_lr * 0.1,  # non-zero floor
)
```

For the warmup phase, apply linear scaling manually or use a combined scheduler that handles warmup separately from cosine decay.
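One way to package both phases, as suggested above, is torch's `SequentialLR`; a sketch with the recommended settings (the Linear module is a stand-in for the sharded 70B model, and the step counts are illustrative):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

peak_lr = 3e-5             # ~10% of Llama 3.1 70B's original 3e-4 peak
warmup_steps = 200         # absolute count, not a percentage
num_training_steps = 25_000

model = torch.nn.Linear(8, 8)  # stand-in for the sharded 70B module
optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)

warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(
    optimizer, T_max=num_training_steps - warmup_steps, eta_min=peak_lr * 0.1  # non-zero floor
)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Call scheduler.step() once per optimizer step. For the warm-restarts variant,
# swap CosineAnnealingLR for CosineAnnealingWarmRestarts(optimizer, T_0=5_000,
# eta_min=peak_lr * 0.1).
```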
Catastrophic Forgetting: Measurement and Prevention
Measuring Forgetting
Establish baselines on a regression suite before the first training step:
- MMLU (general knowledge, 57 subjects)
- HellaSwag (commonsense reasoning)
- ARC-Challenge (science reasoning)
- GSM8K (math word problems)
- HumanEval (code completion)
Log all five every 5,000 training steps. A drop above 3 percentage points on any benchmark is a warning signal. Catching this at step 5,000 costs a checkpoint restart. Catching it at step 40,000 costs weeks of retraining.
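One way to automate that cadence is the lm-evaluation-harness Python API; a sketch, assuming an HF-format checkpoint on disk (the checkpoint path and output file are placeholders, and HumanEval is omitted here because it needs generation with code execution enabled):

```python
import json
import lm_eval  # pip install lm-eval

# Placeholder path for the step-5,000 checkpoint snapshot
checkpoint = "/checkpoints/step_5000"

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={checkpoint},dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "arc_challenge", "gsm8k"],
    batch_size=8,
)

# Persist per-task scores so regressions can be diffed against the step-0 baseline
with open("regression_step_5000.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```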
Replay Buffer
The primary prevention mechanism. Keep a 20% replay stream of general-domain tokens mixed into every training batch. This is more effective than any regularization technique for 70B models, and it adds no computational overhead beyond the additional I/O for the replay data.
Elastic Weight Consolidation (EWC)
EWC adds a regularization term penalizing weight changes proportional to each weight's importance during original pretraining. The penalty:
L_EWC = λ * sum(F_i * (θ_i - θ*_i)^2)
Where F_i is the diagonal Fisher information estimate, θ_i is the current weight, and θ*_i is the pretrained checkpoint weight.
The catch: computing the Fisher matrix for a 70B model requires a full forward pass on thousands of samples and storing 70B float32 importance weights, approximately 280 GB in FP32. This is computationally expensive and memory-infeasible on most multi-node clusters. EWC is practical for 7B-13B models where the Fisher computation is tractable. For 70B, use replay buffers instead.
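For the 7B-13B regime where the Fisher estimate is tractable, the penalty is a few lines on top of the language-modeling loss; a sketch, assuming fisher_diag and pretrained_params are precomputed dicts keyed by parameter name:

```python
import torch

def ewc_penalty(model, fisher_diag, pretrained_params, lam=0.1):
    """lam * sum_i F_i * (theta_i - theta*_i)^2 over parameters with a Fisher estimate."""
    device = next(model.parameters()).device
    penalty = torch.zeros((), device=device)
    for name, param in model.named_parameters():
        if name in fisher_diag:
            delta = param - pretrained_params[name].to(device)
            penalty = penalty + (fisher_diag[name].to(device) * delta.pow(2)).sum()
    return lam * penalty

# During CPT on a model small enough for EWC:
# loss = lm_loss + ewc_penalty(model, fisher_diag, pretrained_params)
```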
LoRA-CPT
An alternative to full-parameter CPT for smaller domain corpora (under 10B tokens). Apply LoRA adapters during CPT: only the adapter parameters update, so base model weights are preserved. Catastrophic forgetting cannot happen by construction because the base weights never change.
The tradeoff is fundamental: LoRA-CPT cannot fully internalize new vocabulary or deep syntactic patterns into the model's core representations. It is more accurately described as "CPT-flavored SFT" since it changes the model's output behavior conditioned on domain text, not its internal knowledge representations.
Use LoRA-CPT when:
- Domain corpus is 1B-10B tokens
- Catastrophic forgetting prevention is the primary concern
- You need to swap domain adapters at inference time
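If you do go this route, the adapter setup is a standard peft config; a sketch with illustrative hyperparameters (the rank, alpha, and target modules are common defaults, not a tuned recipe):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,              # higher rank than typical SFT, since CPT targets broad fluency
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # base weights stay frozen; only adapters update
```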
For fine-tuning framework support covering LoRA-CPT across Axolotl, Unsloth, and torchtune, see our framework comparison guide.
Hardware Planning: B200 Count for 70B CPT at 100B Tokens
This section works through the memory math explicitly so you can size your cluster before starting.
Memory breakdown for 70B full-parameter CPT in BF16:
| Component | Size (GB) |
|---|---|
| Model weights (BF16) | 140 |
| Optimizer states (AdamW FP32 m+v) | 560 |
| Gradients (BF16) | 140 |
| Activations (activation checkpointing, saves ~80%) | 30-50 |
| Total unsharded | ~870-890 GB |
With 8x B200 SXM (192 GB each = 1,536 GB total), FSDP with `reshard_after_forward=True` shards the ~840 GB of weights, gradients, and optimizer state across 8 GPUs, roughly 105 GB per GPU. Adding the 30-50 GB of per-GPU activation memory keeps peak usage well under the 192 GB budget, with meaningful headroom for NCCL buffers and fragmentation.
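The same arithmetic as a quick sizing script, using the assumptions from the table (AdamW moments in FP32, 30-50 GB of checkpointed activations per GPU):

```python
# Back-of-envelope sizing for full-parameter BF16 CPT of a 70B model on one 8x B200 node
PARAMS = 70e9
GB = 1e9  # the table above uses decimal gigabytes

weights_bf16 = PARAMS * 2 / GB        # 140 GB
optimizer_fp32 = PARAMS * 4 * 2 / GB  # 560 GB: AdamW m and v, both FP32
grads_bf16 = PARAMS * 2 / GB          # 140 GB
activations_per_gpu = 40              # GB, midpoint of the 30-50 GB checkpointed estimate

sharded_per_gpu = (weights_bf16 + optimizer_fp32 + grads_bf16) / 8  # FSDP over 8 GPUs
peak_per_gpu = sharded_per_gpu + activations_per_gpu

print(f"sharded states per GPU: {sharded_per_gpu:.0f} GB")       # ~105 GB
print(f"peak per GPU with activations: {peak_per_gpu:.0f} GB")   # ~145 GB vs 192 GB on B200
```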
Token throughput and wall-clock estimate:
A single 8x B200 node with flash attention, gradient checkpointing, BF16 mixed precision, and batch size tuned for sequence length 4,096 achieves roughly 8,000-12,000 tokens/sec aggregate for a 70B model. At 10,000 tokens/sec and a 62.5B total token run (50B domain, 12.5B replay):

62.5B tokens / 10,000 tokens/sec = 6,250,000 seconds ≈ 72 days (single node)

That is too slow for production use. Scale the cluster:
| Token Budget | Domain % | Replay % | Recommended Setup | Wall-Clock (est.) |
|---|---|---|---|---|
| 10B tokens | 80% | 20% | 8x B200 (1 node) | ~11-12 days |
| 50B tokens | 80% | 20% | 4 nodes (32x B200) | ~14-15 days |
| 100B tokens | 80% | 20% | 8 nodes (64x B200) | ~14-15 days |
Throughput estimates are for sequence length 4,096 with activation checkpointing enabled. At sequence length 8,192, throughput drops approximately 30-40%, so budget accordingly for long-context domain corpora.
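The wall-clock column follows from straightforward throughput arithmetic; a small helper to re-derive it for other budgets (the 10,000 tok/s per-node figure is the single-node estimate above, and scaling_efficiency can be lowered to model multi-node losses):

```python
def wall_clock_days(token_budget: float, nodes: int,
                    tokens_per_sec_per_node: float = 10_000,
                    scaling_efficiency: float = 1.0) -> float:
    """Rough wall-clock estimate in days for a CPT run at a given cluster size."""
    aggregate_tps = nodes * tokens_per_sec_per_node * scaling_efficiency
    return token_budget / aggregate_tps / 86_400  # seconds per day

print(f"{wall_clock_days(10e9, nodes=1):.1f} days")    # ~11.6 days on 1 node
print(f"{wall_clock_days(50e9, nodes=4):.1f} days")    # ~14.5 days on 4 nodes
print(f"{wall_clock_days(100e9, nodes=8):.1f} days")   # ~14.5 days on 8 nodes
```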
See Spheron B200 rental for InfiniBand-enabled clusters, or H200 on Spheron as the next-tier option with 141 GB HBM3e per GPU (enough to hold 70B BF16 weights for single-GPU inference, though practical serving at that size usually needs quantized weights to leave KV-cache headroom; training still requires FSDP).
Tooling Walkthrough
torchtune
torchtune's full_finetune_distributed recipe supports CPT with minimal changes from its SFT configuration. The key difference is using TextCompletionDataset instead of an instruction-format dataset, and configuring a blended corpus with dataset weights.
```yaml
# torchtune CPT config (llama3_1/70B_full.yaml base, CPT variant)
model:
  _component_: torchtune.models.llama3_1.llama3_1_70b

tokenizer:
  _component_: torchtune.models.llama3_1.llama3_tokenizer
  path: /models/llama3-1-70b/original/tokenizer.model

dataset:
  _component_: torchtune.datasets.ConcatDataset
  datasets:
    - _component_: torchtune.datasets.TextCompletionDataset
      source: /data/domain/train.jsonl
      column: text
      max_seq_len: 4096
    - _component_: torchtune.datasets.TextCompletionDataset
      source: /data/replay/general.jsonl
      column: text
      max_seq_len: 4096
  weights: [0.8, 0.2]  # domain 80%, replay 20%

optimizer:
  _component_: torch.optim.AdamW
  lr: 3e-5  # 10% of original 3e-4 pretraining LR

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 200
  num_training_steps: ${training.max_steps}
  min_lr: 3e-6  # non-zero cosine floor

training:
  gradient_checkpointing: True
  batch_size: 2
  gradient_accumulation_steps: 8
  max_steps: 25000
```

torchtune handles FSDP2 sharding automatically when launched with torchrun. For multi-node, add `--nnodes` and `--master_addr` to the launcher command.
Megatron-Core
Megatron-Core is the right choice for CPT runs on models larger than 70B or multi-node clusters requiring 3D parallelism (tensor, pipeline, and data parallelism together). Use the pretrain_gpt.py script with the --finetune flag and a custom data blend:
```bash
python pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --seq-length 4096 \
    --max-position-embeddings 131072 \
    --train-iters 25000 \
    --lr 3e-5 \
    --min-lr 3e-6 \
    --lr-warmup-iters 200 \
    --lr-decay-style cosine \
    --data-path 0.8 /data/domain/processed 0.2 /data/replay/general \
    --finetune \
    --load /checkpoints/llama3-70b-megatron/
```

The `--finetune` flag loads the pretrained checkpoint but does not restore the optimizer state, which is the correct behavior for CPT: you want fresh optimizer momentum rather than momentum accumulated during the original pretraining run.
Hugging Face Transformers
For teams already on the HF stack, Trainer supports CPT with minimal changes. Use DataCollatorForLanguageModeling(tokenizer, mlm=False) as the data collator to enable causal LM training. Build a blended corpus using IterableDataset with interleaved sampling from domain and replay sources.
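A sketch of that blend with the datasets library, assuming each JSONL record has a text field (paths and the tokenizer ID are placeholders; the 80/20 probabilities match the replay guidance earlier):

```python
from datasets import interleave_datasets, load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")  # placeholder

# Stream both corpora so the blend never has to fit in RAM
domain = load_dataset("json", data_files="/data/domain/train.jsonl", split="train", streaming=True)
replay = load_dataset("json", data_files="/data/replay/general.jsonl", split="train", streaming=True)

blended = interleave_datasets(
    [domain, replay],
    probabilities=[0.8, 0.2],  # 80% domain, 20% replay
    seed=42,
    stopping_strategy="all_exhausted",
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_stream = blended.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM labels
# Pass train_stream as train_dataset and collator as data_collator to Trainer
```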
For throughput at 70B scale, HF Trainer with FSDP Accelerate is usable but significantly slower than Megatron or torchtune. Plan for 20-30% lower tokens/sec compared to a well-tuned Megatron setup. If the run is short (under 10B tokens) and the team is already on HF tooling, the convenience often outweighs the throughput gap. For 50B+ token runs, the throughput difference compounds into days of additional wall-clock time.
For multi-node launcher patterns, see our distributed LLM training guide.
Evaluation: Domain Benchmarks + General Regression
CPT evaluation requires two parallel tracks, and you must run both. Skipping general regression is how teams end up with a model that aces domain evals and fails basic arithmetic.
Domain Benchmark Track
Build or use existing benchmarks for your target domain:
- Legal: LegalBench, LexGLUE (legal classification and question-answering tasks built on ECtHR, EUR-LEX, SCOTUS, and other legal corpora)
- Biomedical: MedQA (USMLE format), PubMedQA (yes/no/maybe questions from PubMed abstracts)
- Finance: FinanceBench (question answering over 10-K filings), ConvFinQA (numerical reasoning)
- Code: HumanEval, SWE-bench Verified
Measure at every major checkpoint. Perplexity on held-out domain text is a fast proxy metric to monitor every 1,000 steps.
General Regression Track
Run MMLU, HellaSwag, ARC-Challenge, GSM8K, and HumanEval on every fifth checkpoint. A properly tuned CPT run with 20% replay should show no more than 1-2 percentage point absolute drop on MMLU and HellaSwag. More than 3 points means the replay ratio is too low.
Evaluation cadence:
| Steps | Action |
|---|---|
| Baseline (step 0) | Both tracks, save all scores |
| Every 1,000 steps | Training loss and domain perplexity |
| Every 5,000 steps | Full domain benchmark eval |
| Every 5,000 steps (offset by 2,500) | Full general regression eval |
| Final checkpoint | Both tracks + human eval on 50 domain prompts |
For automated judge-model evaluation of open-ended domain responses (beyond multiple-choice benchmarks), a self-hosted LLM judge running on spot instances is cost-effective at this evaluation frequency.
Cost Economics: CPT vs API Fine-Tuning vs RAG
A complete cost comparison for a 50M-token enterprise corpus (realistic for a medium-sized legal or biomedical use case):
| Approach | One-Time Setup | Recurring Cost | Inference Latency | Staleness Risk |
|---|---|---|---|---|
| CPT (50B token run, 8x B200) | GPU compute for run | Inference only post-training | Low (~20ms) | Retrain on corpus updates |
| API fine-tuning (GPT-4o, Claude) | Low | Per-token or per-call | Varies | Model updates may shift behavior |
| RAG (self-hosted) | Embedding + vector DB setup | Query compute per call | +50-200ms (retrieval) | Near-real-time updates |
Worked cost example for a 70B CPT run on Spheron B200:
Live B200 pricing as of today: $1.71/hr per GPU (lowest on-demand).
Configuration: 8 nodes x 8x B200 = 64 GPUs
Wall-clock: ~14-15 days for 100B tokens
Hours: 14.5 days x 24 hours = 348 hours
Cost: 64 GPUs x $1.71/hr x 348 hours = $38,085.12

Compare to a hyperscaler equivalent: GCP A3 Mega (8x H100) at approximately $32/hr per node, 8 nodes:
8 nodes x $32/hr/node x 348 hours = $89,088

The CPT run on Spheron costs roughly half the hyperscaler rate, with no reservation commitment required. The Spheron model is per-hour, per-GPU, so you pay only for the hours the cluster is running. A hyperscaler reservation that delivers similar guaranteed availability would require a 1- or 3-year commitment at contract pricing.
For teams running multiple CPT experiments (different domain corpus compositions, replay ratios, LR schedules), the spot pricing model is even more compelling. See our spot GPU training case study for a worked example of a team training a 70B model using spot instances for interruptible CPT checkpoint restarts.
Pricing fluctuates based on GPU availability. The prices above reflect rates as of 10 May 2026 and may have changed. Check current GPU pricing → for live rates.
Spheron Deployment: Multi-Node B200 CPT Recipe
Cluster Setup
- Provision 8 nodes of 8x B200 SXM on Spheron with InfiniBand networking enabled.
- Configure SSH trust between nodes. Set `MASTER_ADDR` to node 0's private IP and `MASTER_PORT=29500`.
- Mount shared storage (NFS or Spheron volume) at `/data` for the dataset and `/checkpoints` for checkpoint writes.
- Verify InfiniBand with `ibstat` and NVLink with `nvidia-smi nvlink --status` before starting training.
NVLink + InfiniBand for FSDP/Megatron Sharding
Two distinct interconnects handle different communication patterns:
- Within each node: Tensor parallelism (TP=8) uses NVLink 5, which provides 1.8 TB/s bidirectional per-GPU bandwidth on B200 SXM (900 GB/s per direction). All-to-all within a node is fast enough to not be the bottleneck.
- Across nodes: Pipeline parallelism (PP=8, one pipeline stage per node) and FSDP collectives (all-gather for parameters, reduce-scatter for gradients) cross InfiniBand. B200 nodes typically come with 400 Gb/s NDR InfiniBand per port.
NCCL configuration for InfiniBand:
```bash
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TIMEOUT=23
export NCCL_NET_GDR_READ=1
export NCCL_SOCKET_IFNAME=eth0
```

For NCCL socket buffer sizes, tree vs ring algorithm selection, and IB congestion control, see our full NCCL tuning guide.
Reproducible CPT Recipe
```bash
#!/bin/bash
# 8-node B200 CPT launch for 70B model
NNODES=8
NPROC_PER_NODE=8
MASTER_ADDR=$1   # pass node 0's IP as the first argument
NODE_RANK=$2     # pass this node's rank (0-7) as the second argument
MASTER_PORT=29500

torchrun \
    --nnodes=$NNODES \
    --nproc_per_node=$NPROC_PER_NODE \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    pretrain.py \
    --config configs/llama3_70b_cpt.yaml \
    --resume_from_checkpoint /checkpoints/latest/
```

Checkpoint strategy: Save every 1,000 steps in safetensors format. Safetensors is memory-mapped, so loading a checkpoint for inspection or resumption does not require loading the full model into CPU RAM first. Keep the last 3 checkpoints on disk and archive earlier ones to cold storage. CPT runs on cloud hardware can be preempted (especially if using spot instances for non-training nodes), so a 1,000-step save cadence means the worst-case restart cost is under 2 hours of compute.
Monitoring during the run: Log training loss, domain perplexity (on held-out domain text), learning rate, and gradient norm every 50 steps. A sudden spike in gradient norm that does not recover within 200 steps usually indicates the LR is too high or the replay buffer is not adequately shuffled.
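A sketch of that gradient-norm check as a small helper (the rolling-baseline window, 3x spike threshold, and 200-step patience are illustrative):

```python
from collections import deque

class GradNormMonitor:
    """Flags a gradient-norm spike that stays elevated for `patience` consecutive steps."""

    def __init__(self, baseline_window=500, spike_factor=3.0, patience=200):
        self.history = deque(maxlen=baseline_window)
        self.spike_factor = spike_factor
        self.patience = patience
        self.elevated_steps = 0

    def update(self, grad_norm: float) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else grad_norm
        self.history.append(grad_norm)
        if grad_norm > self.spike_factor * baseline:
            self.elevated_steps += 1
        else:
            self.elevated_steps = 0
        # True means the spike has not recovered within `patience` steps:
        # check the LR and the replay-stream shuffling before continuing.
        return self.elevated_steps >= self.patience

# In the training loop, after gradient clipping:
# norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# if monitor.update(float(norm)):
#     logger.warning("gradient norm spike has not recovered within 200 steps")
```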
Continuous pretraining runs on 70B models take days to weeks of wall-clock time on multi-node clusters. Spheron's B200 and H200 reserved clusters provide InfiniBand fabric and transparent per-hour pricing without hyperscaler lock-in.
Rent B200 on Spheron → | Rent H200 on Spheron → | View pricing →
