Engineering

AI Pretraining Data Curation on GPU Cloud: NeMo Curator, Datatrove, and FineWeb-Style Pipelines (2026 Guide)

Written by Mitrasish, Co-founder · May 12, 2026
AI Pretraining Data Curation · NeMo Curator GPU Cloud · Datatrove Pipeline · FineWeb Pipeline GPU · LLM Training Data Deduplication · MinHash LSH Deduplication · cuDF GPU Data Processing · Foundation Model Training Data · Quality Classifier LLM

Every pretraining team eventually hits the same wall: a terabyte-to-petabyte corpus that needs deduplication, quality filtering, PII removal, and benchmark decontamination before a single GPU touches a training run. This guide covers the full data curation stack for foundation model training, from raw web crawl to training-ready shards, using NeMo Curator, Datatrove, and FineWeb-Edu reference pipelines on GPU cloud. For what comes after the data is ready, see our continuous pretraining guide.

Why Data Curation Became GPU-Bound

At the scale of a serious pretraining run, CPU-only curation is impractical. MinHash LSH on 100B tokens takes multiple weeks on a 96-core CPU cluster; the same workload on 8x H100 with cuDF-backed Dask runs in hours. Language identification, quality classification, and PII scanning parallelize cleanly across GPUs as well.

| Stage | CPU throughput (estimated) | GPU throughput (8x H100) |
|---|---|---|
| Exact dedup (SHA256 hash) | ~2 TB/hr | ~18 TB/hr |
| MinHash LSH (fuzzy dedup) | ~300 GB/hr | ~4 TB/hr |
| fastText language ID | ~500 GB/hr | ~8 TB/hr (GPU-batch) |
| Quality classifier (BERT-small) | ~80 GB/hr | ~1.2 TB/hr |

The quality classifier stage is the bottleneck even on GPU because every document needs a forward pass through a neural network. Aggressive batching (batch size 256-512) and FP16 inference are what push throughput to the ~1.2 TB/hr figure above.
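
As a rough illustration of what that batching looks like, here is a minimal sketch of batched FP16 scoring. The model path, batch size, and binary-label convention are assumptions for illustration, not part of NeMo Curator or Datatrove.

python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "./quality-clf-final"  # placeholder path to a fine-tuned quality scorer
BATCH_SIZE = 256

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model = model.cuda().half().eval()  # FP16 inference on GPU

def score_batch(texts):
    # Tokenize the whole batch at once; pad to the longest sequence in the batch
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=512, padding=True)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
        # Probability of the "high quality" class (index 1, assuming binary labels)
        return torch.softmax(logits.float(), dim=-1)[:, 1].tolist()

def score_documents(docs):
    scores = []
    for i in range(0, len(docs), BATCH_SIZE):
        scores.extend(score_batch(docs[i:i + BATCH_SIZE]))
    return scores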

The shift from CPU to GPU curation is not a performance tweak. At 100T token scale, it determines whether a team can iterate on their pipeline in a week or spend a month waiting for dedup to finish.

Pipeline Architecture: Eight Stages from Raw Web to Training-Ready Tokens

A production curation pipeline has eight distinct stages. Each stage reduces the corpus, and the order matters: run cheap operations first to minimize data processed by expensive ones.

| Stage | Tools | Approximate data reduction |
|---|---|---|
| 1. Ingest / format normalization | WET/WARC parsers, trafilatura, resiliparse | 0-5% (malformed docs) |
| 2. Language identification | fastText, CLD3, lingua | 50-80% (keep English or target lang) |
| 3. Exact dedup (document-level) | SHA256 hash, NeMo ExactDuplicates | 10-30% of language-filtered |
| 4. Fuzzy dedup (MinHash LSH) | NeMo MinHashDeduplicator, Datatrove | 10-25% of exact-deduped |
| 5. Quality / heuristic filtering | Gopher rules, C4 rules, line stats | 20-40% of fuzzy-deduped |
| 6. Classifier-based quality scoring | FineWebEduClassifier, fastText, BERT | 30-70% of heuristic-filtered |
| 7. PII removal | NeMo PiiModifier, presidio | Minimal doc loss |
| 8. Benchmark decontamination | n-gram bloom filter | <1% |

The combined pass rate for a full FineWeb-Edu-style pipeline applied to raw Common Crawl is 5-7%. Starting with 100T raw tokens yields 5-7T high-quality training tokens.
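
The n-gram matching behind stage 8 (benchmark decontamination) is simple enough to sketch directly. The snippet below uses a plain Python set of hashed 13-grams as a stand-in for a bloom filter; the eval file paths, n-gram length, and hit threshold are illustrative assumptions rather than values from any library above.

python
import hashlib
import json

NGRAM = 13  # n-gram length in tokens (assumption; pick to match your decontamination policy)

def ngrams(text, n=NGRAM):
    toks = text.split()
    for i in range(len(toks) - n + 1):
        yield hashlib.sha1(" ".join(toks[i:i + n]).encode()).digest()

# Build the contamination set from benchmark files (hypothetical paths and schema)
eval_ngrams = set()
for path in ["./evals/mmlu.jsonl", "./evals/gsm8k.jsonl"]:
    with open(path) as f:
        for line in f:
            eval_ngrams.update(ngrams(json.loads(line)["text"]))

def is_contaminated(doc_text, max_hits=1):
    # Drop a document that shares more than max_hits n-grams with any eval set;
    # swap the set for a bloom filter when the eval corpus is large.
    hits = sum(1 for g in ngrams(doc_text) if g in eval_ngrams)
    return hits > max_hits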

Raw WARCs
    |
[Language ID] ---> discard non-target language (50-80% dropped)
    |
[Exact Dedup] ---> discard duplicate documents
    |
[Fuzzy Dedup] ---> discard near-duplicates (MinHash LSH)
    |
[Heuristic Filter] ---> discard by line stats, word length, punctuation ratio
    |
[Quality Classifier] ---> discard low-scoring documents
    |
[PII Removal] ---> redact sensitive fields in-place
    |
[Decontamination] ---> remove eval set contamination
    |
Training-ready Parquet / JSONL shards

The curated shards are the upstream input to your MLOps orchestration layer. For building a reproducible pipeline that chains curation with training stages, the MLOps pipeline guide for Kubeflow, ZenML, and Metaflow covers how to wire these steps into a DAG with spot scheduling and checkpoint management.

NeMo Curator on GPU Cloud

NeMo Curator is NVIDIA's open-source data curation library built on RAPIDS (cuDF + Dask). It is the fastest option for large-scale fuzzy dedup and is actively maintained alongside the NeMo training framework.

Installation:

bash
pip install nemo-curator[cuda12x]
# verify RAPIDS/cuDF
python -c "import cudf; print('cuDF version:', cudf.__version__)"

Single-node pipeline (8x H100):

python
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    MeanWordLengthFilter,
    RepeatedLinesByCharFilter,
    PunctuationFilter,
)
from nemo_curator import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client

# Initialize a GPU-backed Dask cluster across all available GPUs
client = get_client(cluster_type="gpu")

# Load JSONL corpus from disk
dataset = DocumentDataset.read_json("./corpus/*.jsonl", add_filename=True)

# Stage 1: exact deduplication by MD5 hash
exact_dup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
    results_dir="./exact_dup_results/"
)
dataset = exact_dup(dataset)

# Stage 2: MinHash fuzzy deduplication
from nemo_curator import MinHashDeduplicator

minhash = MinHashDeduplicator(
    id_field="id",
    text_field="text",
    num_hashes=128,
    char_ngrams=5,
    jaccard_threshold=0.8,
    results_dir="./minhash_results/"
)
dataset = minhash(dataset)

# Stage 3: heuristic quality filters
filters = nc.Sequential([
    WordCountFilter(min_words=50, max_words=100_000),
    MeanWordLengthFilter(min_mean_word_length=3, max_mean_word_length=10),
    RepeatedLinesByCharFilter(max_repeated_lines_fraction=0.3),
    PunctuationFilter(max_non_alpha_numeric_to_alpha_ratio=0.3),
])
dataset = filters(dataset)

# Write output to Parquet
dataset.to_parquet("./curated_output/")

Multi-node configuration (8+ nodes on Spheron):

For 64x H100 across 8 nodes, use a distributed Dask scheduler. Launch one scheduler process and one dask-cuda-worker process per node:

bash
# On the scheduler node
dask scheduler --port 8786 &

# On each worker node (run this on all 8 nodes)
NCCL_IB_HCA=mlx5_0:1 \
NCCL_IB_GID_INDEX=3 \
NCCL_NET_GDR_LEVEL=PHB \
dask-cuda-worker scheduler-host:8786 \
  --nthreads 1 \
  --memory-limit 80GiB \
  --device-memory-limit 70GiB

Then connect from your Python script:

python
from dask.distributed import Client

client = Client("scheduler-host:8786")
print(f"Connected to {len(client.scheduler_info()['workers'])} workers")

# Same pipeline code as above runs distributed across all GPUs

The NCCL_IB_HCA and NCCL_IB_GID_INDEX environment variables that the distributed training guide covers for NCCL apply to multi-node Dask GPU clusters as well: worker-to-worker data transfer during Dask's graph execution can use InfiniBand when it is available.

Datatrove Pipelines

Datatrove is a Hugging Face library for building modular text processing pipelines. It implements the exact FineWeb and FineWeb-Edu reference pipelines, making it the right tool when reproducibility against a published dataset matters. It is CPU-first but supports GPU-backed classifier steps via custom filter classes.

Single-node pipeline:

python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import (
    LanguageFilter,
    GopherQualityFilter,
    C4QualityFilter,
)
from datatrove.pipeline.dedup import MinHashDeduplicator
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
            compression="gzip",
        ),
        LanguageFilter(
            language_threshold=0.65,
            languages=("en",),
        ),
        GopherQualityFilter(
            min_doc_words=50,
            max_doc_words=100_000,
        ),
        C4QualityFilter(
            filter_no_terminal_punct=True,
        ),
        MinHashDeduplicator(
            num_hashes=128,
            jaccard_threshold=0.8,
        ),
        JsonlWriter("./output/"),
    ],
    tasks=64,
    workers=16,
    logging_dir="./logs/",
)
executor.run()

Multi-node via Slurm (using Slurm on GPU cloud):

python
from datatrove.executor import SlurmPipelineExecutor

executor = SlurmPipelineExecutor(
    pipeline=[
        WarcReader("/nfs/commoncrawl/", glob_pattern="*.warc.gz"),
        LanguageFilter(language_threshold=0.65, languages=("en",)),
        GopherQualityFilter(min_doc_words=50),
        C4QualityFilter(),
        MinHashDeduplicator(num_hashes=128, jaccard_threshold=0.8),
        JsonlWriter("/nfs/output/"),
    ],
    tasks=512,
    workers=64,
    partition="gpu",
    time="72:00:00",
    mem_per_cpu_gb=4,
    cpus_per_task=4,
    logging_dir="/nfs/logs/",
    slurm_logs_folder="/nfs/slurm-logs/",
)
executor.run()

Adding a GPU-backed custom filter:

python
from datatrove.pipeline.filters.base_filter import BaseFilter
from datatrove.data import Document
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BertQualityFilter(BaseFilter):
    def __init__(self, model_path: str, threshold: float = 0.7):
        super().__init__()
        self.threshold = threshold
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda().half()

    def filter(self, doc: Document) -> bool:
        inputs = self.tokenizer(
            doc.text[:512],
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
            score = torch.softmax(logits, dim=-1)[0, 1].item()
        return score >= self.threshold
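
To use it, drop the filter into the executor's pipeline list in place of (or after) the heuristic filters. The model path below is a placeholder for whatever scorer you trained; imports are the same as in the single-node example above.

python
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
        ),
        LanguageFilter(language_threshold=0.65, languages=("en",)),
        GopherQualityFilter(min_doc_words=50),
        # GPU-backed quality scoring replaces or follows the heuristic filters
        BertQualityFilter(model_path="./quality-clf-final", threshold=0.7),
        JsonlWriter("./output-scored/"),
    ],
    tasks=8,      # keep task count modest so each task can hold the model in GPU memory
    workers=8,
    logging_dir="./logs-scored/",
)
executor.run()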

NeMo Curator vs Datatrove vs Custom Spark+RAPIDS: Feature and Cost Comparison

| Dimension | NeMo Curator | Datatrove | Custom Spark+RAPIDS |
|---|---|---|---|
| GPU acceleration | Native (cuDF, RAPIDS) | Optional (custom filters) | Via RAPIDS Accelerator for Spark |
| Language | Python (cuDF API) | Python (modular steps) | Scala/Python |
| Fuzzy dedup algorithm | MinHash + LSH (cuDF) | MinHash (CPU) | MinHash via Spark MLlib |
| Multi-node | Dask distributed | Slurm / multiprocessing | Spark cluster |
| FineWeb-Edu compatibility | Partial (add classifiers) | Yes (reference impl.) | Yes |
| Cold-start time | 10-15 min (RAPIDS init) | 2-3 min | 5-10 min (Spark) |
| Best for | 10T+ token fuzzy dedup | Reproducibility / FineWeb-style | Existing Spark infra |

Cost per trillion curated tokens on H100 SXM5:

At $2.57/GPU/hr on-demand ($1.52/GPU/hr spot), curating 100T raw tokens down to 5T high-quality output tokens with NeMo Curator on a 64x H100 cluster (1,536 GPU-hours) costs approximately $3,948 on-demand or $2,335 at spot rates. That works out to roughly $790/trillion output tokens on-demand, or $467/trillion on spot. Since curation workloads are embarrassingly parallel and fully restartable from any checkpoint shard, running on spot is the recommended default, with on-demand as fallback when spot capacity is unavailable.
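
The arithmetic behind those figures takes a few lines to check; the rates and pass rate below are the ones quoted above, so swap in your own numbers.

python
GPU_HOURS = 1_536          # 64x H100 for ~24 hours
ON_DEMAND = 2.57           # $/GPU/hr
SPOT = 1.52                # $/GPU/hr
OUTPUT_TOKENS_T = 5        # trillions of curated tokens out of 100T raw (5% pass rate)

for label, rate in [("on-demand", ON_DEMAND), ("spot", SPOT)]:
    total = GPU_HOURS * rate
    print(f"{label}: ${total:,.0f} total, ${total / OUTPUT_TOKENS_T:,.0f} per trillion output tokens")
# on-demand: $3,948 total, $790 per trillion output tokens
# spot: $2,335 total, $467 per trillion output tokens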

Reproducing the FineWeb-Edu Pipeline on a Spheron Multi-Node Cluster

FineWeb-Edu is the Hugging Face dataset created by running Common Crawl through a quality classifier trained to identify educationally valuable web text. Reproducing this pipeline gives you a repeatable process for building high-quality pretraining corpora.

Cluster setup: Rent H100 SXM5 nodes on Spheron and configure them with Slurm using the setup from the Slurm on GPU cloud guide. For a FineWeb-Edu reproduction run, 8 nodes (64x H100 total) with a shared NFS mount is a practical starting point.

bash
pip install datatrove[processing]

Full FineWeb-Edu pipeline:

python
from datatrove.executor import SlurmPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import (
    LanguageFilter,
    GopherQualityFilter,
    C4QualityFilter,
    FineWebEduClassifier,
)
from datatrove.pipeline.dedup import MinHashDeduplicator
from datatrove.pipeline.writers import JsonlWriter

NFS_BASE = "/nfs/fineweb-edu-repro"

executor = SlurmPipelineExecutor(
    pipeline=[
        WarcReader(
            f"{NFS_BASE}/raw-warcs/",
            glob_pattern="*.warc.gz",
        ),
        LanguageFilter(
            language_threshold=0.65,
            languages=("en",),
        ),
        GopherQualityFilter(
            min_doc_words=50,
            max_doc_words=100_000,
            max_symbol_to_word_ratio=0.1,
            max_bullet_lines_ratio=0.9,
        ),
        C4QualityFilter(
            filter_no_terminal_punct=True,
            filter_lorem_ipsum=True,
        ),
        FineWebEduClassifier(
            # Score documents 0-5; keep those >= 3 (educationally valuable)
            cutoff=3,
        ),
        MinHashDeduplicator(
            num_hashes=128,
            jaccard_threshold=0.8,
            output_folder=f"{NFS_BASE}/minhash-sigs/",
        ),
        JsonlWriter(f"{NFS_BASE}/output/"),
    ],
    tasks=512,
    workers=64,
    partition="gpu",
    time="96:00:00",
    mem_per_cpu_gb=8,
    cpus_per_task=4,
    logging_dir=f"{NFS_BASE}/logs/",
    slurm_logs_folder=f"{NFS_BASE}/slurm-logs/",
    slurm_array_parallelism=64,
)

executor.run()

Expected output: For a full CommonCrawl dump, this pipeline passes approximately 5-7% of raw tokens. A 100T token raw corpus produces 5-7T high-quality output tokens after all filtering stages.

The FineWebEduClassifier step runs on CPU by default (it calls a pretrained scorer). For large runs, you can wrap it in the BertQualityFilter pattern shown above to push classification to GPU and increase throughput from ~80 GB/hr to ~1.2 TB/hr.
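
A sketch of that pattern, assuming the publicly released HuggingFaceFW/fineweb-edu-classifier checkpoint and its 0-5 regression-score convention (check the model card before relying on either assumption):

python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datatrove.pipeline.filters.base_filter import BaseFilter
from datatrove.data import Document

class GpuFineWebEduFilter(BaseFilter):
    """GPU port of the edu-scoring step, following the BertQualityFilter pattern above."""

    def __init__(self, model_id: str = "HuggingFaceFW/fineweb-edu-classifier", cutoff: float = 3.0):
        super().__init__()
        self.cutoff = cutoff
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda().half()

    def filter(self, doc: Document) -> bool:
        inputs = self.tokenizer(doc.text, return_tensors="pt", truncation=True, max_length=512)
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        with torch.no_grad():
            # Single regression logit interpreted as an educational-value score from 0 to 5 (assumption)
            score = self.model(**inputs).logits.squeeze().float().item()
        return score >= self.cutoff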

Quality Classifier Training: fastText and Small Transformer Scorers

The FineWebEduClassifier covers general web quality well, but domain-specific corpora (legal, medical, scientific, code) need a custom scorer trained on in-domain positive and negative examples.

fastText classifier (CPU, high throughput):

fastText's supervised mode trains a binary classifier in minutes and runs inference at 500 GB/hr on CPU. It works well as a first-pass filter before applying a slower BERT-based scorer.

bash
# Prepare training data (one label per line, fastText format)
cat positive.txt | awk '{print "__label__pos " $0}' > train.txt
cat negative.txt | awk '{print "__label__neg " $0}' >> train.txt

# Train
fasttext supervised \
  -input train.txt \
  -output quality_classifier \
  -epoch 5 \
  -lr 0.5 \
  -wordNgrams 2 \
  -dim 256

# Evaluate
fasttext test quality_classifier.bin dev.txt
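
To apply the trained model inside a Python filtering step, the fasttext package exposes load_model and predict. A minimal sketch; note that predict rejects embedded newlines, so they are stripped first:

python
import fasttext

model = fasttext.load_model("quality_classifier.bin")

def keep(text: str, threshold: float = 0.9) -> bool:
    # predict() returns parallel sequences of labels and probabilities
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__pos" and probs[0] >= threshold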

Small transformer classifier (GPU, higher accuracy):

Fine-tune distilbert-base-uncased or bert-small on 50,000 labeled examples. A sequence classification head learns to score documents 0 (low quality) to 1 (high quality). Deploy via the BertQualityFilter shown in the Datatrove section above.

Training takes under 2 hours on a single H100 for 50k examples with 3 epochs:

python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("json", data_files={
    "train": "labeled_train.jsonl",
    "validation": "labeled_val.jsonl",
})

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

args = TrainingArguments(
    output_dir="./quality-clf",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], eval_dataset=tokenized["validation"])
trainer.train()
trainer.save_model("./quality-clf-final")

For scientific text, allenai/scibert_scivocab_uncased is a better starting point than generic DistilBERT; for multilingual corpora, a multilingual base such as xlm-roberta-base is the better fit.

Cost Math: GPU-Hours per Trillion Tokens of Curated Output

The following table assumes 100T raw input tokens (~400 TB JSONL) and a 5% pass rate (5T curated output tokens). Wall-clock time assumes GPU-parallel dedup and filtering with no bottlenecked single-thread stages.

| Cluster | GPUs | Wall-clock time | Total GPU-hours | On-demand cost | Spot cost | Cost per trillion tokens (on-demand / spot) |
|---|---|---|---|---|---|---|
| Single node | 8x H100 SXM5 | ~190 hrs | 1,520 | $3,906 | $2,310 | ~$781 / ~$462 |
| 8-node cluster | 64x H100 SXM5 | ~24 hrs | 1,536 | $3,948 | $2,335 | ~$790 / ~$467 |
| 32-node cluster | 256x H100 SXM5 | ~6 hrs | 1,536 | $3,948 | $2,335 | ~$790 / ~$467 |

Total GPU-hours are approximately constant across cluster sizes; what changes is wall-clock time. The 32-node cluster finishes the same job in 6 hours rather than 8 days, which matters when you are iterating on pipeline parameters or racing a deadline.

For memory-intensive classifier stages, using H200 on Spheron at $4.22/GPU/hr on-demand ($1.76/GPU/hr spot) gives 141 GB of HBM3e per GPU, allowing larger batch sizes and higher classifier throughput. The additional per-GPU cost is typically offset by fewer GPUs needed for the same wall-clock time.

Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.

Hand-Off to Training: Parquet, WebDataset, and Megatron-Core Integration

Once curation is done, the output format must match what your training framework expects. This is a common point where teams lose time with format conversion.

NeMo Curator output (Parquet) to Megatron-Core:

Megatron-Core's tools/preprocess_data.py expects JSONL. Convert from Parquet first:

python
import pyarrow.parquet as pq
import json

table = pq.read_table("./curated_output/")
with open("./megatron_input.jsonl", "w") as f:
    for batch in table.to_batches():
        for row in batch.to_pydict()["text"]:
            f.write(json.dumps({"text": row}) + "\n")
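
pq.read_table pulls the entire output into host memory, which is fine for a sample but not for a multi-terabyte directory. A streaming variant with pyarrow.dataset keeps memory bounded; the batch size is an arbitrary choice here:

python
import pyarrow.dataset as ds
import json

dataset = ds.dataset("./curated_output/", format="parquet")
with open("./megatron_input.jsonl", "w") as f:
    # Scan batch-by-batch so memory use stays flat regardless of corpus size
    for batch in dataset.to_batches(columns=["text"], batch_size=65_536):
        for text in batch.to_pydict()["text"]:
            f.write(json.dumps({"text": text}) + "\n")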

Then run Megatron-Core preprocessing to build binary indexed dataset files:

bash
python tools/preprocess_data.py \
  --input ./megatron_input.jsonl \
  --output-prefix ./megatron_dataset \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file gpt2-vocab.json \
  --merge-file gpt2-merges.txt \
  --append-eod \
  --workers 32

Datatrove WebDataset output for TorchTitan:

TorchTitan reads WebDataset archives natively via torchdata:

python
from datatrove.pipeline.writers import WebDatasetWriter

# Replace JsonlWriter with WebDatasetWriter in your pipeline
WebDatasetWriter(
    output_folder="/nfs/output-webdataset/",
    max_file_size=500,  # MB per shard
)
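
A quick sanity check on the written shards before a trainer touches them. This uses the standalone webdataset package and a guessed shard filename, both assumptions rather than anything the writer above guarantees:

python
import webdataset as wds

# Inspect the first few samples of one shard (path and naming are placeholders)
shard = wds.WebDataset("/nfs/output-webdataset/shard-000000.tar")
for i, sample in enumerate(shard):
    print(sample.keys())  # e.g. dict_keys(['__key__', '__url__', 'txt'])
    if i >= 2:
        break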

NeMo Curator native output:

Write curated output to Parquet using DocumentDataset's built-in method:

python
dataset.to_parquet("./nemo_output/")

Index file generation:

For lazy loading at multi-node training scale, always generate index files before launching training. Megatron-Core preprocessing writes a .idx index file alongside each .bin data file (for the command above, megatron_dataset_text_document.bin and .idx) automatically. Verify the index exists and has the expected document count before starting any training run.
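
A minimal pre-flight check, assuming the default "<prefix>_text_document" naming from the preprocessing command above; it only confirms the files exist and are non-empty (a full document-count check needs Megatron's own dataset reader):

python
from pathlib import Path

prefix = Path("./megatron_dataset_text_document")  # assumption: default Megatron output naming
for ext in (".bin", ".idx"):
    p = prefix.with_suffix(ext)
    assert p.exists() and p.stat().st_size > 0, f"missing or empty: {p}"
    print(f"{p}: {p.stat().st_size / 1e9:.2f} GB")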

See our distributed LLM training guide for the full multi-node training setup once your data is preprocessed, and the continuous pretraining guide for domain-specific data composition strategies when mixing curated web data with domain corpora.

For cost optimization on bursty curation workloads, the spot GPU training case study covers how teams structure jobs to maximize spot fleet utilization.

Cluster Sizing and Spot Fleet Strategy for Curation Workloads

Data curation is almost perfectly suited for spot GPU instances. Each document shard is processed independently, there is no inter-GPU communication during filtering (unlike model training), and pipelines are restartable from any checkpoint shard. If a spot instance gets preempted, the pipeline resumes from the last completed shard.

This makes curation one of the highest-confidence spot workloads in the ML pipeline.
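
The resumability claim is easy to make concrete: keep a one-to-one mapping between input and output shards, and on restart skip any input whose output already exists. A minimal sketch with hypothetical paths and a process_shard stand-in for the real curation pass:

python
from pathlib import Path

IN_DIR = Path("/nfs/raw-shards")       # hypothetical input location
OUT_DIR = Path("/nfs/curated-shards")  # hypothetical output location

def process_shard(src: Path, dst: Path) -> None:
    # Stand-in for one full curation pass over a single shard
    dst.write_bytes(src.read_bytes())

for src in sorted(IN_DIR.glob("*.jsonl")):
    dst = OUT_DIR / src.name
    if dst.exists():
        continue  # completed before the preemption; skip on resume
    tmp = dst.with_suffix(".tmp")
    process_shard(src, tmp)
    tmp.rename(dst)  # atomic rename marks the shard as done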

| Cluster | GPUs | On-demand rate | Spot rate | 100T-token job cost (on-demand) | 100T-token job cost (spot) |
|---|---|---|---|---|---|
| Single node | 8x H100 SXM5 | $20.56/hr | $12.16/hr | $3,906 | $2,310 |
| 8-node cluster | 64x H100 SXM5 | $164.48/hr | $97.28/hr | $3,948 | $2,335 |
| 32-node cluster | 256x H100 SXM5 | $657.92/hr | $389.12/hr | $3,948 | $2,335 |

For the memory-intensive quality classification stage with large BERT models, provisioning H200 on Spheron at $4.22/GPU/hr on-demand ($1.76/GPU/hr spot) gives 141 GB of HBM3e per GPU vs the H100's 80 GB. The extra memory allows 3-4x larger batch sizes for classifier inference, reducing wall-clock time for the slowest pipeline stage.

The practical recommendation for most teams: run a 64-GPU H100 spot cluster for the bulk of the pipeline (dedup, language ID, heuristic filtering), and switch to a smaller 8-GPU H200 cluster for quality classification to maximize classifier throughput.

Spheron's per-second billing makes this kind of staged cluster strategy practical: you spin up the large cluster for 24 hours, release it, then spin up the small H200 cluster for the classifier pass.

Data curation at petabyte scale is a bursty compute workload: weeks of intense GPU use, then silence. Spheron's spot fleet and per-second billing make it practical to spin up 64-256 GPU curation clusters without committing to reserved capacity.

Rent H100 SXM5 → | Rent H200 on Spheron → | View all GPU pricing →
