Engineering

AI Pretraining Data Curation on GPU Cloud: NeMo Curator, Datatrove, and FineWeb-Style Pipelines (2026 Guide)

Written by Mitrasish, Co-founder · May 12, 2026
AI Pretraining Data Curation · NeMo Curator GPU Cloud · Datatrove Pipeline · FineWeb Pipeline GPU · LLM Training Data Deduplication · MinHash LSH Deduplication · cuDF GPU Data Processing · Foundation Model Training Data · Quality Classifier LLM

Every pretraining team eventually hits the same wall: a terabyte-to-petabyte corpus that needs deduplication, quality filtering, PII removal, and benchmark decontamination before a single GPU touches a training run. This guide covers the full data curation stack for foundation model training, from raw web crawl to training-ready shards, using NeMo Curator, Datatrove, and FineWeb-Edu reference pipelines on GPU cloud. For what comes after the data is ready, see our continuous pretraining guide.

Why Data Curation Became GPU-Bound

At the scale of a serious pretraining run, CPU-only curation is impractical. MinHash LSH on 100B tokens takes multiple weeks on a 96-core CPU cluster; the same workload on 8x H100 with cuDF-backed Dask runs in hours. Language identification, quality classification, and PII scanning parallelize cleanly across GPUs as well.

| Stage | CPU throughput (estimated) | GPU throughput (8x H100) |
|---|---|---|
| Exact dedup (SHA256 hash) | ~2 TB/hr | ~18 TB/hr |
| MinHash LSH (fuzzy dedup) | ~300 GB/hr | ~4 TB/hr |
| fastText language ID | ~500 GB/hr | ~8 TB/hr (GPU-batch) |
| Quality classifier (BERT-small) | ~80 GB/hr | ~1.2 TB/hr |

The quality classifier stage is the bottleneck even on GPU because every document needs a forward pass through a neural network. Aggressive batching (batch size 256-512) and FP16 inference are what push throughput to the ~1.2 TB/hr figure above.
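
As a rough illustration of what that batching looks like, here is a minimal sketch of batched FP16 scoring. The model path, batch size, and binary-label convention are assumptions for illustration, not part of NeMo Curator or Datatrove.

python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "./quality-clf-final"  # placeholder path to a fine-tuned quality scorer
BATCH_SIZE = 256

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model = model.cuda().half().eval()  # FP16 inference on GPU

def score_batch(texts):
    # Tokenize the whole batch at once; pad to the longest sequence in the batch
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=512, padding=True)
    inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
        # Probability of the "high quality" class (index 1, assuming binary labels)
        return torch.softmax(logits.float(), dim=-1)[:, 1].tolist()

def score_documents(docs):
    scores = []
    for i in range(0, len(docs), BATCH_SIZE):
        scores.extend(score_batch(docs[i:i + BATCH_SIZE]))
    return scores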

The shift from CPU to GPU curation is not a performance tweak. At 100T token scale, it determines whether a team can iterate on their pipeline in a week or spend a month waiting for dedup to finish.

Pipeline Architecture: Eight Stages from Raw Web to Training-Ready Tokens

A production curation pipeline has eight distinct stages. Each stage reduces the corpus, and the order matters: run cheap operations first to minimize data processed by expensive ones.

| Stage | Tools | Approximate data reduction |
|---|---|---|
| 1. Ingest / format normalization | WET/WARC parsers, trafilatura, resiliparse | 0-5% (malformed docs) |
| 2. Language identification | fastText, CLD3, lingua | 50-80% (keep English or target lang) |
| 3. Exact dedup (document-level) | SHA256 hash, NeMo ExactDuplicates | 10-30% of language-filtered |
| 4. Fuzzy dedup (MinHash LSH) | NeMo MinHashDeduplicator, Datatrove | 10-25% of exact-deduped |
| 5. Quality / heuristic filtering | Gopher rules, C4 rules, line stats | 20-40% of fuzzy-deduped |
| 6. Classifier-based quality scoring | FineWebEduClassifier, fastText, BERT | 30-70% of heuristic-filtered |
| 7. PII removal | NeMo PiiModifier, presidio | Minimal doc loss |
| 8. Benchmark decontamination | n-gram bloom filter | <1% |

The combined pass rate for a full FineWeb-Edu-style pipeline applied to raw Common Crawl is 5-7%. Starting with 100T raw tokens yields 5-7T high-quality training tokens.
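
The n-gram matching behind stage 8 (benchmark decontamination) is simple enough to sketch directly. The snippet below uses a plain Python set of hashed 13-grams as a stand-in for a bloom filter; the eval file paths, n-gram length, and hit threshold are illustrative assumptions rather than values from any library above.

python
import hashlib
import json

NGRAM = 13  # n-gram length in tokens (assumption; pick to match your decontamination policy)

def ngrams(text, n=NGRAM):
    toks = text.split()
    for i in range(len(toks) - n + 1):
        yield hashlib.sha1(" ".join(toks[i:i + n]).encode()).digest()

# Build the contamination set from benchmark files (hypothetical paths and schema)
eval_ngrams = set()
for path in ["./evals/mmlu.jsonl", "./evals/gsm8k.jsonl"]:
    with open(path) as f:
        for line in f:
            eval_ngrams.update(ngrams(json.loads(line)["text"]))

def is_contaminated(doc_text, max_hits=1):
    # Drop a document that shares more than max_hits n-grams with any eval set;
    # swap the set for a bloom filter when the eval corpus is large.
    hits = sum(1 for g in ngrams(doc_text) if g in eval_ngrams)
    return hits > max_hits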

Raw WARCs
    |
[Language ID] ---> discard non-target language (50-80% dropped)
    |
[Exact Dedup] ---> discard duplicate documents
    |
[Fuzzy Dedup] ---> discard near-duplicates (MinHash LSH)
    |
[Heuristic Filter] ---> discard by line stats, word length, punctuation ratio
    |
[Quality Classifier] ---> discard low-scoring documents
    |
[PII Removal] ---> redact sensitive fields in-place
    |
[Decontamination] ---> remove eval set contamination
    |
Training-ready Parquet / JSONL shards

The curated shards are the upstream input to your MLOps orchestration layer. For building a reproducible pipeline that chains curation with training stages, the MLOps pipeline guide for Kubeflow, ZenML, and Metaflow covers how to wire these steps into a DAG with spot scheduling and checkpoint management.

NeMo Curator on GPU Cloud

NeMo Curator is NVIDIA's open-source data curation library built on RAPIDS (cuDF + Dask). It is the fastest option for large-scale fuzzy dedup and is actively maintained alongside the NeMo training framework.

Installation:

bash
pip install nemo-curator[cuda12x]
# verify RAPIDS/cuDF
python -c "import cudf; print('cuDF version:', cudf.__version__)"

Single-node pipeline (8x H100):

python
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    WordCountFilter,
    MeanWordLengthFilter,
    RepeatedLinesByCharFilter,
    PunctuationFilter,
)
from nemo_curator import ExactDuplicates
from nemo_curator.utils.distributed_utils import get_client

# Initialize a GPU-backed Dask cluster across all available GPUs
client = get_client(cluster_type="gpu")

# Load JSONL corpus from disk
dataset = DocumentDataset.read_json("./corpus/*.jsonl", add_filename=True)

# Stage 1: exact deduplication by MD5 hash
exact_dup = ExactDuplicates(
    id_field="id",
    text_field="text",
    hash_method="md5",
    results_dir="./exact_dup_results/"
)
dataset = exact_dup(dataset)

# Stage 2: MinHash fuzzy deduplication
from nemo_curator import MinHashDeduplicator

minhash = MinHashDeduplicator(
    id_field="id",
    text_field="text",
    num_hashes=128,
    char_ngrams=5,
    jaccard_threshold=0.8,
    results_dir="./minhash_results/"
)
dataset = minhash(dataset)

# Stage 3: heuristic quality filters
filters = nc.Sequential([
    WordCountFilter(min_words=50, max_words=100_000),
    MeanWordLengthFilter(min_mean_word_length=3, max_mean_word_length=10),
    RepeatedLinesByCharFilter(max_repeated_lines_fraction=0.3),
    PunctuationFilter(max_non_alpha_numeric_to_alpha_ratio=0.3),
])
dataset = filters(dataset)

# Write output to Parquet
dataset.to_parquet("./curated_output/")

Multi-node configuration (8+ nodes on Spheron):

For 64x H100 across 8 nodes, use a distributed Dask scheduler. Launch one scheduler process and one dask-cuda-worker process per node:

bash
# On the scheduler node
dask scheduler --port 8786 &

# On each worker node (run this on all 8 nodes)
NCCL_IB_HCA=mlx5_0:1 \
NCCL_IB_GID_INDEX=3 \
NCCL_NET_GDR_LEVEL=PHB \
dask-cuda-worker scheduler-host:8786 \
  --nthreads 1 \
  --memory-limit 80GiB \
  --device-memory-limit 70GiB

Then connect from your Python script:

python
from dask.distributed import Client

client = Client("scheduler-host:8786")
print(f"Connected to {len(client.scheduler_info()['workers'])} workers")

# Same pipeline code as above runs distributed across all GPUs

The NCCL_IB_HCA and NCCL_IB_GID_INDEX environment variables that the distributed training guide covers for NCCL apply to multi-node Dask GPU clusters as well: worker-to-worker data transfer during Dask's graph execution can use InfiniBand when it is available.

Datatrove Pipelines

Datatrove is a Hugging Face library for building modular text processing pipelines. It implements the exact FineWeb and FineWeb-Edu reference pipelines, making it the right tool when reproducibility against a published dataset matters. It is CPU-first but supports GPU-backed classifier steps via custom filter classes.

Single-node pipeline:

python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import (
    LanguageFilter,
    GopherQualityFilter,
    C4QualityFilter,
)
from datatrove.pipeline.dedup import MinHashDeduplicator
from datatrove.pipeline.writers import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
            compression="gzip",
        ),
        LanguageFilter(
            language_threshold=0.65,
            languages=("en",),
        ),
        GopherQualityFilter(
            min_doc_words=50,
            max_doc_words=100_000,
        ),
        C4QualityFilter(
            filter_no_terminal_punct=True,
        ),
        MinHashDeduplicator(
            num_hashes=128,
            jaccard_threshold=0.8,
        ),
        JsonlWriter("./output/"),
    ],
    tasks=64,
    workers=16,
    logging_dir="./logs/",
)
executor.run()

Multi-node via Slurm (using Slurm on GPU cloud):

python
from datatrove.executor import SlurmPipelineExecutor

executor = SlurmPipelineExecutor(
    pipeline=[
        WarcReader("/nfs/commoncrawl/", glob_pattern="*.warc.gz"),
        LanguageFilter(language_threshold=0.65, languages=("en",)),
        GopherQualityFilter(min_doc_words=50),
        C4QualityFilter(),
        MinHashDeduplicator(num_hashes=128, jaccard_threshold=0.8),
        JsonlWriter("/nfs/output/"),
    ],
    tasks=512,
    workers=64,
    partition="gpu",
    time="72:00:00",
    mem_per_cpu_gb=4,
    cpus_per_task=4,
    logging_dir="/nfs/logs/",
    slurm_logs_folder="/nfs/slurm-logs/",
)
executor.run()

Adding a GPU-backed custom filter:

python
from datatrove.pipeline.filters.base_filter import BaseFilter
from datatrove.data import Document
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BertQualityFilter(BaseFilter):
    def __init__(self, model_path: str, threshold: float = 0.7):
        super().__init__()
        self.threshold = threshold
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda().half()

    def filter(self, doc: Document) -> bool:
        inputs = self.tokenizer(
            doc.text[:512],
            return_tensors="pt",
            truncation=True,
            padding=True,
        )
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
            score = torch.softmax(logits, dim=-1)[0, 1].item()
        return score >= self.threshold
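
To use it, drop the filter into the executor's pipeline list in place of (or after) the heuristic filters. The model path below is a placeholder for whatever scorer you trained; imports are the same as in the single-node example above.

python
executor = LocalPipelineExecutor(
    pipeline=[
        WarcReader(
            "s3://commoncrawl/crawl-data/CC-MAIN-2024-10/segments/",
            glob_pattern="*.warc.gz",
        ),
        LanguageFilter(language_threshold=0.65, languages=("en",)),
        GopherQualityFilter(min_doc_words=50),
        # GPU-backed quality scoring replaces or follows the heuristic filters
        BertQualityFilter(model_path="./quality-clf-final", threshold=0.7),
        JsonlWriter("./output-scored/"),
    ],
    tasks=8,      # keep task count modest so each task can hold the model in GPU memory
    workers=8,
    logging_dir="./logs-scored/",
)
executor.run()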

NeMo Curator vs Datatrove vs Custom Spark+RAPIDS: Feature and Cost Comparison

| Dimension | NeMo Curator | Datatrove | Custom Spark+RAPIDS |
|---|---|---|---|
| GPU acceleration | Native (cuDF, RAPIDS) | Optional (custom filters) | Via RAPIDS Accelerator for Spark |
| Language | Python (cuDF API) | Python (modular steps) | Scala/Python |
| Fuzzy dedup algorithm | MinHash + LSH (cuDF) | MinHash (CPU) | MinHash via Spark MLlib |
| Multi-node | Dask distributed | Slurm / multiprocessing | Spark cluster |
| FineWeb-Edu compatibility | Partial (add classifiers) | Yes (reference impl.) | Yes |
| Cold-start time | 10-15 min (RAPIDS init) | 2-3 min | 5-10 min (Spark) |
| Best for | 10T+ token fuzzy dedup | Reproducibility / FineWeb-style | Existing Spark infra |

Cost per trillion curated tokens on H100 SXM5:

At $2.57/GPU/hr on-demand ($1.52/GPU/hr spot), curating 100T raw tokens down to 5T high-quality output tokens with NeMo Curator on a 64x H100 cluster (1,536 GPU-hours) costs approximately $3,948 on-demand or $2,335 at spot rates. That works out to roughly $790/trillion output tokens on-demand, or $467/trillion on spot. Since curation workloads are embarrassingly parallel and fully restartable from any checkpoint shard, running on spot is the recommended default, with on-demand as fallback when spot capacity is unavailable.
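
The arithmetic behind those figures takes a few lines to check; the rates and pass rate below are the ones quoted above, so swap in your own numbers.

python
GPU_HOURS = 1_536          # 64x H100 for ~24 hours
ON_DEMAND = 2.57           # $/GPU/hr
SPOT = 1.52                # $/GPU/hr
OUTPUT_TOKENS_T = 5        # trillions of curated tokens out of 100T raw (5% pass rate)

for label, rate in [("on-demand", ON_DEMAND), ("spot", SPOT)]:
    total = GPU_HOURS * rate
    print(f"{label}: ${total:,.0f} total, ${total / OUTPUT_TOKENS_T:,.0f} per trillion output tokens")
# on-demand: $3,948 total, $790 per trillion output tokens
# spot: $2,335 total, $467 per trillion output tokens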

Reproducing the FineWeb-Edu Pipeline on a Spheron Multi-Node Cluster

FineWeb-Edu is the Hugging Face dataset created by running Common Crawl through a quality classifier trained to identify educationally valuable web text. Reproducing this pipeline gives you a repeatable process for building high-quality pretraining corpora.

Cluster setup: Rent H100 SXM5 nodes on Spheron and configure them with Slurm using the setup from the Slurm on GPU cloud guide. For a FineWeb-Edu reproduction run, 8 nodes (64x H100 total) with a shared NFS mount is a practical starting point.

bash
pip install datatrove[processing]

Full FineWeb-Edu pipeline:

python
from datatrove.executor import SlurmPipelineExecutor
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.filters import (
    LanguageFilter,
    GopherQualityFilter,
    C4QualityFilter,
    FineWebEduClassifier,
)
from datatrove.pipeline.dedup import MinHashDeduplicator
from datatrove.pipeline.writers import JsonlWriter

NFS_BASE = "/nfs/fineweb-edu-repro"

executor = SlurmPipelineExecutor(
    pipeline=[
        WarcReader(
            f"{NFS_BASE}/raw-warcs/",
            glob_pattern="*.warc.gz",
        ),
        LanguageFilter(
            language_threshold=0.65,
            languages=("en",),
        ),
        GopherQualityFilter(
            min_doc_words=50,
            max_doc_words=100_000,
            max_symbol_to_word_ratio=0.1,
            max_bullet_lines_ratio=0.9,
        ),
        C4QualityFilter(
            filter_no_terminal_punct=True,
            filter_lorem_ipsum=True,
        ),
        FineWebEduClassifier(
            # Score documents 0-5; keep those >= 3 (educationally valuable)
            cutoff=3,
        ),
        MinHashDeduplicator(
            num_hashes=128,
            jaccard_threshold=0.8,
            output_folder=f"{NFS_BASE}/minhash-sigs/",
        ),
        JsonlWriter(f"{NFS_BASE}/output/"),
    ],
    tasks=512,
    workers=64,
    partition="gpu",
    time="96:00:00",
    mem_per_cpu_gb=8,
    cpus_per_task=4,
    logging_dir=f"{NFS_BASE}/logs/",
    slurm_logs_folder=f"{NFS_BASE}/slurm-logs/",
    slurm_array_parallelism=64,
)

executor.run()

Expected output: For a full CommonCrawl dump, this pipeline passes approximately 5-7% of raw tokens. A 100T token raw corpus produces 5-7T high-quality output tokens after all filtering stages.

The FineWebEduClassifier step runs on CPU by default (it calls a pretrained scorer). For large runs, you can wrap it in the BertQualityFilter pattern shown above to push classification to GPU and increase throughput from ~80 GB/hr to ~1.2 TB/hr.
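
A sketch of that pattern, assuming the publicly released HuggingFaceFW/fineweb-edu-classifier checkpoint and its 0-5 regression-score convention (check the model card before relying on either assumption):

python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datatrove.pipeline.filters.base_filter import BaseFilter
from datatrove.data import Document

class GpuFineWebEduFilter(BaseFilter):
    """GPU port of the edu-scoring step, following the BertQualityFilter pattern above."""

    def __init__(self, model_id: str = "HuggingFaceFW/fineweb-edu-classifier", cutoff: float = 3.0):
        super().__init__()
        self.cutoff = cutoff
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_id)
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda().half()

    def filter(self, doc: Document) -> bool:
        inputs = self.tokenizer(doc.text, return_tensors="pt", truncation=True, max_length=512)
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        with torch.no_grad():
            # Single regression logit interpreted as an educational-value score from 0 to 5 (assumption)
            score = self.model(**inputs).logits.squeeze().float().item()
        return score >= self.cutoff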

Quality Classifier Training: fastText and Small Transformer Scorers

The FineWebEduClassifier covers general web quality well, but domain-specific corpora (legal, medical, scientific, code) need a custom scorer trained on in-domain positive and negative examples.

fastText classifier (CPU, high throughput):

fastText's supervised mode trains a binary classifier in minutes and runs inference at 500 GB/hr on CPU. It works well as a first-pass filter before applying a slower BERT-based scorer.

bash
# Prepare training data (one label per line, fastText format)
cat positive.txt | awk '{print "__label__pos " $0}' > train.txt
cat negative.txt | awk '{print "__label__neg " $0}' >> train.txt

# Train
fasttext supervised \
  -input train.txt \
  -output quality_classifier \
  -epoch 5 \
  -lr 0.5 \
  -wordNgrams 2 \
  -dim 256

# Evaluate
fasttext test quality_classifier.bin dev.txt
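
To apply the trained model inside a Python filtering step, the fasttext package exposes load_model and predict. A minimal sketch; note that predict rejects embedded newlines, so they are stripped first:

python
import fasttext

model = fasttext.load_model("quality_classifier.bin")

def keep(text: str, threshold: float = 0.9) -> bool:
    # predict() returns parallel sequences of labels and probabilities
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__pos" and probs[0] >= threshold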

Small transformer classifier (GPU, higher accuracy):

Fine-tune distilbert-base-uncased or bert-small on 50,000 labeled examples. A sequence classification head learns to score documents 0 (low quality) to 1 (high quality). Deploy via the BertQualityFilter shown in the Datatrove section above.

Training takes under 2 hours on a single H100 for 50k examples with 3 epochs:

python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("json", data_files={
    "train": "labeled_train.jsonl",
    "validation": "labeled_val.jsonl",
})

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2
)

args = TrainingArguments(
    output_dir="./quality-clf",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], eval_dataset=tokenized["validation"])
trainer.train()
trainer.save_model("./quality-clf-final")

For scientific text, allenai/scibert_scivocab_uncased is a better starting point than generic DistilBERT; for multilingual corpora, a multilingual base such as xlm-roberta-base is the better fit.

Cost Math: GPU-Hours per Trillion Tokens of Curated Output

The following table assumes 100T raw input tokens (~400 TB JSONL) and a 5% pass rate (5T curated output tokens). Wall-clock time assumes GPU-parallel dedup and filtering with no bottlenecked single-thread stages.

| Cluster | GPUs | Wall-clock time | Total GPU-hours | On-demand cost | Spot cost | Cost per trillion tokens (on-demand / spot) |
|---|---|---|---|---|---|---|
| Single node | 8x H100 SXM5 | ~190 hrs | 1,520 | $3,906 | $2,310 | ~$781 / ~$462 |
| 8-node cluster | 64x H100 SXM5 | ~24 hrs | 1,536 | $3,948 | $2,335 | ~$790 / ~$467 |
| 32-node cluster | 256x H100 SXM5 | ~6 hrs | 1,536 | $3,948 | $2,335 | ~$790 / ~$467 |

Total GPU-hours are approximately constant across cluster sizes; what changes is wall-clock time. The 32-node cluster finishes the same job in 6 hours rather than 8 days, which matters when you are iterating on pipeline parameters or racing a deadline.

For memory-intensive classifier stages, using H200 on Spheron at $4.22/GPU/hr on-demand ($1.76/GPU/hr spot) gives 141 GB of HBM3e per GPU, allowing larger batch sizes and higher classifier throughput. The additional per-GPU cost is typically offset by fewer GPUs needed for the same wall-clock time.

Pricing fluctuates based on GPU availability. The prices above are based on 12 May 2026 and may have changed. Check current GPU pricing for live rates.

Hand-Off to Training: Parquet, WebDataset, and Megatron-Core Integration

Once curation is done, the output format must match what your training framework expects. This is a common point where teams lose time with format conversion.

NeMo Curator output (Parquet) to Megatron-Core:

Megatron-Core's tools/preprocess_data.py expects JSONL. Convert from Parquet first:

python
import pyarrow.parquet as pq
import json

table = pq.read_table("./curated_output/")
with open("./megatron_input.jsonl", "w") as f:
    for batch in table.to_batches():
        for row in batch.to_pydict()["text"]:
            f.write(json.dumps({"text": row}) + "\n")
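
pq.read_table pulls the entire output into host memory, which is fine for a sample but not for a multi-terabyte directory. A streaming variant with pyarrow.dataset keeps memory bounded; the batch size is an arbitrary choice here:

python
import pyarrow.dataset as ds
import json

dataset = ds.dataset("./curated_output/", format="parquet")
with open("./megatron_input.jsonl", "w") as f:
    # Scan batch-by-batch so memory use stays flat regardless of corpus size
    for batch in dataset.to_batches(columns=["text"], batch_size=65_536):
        for text in batch.to_pydict()["text"]:
            f.write(json.dumps({"text": text}) + "\n")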

Then run Megatron-Core preprocessing to build binary indexed dataset files:

bash
python tools/preprocess_data.py \
  --input ./megatron_input.jsonl \
  --output-prefix ./megatron_dataset \
  --tokenizer-type GPT2BPETokenizer \
  --vocab-file gpt2-vocab.json \
  --merge-file gpt2-merges.txt \
  --append-eod \
  --workers 32

Datatrove WebDataset output for TorchTitan:

TorchTitan reads WebDataset archives natively via torchdata:

python
from datatrove.pipeline.writers import WebDatasetWriter

# Replace JsonlWriter with WebDatasetWriter in your pipeline
WebDatasetWriter(
    output_folder="/nfs/output-webdataset/",
    max_file_size=500,  # MB per shard
)
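
A quick sanity check on the written shards before a trainer touches them. This uses the standalone webdataset package and a guessed shard filename, both assumptions rather than anything the writer above guarantees:

python
import webdataset as wds

# Inspect the first few samples of one shard (path and naming are placeholders)
shard = wds.WebDataset("/nfs/output-webdataset/shard-000000.tar")
for i, sample in enumerate(shard):
    print(sample.keys())  # e.g. dict_keys(['__key__', '__url__', 'txt'])
    if i >= 2:
        break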

NeMo Curator native output:

Write curated output to Parquet using DocumentDataset's built-in method:

python
dataset.to_parquet("./nemo_output/")

Index file generation:

For lazy loading at multi-node training scale, always generate index files before launching training. Megatron-Core preprocessing writes a .idx index file alongside each .bin data file (for the command above, megatron_dataset_text_document.bin and .idx) automatically. Verify the index exists and has the expected document count before starting any training run.
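
A minimal pre-flight check, assuming the default "<prefix>_text_document" naming from the preprocessing command above; it only confirms the files exist and are non-empty (a full document-count check needs Megatron's own dataset reader):

python
from pathlib import Path

prefix = Path("./megatron_dataset_text_document")  # assumption: default Megatron output naming
for ext in (".bin", ".idx"):
    p = prefix.with_suffix(ext)
    assert p.exists() and p.stat().st_size > 0, f"missing or empty: {p}"
    print(f"{p}: {p.stat().st_size / 1e9:.2f} GB")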

See our distributed LLM training guide for the full multi-node training setup once your data is preprocessed, and the continuous pretraining guide for domain-specific data composition strategies when mixing curated web data with domain corpora.

For cost optimization on bursty curation workloads, the spot GPU training case study covers how teams structure jobs to maximize spot fleet utilization.

Cluster Sizing and Spot Fleet Strategy for Curation Workloads

Data curation is almost perfectly suited for spot GPU instances. Each document shard is processed independently, there is no inter-GPU communication during filtering (unlike model training), and pipelines are restartable from any checkpoint shard. If a spot instance gets preempted, the pipeline resumes from the last completed shard.

This makes curation one of the highest-confidence spot workloads in the ML pipeline.
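
The resumability claim is easy to make concrete: keep a one-to-one mapping between input and output shards, and on restart skip any input whose output already exists. A minimal sketch with hypothetical paths and a process_shard stand-in for the real curation pass:

python
from pathlib import Path

IN_DIR = Path("/nfs/raw-shards")       # hypothetical input location
OUT_DIR = Path("/nfs/curated-shards")  # hypothetical output location

def process_shard(src: Path, dst: Path) -> None:
    # Stand-in for one full curation pass over a single shard
    dst.write_bytes(src.read_bytes())

for src in sorted(IN_DIR.glob("*.jsonl")):
    dst = OUT_DIR / src.name
    if dst.exists():
        continue  # completed before the preemption; skip on resume
    tmp = dst.with_suffix(".tmp")
    process_shard(src, tmp)
    tmp.rename(dst)  # atomic rename marks the shard as done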

| Cluster | GPUs | On-demand rate | Spot rate | 100T-token job cost (on-demand) | 100T-token job cost (spot) |
|---|---|---|---|---|---|
| Single node | 8x H100 SXM5 | $20.56/hr | $12.16/hr | $3,906 | $2,310 |
| 8-node cluster | 64x H100 SXM5 | $164.48/hr | $97.28/hr | $3,948 | $2,335 |
| 32-node cluster | 256x H100 SXM5 | $657.92/hr | $389.12/hr | $3,948 | $2,335 |

For the memory-intensive quality classification stage with large BERT models, provisioning H200 on Spheron at $4.22/GPU/hr on-demand ($1.76/GPU/hr spot) gives 141 GB of HBM3e per GPU vs the H100's 80 GB. The extra memory allows 3-4x larger batch sizes for classifier inference, reducing wall-clock time for the slowest pipeline stage.

The practical recommendation for most teams: run a 64-GPU H100 spot cluster for the bulk of the pipeline (dedup, language ID, heuristic filtering), and switch to a smaller 8-GPU H200 cluster for quality classification to maximize classifier throughput.

Spheron's per-second billing makes this kind of staged cluster strategy practical: you spin up the large cluster for 24 hours, release it, then spin up the small H200 cluster for the classifier pass.

Data curation at petabyte scale is a bursty compute workload: weeks of intense GPU use, then silence. Spheron's spot fleet and per-second billing make it practical to spin up 64-256 GPU curation clusters without committing to reserved capacity.

Rent H100 SXM5 → | Rent H200 on Spheron → | View all GPU pricing →
