Engineering

Synthetic Data Generation on GPU Cloud: Distilabel, Augmentoolkit, and Nemotron-4 for LLM Fine-Tuning (2026 Guide)

Written by Mitrasish, Co-founder · May 16, 2026
Tags: Synthetic Data Generation LLM · Distilabel GPU Cloud · Augmentoolkit Deployment · Nemotron-4 Synthetic Data · Fine-Tuning Dataset Generation · LLM Instruction Tuning · Constitutional AI Pipeline · MinHash Deduplication · GPU Cloud AI Training

Licensed datasets are contaminated, expensive, and increasingly risky to use in production models. Synthetic data has filled that gap: you generate your training corpus with an LLM, filter it with a judge, and iterate until quality is high enough to actually move your downstream metrics. This guide walks through the full stack for synthetic instruction data production using Distilabel, Augmentoolkit, and Nemotron-4 340B on GPU cloud. For what comes next after you have the data, see our LLM fine-tuning guide for 2026 and, if your goal is pretraining data, the pretraining curation pipeline guide covering NeMo Curator and Datatrove.

Why Synthetic Data Took Over

Licensing and Contamination Risk

The Pile, RedPajama, and Common Crawl all contain substantial fractions of benchmark test sets mixed into pretraining data. MMLU questions appear in C4. GSM8K problems surface in filtered web text. When you fine-tune on licensed internet text, you cannot guarantee your eval sets are clean. The problem gets worse for instruction data: every major human-annotated dataset from 2020-2023 was assembled by contractors who sourced examples from the same web text your models already saw.

Synthetic data side-steps this problem. You control what goes into your seed instructions, you control which model generates responses, and you can run explicit decontamination against every eval set you care about before any training run starts.

Cost of Human Annotation at Scale

At 100K examples, human annotation costs range from $40K to $400K depending on task complexity, annotator expertise, and quality tiers. At 1M examples, those numbers become prohibitive for any team outside a handful of large labs. A Distilabel pipeline running on GPU cloud can generate 1M diverse instruction-response pairs for a few hundred dollars in compute.

The annotation quality tradeoff is real: human annotators catch things LLM judges miss, especially for subtle tone and factual accuracy in niche domains. But for most instruction-following improvements, filtered synthetic data at 10x the scale beats small, expensive human datasets.

The Alignment Tax

RLHF requires preference labels: two responses, one preferred over the other, rated by humans. At scale, that requires a large annotation workforce, strict inter-annotator agreement protocols, and constant quality audits. Constitutional AI (CAI) and judge-and-revise pipelines reduce this cost by 10-100x. The LLM generates a critique of its own response based on a set of principles, then revises. A reward model scores the before and after. No human in the loop beyond the initial constitution design.

| Dimension | Licensed/Real Data | Synthetic Data |
| --- | --- | --- |
| Cost at 1M examples | $40K-$400K (annotation) | $200-$1,000 (compute) |
| Contamination risk | High | Controllable |
| Label consistency | Variable (annotator drift) | Deterministic |
| Legal risk | High (copyright, ToS) | Low (model output) |
| Iteration speed | Weeks per revision | Hours per revision |

Synthetic Data Pipeline Taxonomy

| Pattern | What It Does | Primary Tool | When To Use |
| --- | --- | --- | --- |
| Self-Instruct | Seed instructions prompt an LLM to generate new diverse instructions | Distilabel | General instruction tuning |
| Evol-Instruct | Iteratively rewrites instructions to be harder or more constrained | Distilabel EvolInstruct | WizardLM-style complexity injection |
| Constitutional AI | LLM self-critiques and revises against a principle set | Distilabel UltraFeedback | Alignment-focused datasets |
| Judge-and-Revise | Generator + separate judge model scores each response | Distilabel + ArmoRM | Quality-gated output selection |
| Doc-to-QA | Raw documents converted to question-answer pairs | Augmentoolkit | Domain-specific fine-tuning |

Self-Instruct works by seeding the LLM with 100-200 hand-written example instructions, then prompting it to generate new, topically diverse variations. FLAN task descriptions work well as seeds. The LLM is instructed not to copy the seed verbatim but to vary format, topic, and difficulty.

Evol-Instruct takes an existing instruction and applies a mutation: make it more specific, add constraints, increase depth requirements, or reframe as a multi-step task. After several rounds of evolution, the resulting dataset contains instructions at multiple difficulty levels with minimal surface similarity to the original seeds.
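
Distilabel exposes both of these patterns as tasks: SelfInstruct for seed expansion and EvolInstruct for difficulty mutation. A minimal EvolInstruct sketch, assuming the 1.x parameter names (num_evolutions, store_evolutions) and a local OpenAI-compatible endpoint; verify the exact signatures against your installed version:

python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import EvolInstruct
from distilabel.llms import OpenAILLM

with Pipeline(name="evol-instruct-demo") as pipeline:
    seeds = LoadDataFromDicts(
        data=[{"instruction": "Explain how a Kubernetes liveness probe works."}],
        batch_size=8,
    )
    # Rewrites each instruction num_evolutions times, adding constraints,
    # depth requirements, or multi-step framing at each round
    evolve = EvolInstruct(
        llm=OpenAILLM(
            base_url="http://localhost:8000/v1",
            api_key="local",
            model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        ),
        num_evolutions=3,
        store_evolutions=True,  # keep intermediate difficulty levels, not just the final one
    )
    seeds >> evolve

distiset = pipeline.run()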

Constitutional AI provides the LLM with a list of principles (the "constitution") such as "be helpful, harmless, and honest" at varying levels of specificity. The model generates a first response, critiques it against the constitution, then revises. You can chain multiple critique-revision rounds. The final output is the revised response plus the critique chain, which gives you preference data as a byproduct.
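
The critique-revision loop itself is just a few calls against your generator endpoint. A hedged sketch of a single round using the OpenAI client (the constitution text, endpoint, and model name are placeholders, not a prescribed setup):

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
CONSTITUTION = "Be helpful, harmless, and honest. Refuse to provide dangerous instructions."

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=512,
    )
    return resp.choices[0].message.content

def constitutional_round(instruction: str) -> dict:
    draft = chat(instruction)
    critique = chat(
        f"Principles:\n{CONSTITUTION}\n\nInstruction: {instruction}\n\n"
        f"Response: {draft}\n\nCritique the response against the principles."
    )
    revised = chat(
        f"Instruction: {instruction}\n\nOriginal response: {draft}\n\n"
        f"Critique: {critique}\n\nRewrite the response to address the critique."
    )
    # (draft, revised) doubles as a preference pair; the critique is the rationale
    return {"instruction": instruction, "chosen": revised, "rejected": draft, "critique": critique}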

Judge-and-Revise separates the generator from the evaluator. A small, fast model (Llama-4-Scout) generates N candidate responses per instruction. A separate, higher-quality reward model (ArmoRM-Llama3-8B, Nemotron-4-340B-Reward) scores each candidate. Only the top-K responses by reward score enter the training set.

Doc-to-QA is Augmentoolkit's core pattern: chunk a raw document, generate a question from the chunk, verify the question is answerable from the chunk, generate a final answer, and filter any QA pair the verifier rejects. This pattern is irreplaceable for domain-specific fine-tuning on proprietary documentation.

Distilabel Architecture

Distilabel (v1.x from Argilla) organizes synthetic data production into four abstractions:

Steps are typed, composable units: a regular Step transforms batches of rows, a GeneratorStep produces new rows (loading seeds, for example), and a GlobalStep processes the whole dataset at once. Each step declares its input and output columns, allowing Distilabel to validate the pipeline DAG before any inference runs.

LLMs are backends that steps call for generation or scoring. The vLLM backend (distilabel.llms.vLLM) connects to a locally-served vLLM endpoint. The InferenceEndpointsLLM connects to Hugging Face endpoints. Any OpenAI-compatible API works through OpenAILLM. Each LLM backend handles batching, async calls, and retry logic independently.

Tasks are higher-level step wrappers that pre-package common patterns: TextGeneration for generating responses to instructions, UltraFeedback for rating responses on multiple dimensions, EvolInstruct for difficulty-based instruction mutation.

Pipeline is the DAG that connects steps via the >> operator and manages execution. It handles concurrency, routing batches between steps, and writing output Parquet files.

The Argilla feedback loop sits outside the pipeline: after generation, push the dataset to an Argilla server, have domain experts review a sample, export filtered rows back to Parquet, and feed those rows into the next pipeline iteration as improved seeds.

python
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.llms import vLLM

with Pipeline(
    name="instruction-generation",
    description="Generate and score instruction-response pairs",
) as pipeline:
    # seed_instructions: your hand-written seed list (loaded from a file or defined inline)
    load_seeds = LoadDataFromDicts(
        data=[{"instruction": seed} for seed in seed_instructions],
        batch_size=64,
    )

    generate = TextGeneration(
        llm=vLLM(
            model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
            generation_kwargs={
                "temperature": 0.8,
                "max_new_tokens": 512,
            },
        ),
        num_generations=4,
    )

    evaluate = UltraFeedback(
        llm=vLLM(
            model="nvidia/Nemotron-4-340B-Reward",
            generation_kwargs={"max_new_tokens": 256},
        ),
        aspect="overall-rating",
    )

    load_seeds >> generate >> evaluate

distiset = pipeline.run(use_cache=True)
distiset.push_to_hub("your-org/instruction-dataset-v1")

The num_generations=4 on TextGeneration produces four candidate responses per instruction. UltraFeedback scores each candidate, and the resulting dataset contains all four with scores, letting downstream filtering pick the top-1 by score.
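
That top-1 selection can be a few lines over the returned Distiset. A sketch, assuming the default leaf-step key and the generations/ratings column names that UltraFeedback writes (check the actual column names on your distiset before running):

python
# distiset is returned by pipeline.run(); it behaves like a dict of Hugging Face datasets
rows = distiset["default"]["train"]

def pick_best(row: dict) -> dict:
    # UltraFeedback stores one rating per candidate; keep the highest-scored generation
    scored = [(r, g) for r, g in zip(row["ratings"], row["generations"]) if r is not None]
    best_rating, best_generation = max(scored, key=lambda pair: pair[0])
    return {"instruction": row["instruction"], "response": best_generation, "score": best_rating}

best_only = [
    pick_best(row) for row in rows if any(r is not None for r in row["ratings"])
]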

Hands-On: 100K Instruction Dataset with Distilabel and Llama 4 on Spheron H100

Step 1: Provision the Node

Log in to app.spheron.ai, select an 8x H100 SXM5 instance, and SSH in. On bare-metal H100 SXM5 instances on Spheron, CUDA 12.4 is pre-installed on most images. Install the Python stack:

bash
# On your Spheron H100 node
python3 --version  # verify Python 3.11+
pip install "distilabel[vllm]" datasets argilla "huggingface_hub>=0.23"

# Verify vLLM installed and GPU visible
python3 -c "import torch; print(torch.cuda.device_count(), 'GPUs available')"
# Expected: 8 GPUs available

Step 2: Serve the Generator Model

Llama-4-Scout has 109B total parameters (17B active across 16 experts) and needs roughly 218 GB VRAM for BF16 weights. Across 8x H100 (640 GB total), tensor parallelism of 4 or 8 is appropriate. Note that tp=4 on only 4x H100 (320 GB) is feasible for short context but tight once KV cache and activation memory are included.
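
The 218 GB figure is just parameter count times bytes per parameter. A quick back-of-the-envelope check of the per-GPU weight footprint at different tensor-parallel sizes (KV cache and activations come on top of this):

python
params = 109e9          # Llama-4-Scout total parameters
bytes_per_param = 2     # BF16
weights_gb = params * bytes_per_param / 1e9
print(f"Total weight memory: {weights_gb:.0f} GB")  # ~218 GB

for tp in (4, 8):
    per_gpu = weights_gb / tp
    # H100 SXM5 has 80 GB; vLLM claims gpu_memory_utilization * 80 GB per GPU
    budget = 0.85 * 80
    print(f"tp={tp}: {per_gpu:.1f} GB weights per GPU, "
          f"{budget - per_gpu:.1f} GB left for KV cache under a {budget:.0f} GB budget")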

bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --port 8000 &

# Wait for server ready
sleep 60
curl http://localhost:8000/v1/models

For the judge/reward model in the same pipeline, you have two options: run it on the same node using the remaining GPUs (tp=4 for each model), or spin up a second node for the reward model. The second node approach avoids VRAM contention at large batch sizes.

Step 3: Configure the Pipeline

python
import json
from pathlib import Path
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback
from distilabel.llms import OpenAILLM  # vLLM serves an OpenAI-compatible endpoint

# Load FLAN task seeds or your custom seed set
seeds = json.loads(Path("seed_instructions.json").read_text())

with Pipeline(
    name="100k-instruction-run",
    description="100K instruction dataset with Llama-4-Scout generation",
) as pipeline:
    load_seeds = LoadDataFromDicts(
        data=[{"instruction": s} for s in seeds],
        batch_size=128,
    )

    generate = TextGeneration(
        llm=OpenAILLM(
            base_url="http://localhost:8000/v1",
            api_key="local",
            model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
            generation_kwargs={
                "temperature": 0.8,
                "max_new_tokens": 512,
                "top_p": 0.95,
            },
        ),
        num_generations=2,
        output_mappings={"generation": "response"},
    )

    load_seeds >> generate

distiset = pipeline.run(
    use_cache=True,
    storage_path="./output/instruction-100k",
)

Distilabel writes intermediate Parquet files per batch, so if the run is interrupted you can resume from the last checkpoint with use_cache=True.

Step 4: Run and Monitor

bash
python3 run_pipeline.py

# Monitor GPU utilization in another terminal
watch -n 5 nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.free --format=csv

Expected throughput on 8x H100 SXM5:

| Generator Model | GPUs (tp size) | Tokens/sec | Time for 100K rows (avg 256 tokens out) |
| --- | --- | --- | --- |
| Llama-4-Scout 17B | 4x H100 (tp=4) | ~9,000 | ~48 min |
| Llama-4-Maverick 17B | 4x H100 (tp=4) | ~4,000 | ~107 min |
| Nemotron-4 340B (FP8) | 8x H100 (tp=8) | ~800 | ~8.9 hrs |

For a 100K dataset with a fast generator like Llama-4-Scout, the whole run takes under an hour. With Nemotron-4 as the generator, budget for a full day.

Augmentoolkit: QA Datasets from Raw Documents

Augmentoolkit solves a different problem than Distilabel. You have 500 pages of internal Kubernetes documentation, a proprietary codebase, or a niche technical manual. You want a fine-tuning dataset from that content. Self-Instruct does not help because it needs seed instructions, not raw text. Augmentoolkit is built exactly for this.

The pipeline chunks raw documents, prompts an LLM to generate a question based on each chunk, verifies that the question is actually answerable from that chunk alone (not from world knowledge), generates an answer, and filters any QA pair the verifier rejects.
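
The verification step is the part worth understanding: it asks whether the question can be answered from the chunk alone. A simplified sketch of that check against the local endpoint (these are not Augmentoolkit's actual prompts, just the shape of the pattern):

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

def is_answerable(chunk: str, question: str) -> bool:
    """Ask the model whether the question is answerable from the chunk alone."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Text:\n{chunk}\n\nQuestion: {question}\n\n"
                "Can this question be answered using ONLY the text above, "
                "without outside knowledge? Reply with YES or NO."
            ),
        }],
        temperature=0.0,
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")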

Installation:

bash
git clone https://github.com/e-p-armstrong/augmentoolkit
cd augmentoolkit  # requires Python 3.11
bash linux.sh     # launches the web interface with all dependencies

Configuration (YAML):

yaml
# config.yaml
path: "./input_documents"
output: "./output_qa"
chunk_size: 1500
overlap: 200
model:
  name: "meta-llama/Llama-4-Scout-17B-16E-Instruct"
  api_base: "http://localhost:8000/v1"
  api_key: "local"
  max_tokens: 512
  temperature: 0.7
question_types:
  - factual
  - reasoning
  - multi-hop
filter_threshold: 0.7

Run:

bash
# The web interface opens automatically after bash linux.sh completes setup.
# For CLI use without the interface:
python3 -m venv .venv && source .venv/bin/activate
pip install uv && uv pip install -r requirements.txt
python run_augmentoolkit.py

A realistic example: converting 500 pages of Kubernetes documentation into a fine-tuning corpus takes about 90 minutes on a single H100 with Llama-4-Scout as the generator, and produces roughly 8K-14K filtered QA pairs depending on document density and chunk size.

The Augmentoolkit output is JSONL. Load it into Distilabel for additional quality filtering before use:

python
from distilabel.steps import LoadDataFromDisk

# Augmentoolkit output is a directory of JSONL files
dataset = LoadDataFromDisk(dataset_path="./output_qa")

From here, run the same MinHash dedup and reward model scoring described in the Quality Filtering section below.
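
If you prefer to run those filters as plain Python rather than as Distilabel steps, load the JSONL directly with the datasets library and map it onto the instruction/response keys the filtering snippets below expect. A sketch; the question/answer field names are assumptions, so check your Augmentoolkit output schema first:

python
from datasets import load_dataset

# Every .jsonl file under ./output_qa becomes rows in a single dataset
qa = load_dataset("json", data_files="./output_qa/*.jsonl", split="train")

# Map onto the column names used by the dedup / reward-scoring snippets below
dataset = [
    {"instruction": row["question"], "response": row["answer"]}
    for row in qa
]
print(f"Loaded {len(dataset)} QA pairs")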

Self-Hosting Nemotron-4 340B as a Generator Model

Nemotron-4 340B Instruct (nvidia/Nemotron-4-340B-Instruct) is one of the strongest open-weights generator and reward models available for synthetic data production. The reward model variant (nvidia/Nemotron-4-340B-Reward) is the preferred judge for scoring instruction-following quality. Both require serious GPU resources to serve.

Note: Nemotron-4 340B is a different model from Nemotron Ultra 253B, which is NVIDIA's newer reasoning-focused model. For deploying Nemotron Ultra 253B, see the Nemotron Ultra deployment guide. This section focuses on Nemotron-4 340B specifically for synthetic data generation pipelines.

VRAM Math

| Precision | Model Size | 8x H100 (640 GB) | 4x B200 (768 GB) | 2x B300 (576 GB) |
| --- | --- | --- | --- | --- |
| BF16 | 680 GB | Does not fit | Fits (88 GB headroom for KV cache) | Does not fit |
| FP8 | 340 GB | Fits (300 GB headroom) | Fits | Fits |
| INT4 (AWQ) | 170 GB | Fits easily | Fits easily | Fits |

For short context (up to ~4K tokens), 4x B200 in BF16 is viable. The 88 GB headroom covers the KV cache at those context lengths. For 8K+ context windows or batch sizes above 16, use 8x B200 or FP8 quantization on 4x B200 to avoid OOM in production.
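
To sanity-check that headroom for your own context length and batch size, note that the KV cache grows linearly in both. A rough estimate; the default layer, KV-head, and head-dimension values below are illustrative only and should be replaced with the numbers from the model's config:

python
def kv_cache_gb(context_len: int, batch_size: int,
                layers: int = 96, kv_heads: int = 8,
                head_dim: int = 192, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; bytes_per_elem=2 for a BF16/FP16 KV cache
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size / 1e9

print(kv_cache_gb(4096, 16))   # ~39 GB: fits inside the 88 GB BF16 headroom on 4x B200
print(kv_cache_gb(8192, 16))   # ~77 GB: nearly exhausts it, hence FP8 or more GPUs for long context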

For production quality, use FP8 on H100 nodes or BF16 on B200 nodes. INT4 saves VRAM but degrades reward model scoring accuracy in ways that compound with dataset scale.

To rent B200 GPUs on Spheron for BF16 Nemotron-4 serving, you need a 4x or 8x B200 instance depending on context length requirements. For Blackwell architecture details, FP4/FP8 paths, and NVLink topology relevant to multi-GPU serving, see the B200 complete guide.

Tensor Parallelism Configuration

bash
# On 8x H100 nodes (FP8)
vllm serve nvidia/Nemotron-4-340B-Instruct \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92 \
  --port 8000

# On 4x B200 nodes (BF16, short context)
vllm serve nvidia/Nemotron-4-340B-Instruct \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.88 \
  --port 8000

Throughput Tuning

| Setting | Effect | Recommended Value |
| --- | --- | --- |
| `--max-num-seqs` | Maximum concurrent sequences | 64-128 for generation, 256 for reward scoring |
| `--enable-chunked-prefill` | Reduces latency for long prompts | Enable for prompts > 2K tokens |
| `--max-num-batched-tokens` | Batch token budget | 8192-16384 |

Expected token throughput at FP8 on 8x H100:

| Batch Size | Tokens/sec (generation) | Tokens/sec (reward scoring) |
| --- | --- | --- |
| 16 | ~500 | ~2,400 |
| 64 | ~800 | ~6,000 |
| 128 | ~850 | ~8,000 |

Reward scoring (short outputs, single score token) runs much faster than generation. Plan your pipeline accordingly: you can score 10x faster than you generate, so reward scoring is rarely the bottleneck.

Quality Filtering at Scale

Raw synthetic data from any pipeline contains duplicates, low-quality responses, and sometimes eval set contamination. Run these three filters before any fine-tuning run.

MinHash Deduplication

python
from datasketch import MinHash, MinHashLSH

def make_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for gram in _ngrams(text.lower(), n=5):
        m.update(" ".join(gram).encode("utf-8"))
    return m

def _ngrams(words: str, n: int):
    tokens = words.split()
    return [tokens[i:i+n] for i in range(len(tokens) - n + 1)]

# Build LSH index
lsh = MinHashLSH(threshold=0.8, num_perm=128)

deduplicated = []
for idx, row in enumerate(dataset):
    mh = make_minhash(row["instruction"] + " " + row["response"])
    key = f"row_{idx}"
    if not lsh.query(mh):
        lsh.insert(key, mh)
        deduplicated.append(row)

print(f"Removed {len(dataset) - len(deduplicated)} duplicates "
      f"({100*(len(dataset)-len(deduplicated))/(len(dataset) or 1):.1f}%)")

For GPU-accelerated deduplication at billion-token scale, see the NeMo Curator and Datatrove pipeline guide which covers cuDF-backed MinHash LSH that runs 10-20x faster on GPU.

Reward Model Scoring

Score every row with ArmoRM-Llama3-8B-v0.1 (a strong reward model that runs on a single A100) and discard the bottom 20th percentile:

python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "RLHFlow/ArmoRM-Llama3-8B-v0.1",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("RLHFlow/ArmoRM-Llama3-8B-v0.1")
if tokenizer.pad_token is None:
    # Llama-3 tokenizers often ship without a pad token; needed for padded batches
    tokenizer.pad_token = tokenizer.eos_token

def score_batch(instructions: list[str], responses: list[str]) -> list[float]:
    messages = [
        [{"role": "user", "content": i}, {"role": "assistant", "content": r}]
        for i, r in zip(instructions, responses)
    ]
    # return_dict=True returns input_ids plus attention_mask so the padded batch
    # can be passed straight into the reward model
    inputs = tokenizer.apply_chat_template(
        messages, return_tensors="pt", return_dict=True, padding=True, truncation=True
    ).to("cuda")
    with torch.no_grad():
        # reshape(-1) flattens to one score per row regardless of output shape
        scores = reward_model(**inputs).score.reshape(-1).float().cpu().tolist()
    return scores

def batches(seq, size):
    return [seq[i:i+size] for i in range(0, len(seq), size)]

# Filter bottom 20th percentile
all_scores = []
for batch in batches(dataset, size=32):
    all_scores.extend(score_batch(
        [row["instruction"] for row in batch],
        [row["response"] for row in batch],
    ))

threshold = sorted(all_scores)[int(0.20 * len(all_scores))]
filtered = [row for row, score in zip(dataset, all_scores) if score >= threshold]

Perplexity Filtering

High-perplexity responses are usually incoherent, repetitive, or off-topic. Score every row with a small reference model (GPT-2 or Llama-3.2-1B) and flag anything above 3x the median perplexity:

python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch, math

ppl_model = GPT2LMHeadModel.from_pretrained("gpt2-large").cuda().eval()
ppl_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")

def compute_perplexity(text: str) -> float:
    enc = ppl_tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    enc = {k: v.cuda() for k, v in enc.items()}
    with torch.no_grad():
        loss = ppl_model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

ppls = [compute_perplexity(row["response"]) for row in filtered]
if not ppls:
    final_dataset = []
else:
    median_ppl = sorted(ppls)[len(ppls) // 2]
    cutoff = 3 * median_ppl
    final_dataset = [
        row for row, ppl in zip(filtered, ppls) if ppl <= cutoff
    ]
print(f"Final dataset: {len(final_dataset)} rows")

Cost Math: Synthetic Dataset Generation Across GPU Tiers

The table below uses live pricing fetched from the Spheron API on 16 May 2026. Times are estimates for 1M rows using Nemotron-4 340B FP8 as the generator, which is the most demanding configuration. With Llama-4-Scout as the generator, times drop by 10x.

| GPU | On-Demand ($/GPU/hr) | Spot ($/GPU/hr) | 8-GPU Node/hr (on-demand / spot) | Time for 1M rows | On-Demand Total | Spot Total |
| --- | --- | --- | --- | --- | --- | --- |
| H100 SXM5 | $3.90 | $1.66 | $31.20 / $13.28 | ~72 hrs | ~$2,246 | ~$956 |
| H200 SXM5 | $4.62 | $1.92 | $36.96 / $15.36 | ~60 hrs | ~$2,218 | ~$922 |
| B200 SXM6 | $7.16 | $1.71 | $57.28 / $13.68 | ~45 hrs | ~$2,578 | ~$616 |

B200 spot pricing is particularly attractive for synthetic data jobs. The generation run is stateless up to the output Parquet files, so a preemption just requires restarting from the last completed batch. At $13.68/node/hr, B200 spot is already well under half the H100 on-demand rate ($31.20/hr), and because B200 nodes finish the Nemotron-4 run faster, the total drops to roughly a quarter of the H100 on-demand cost (~$616 versus ~$2,246).

For a Llama-4-Scout-based generation run (no Nemotron-4), time drops to roughly 6-8 hours on H100, bringing the cost of a 1M-row instruction dataset to around $190-250 on H100 on-demand, or under $110 on H100 spot.

Spheron vs hyperscaler comparison for a 1M-row Nemotron-4 dataset run (72 hrs, 8x H100):

| Provider | Instance | 8x GPU rate (est.) | 72-hr total |
| --- | --- | --- | --- |
| Spheron | H100 SXM5 on-demand | $31.20/hr | ~$2,246 |
| Spheron | H100 SXM5 spot | $13.28/hr | ~$956 |
| AWS | p4de.24xlarge (8x A100 80GB) | ~$27.45/hr | ~$1,976 |
| GCP | a3-megagpu-8g (8x H100) | ~$88.49/hr | ~$6,371 |

Against GCP's H100 rate, Spheron on-demand is about 65% cheaper. Spot pricing makes the gap wider: $956 for the full 72-hour run versus $6,371 on GCP on-demand. AWS p4de uses A100 80GB hardware (a different generation), so the comparison is not directly equivalent; on spot, the same Spheron H100 run costs ~$956 versus $1,976 on AWS A100 on-demand.

AWS and GCP rates are on-demand pricing fetched from public pricing pages on 16 May 2026; check provider pages for current rates.

Pricing fluctuates based on GPU availability. The prices above are based on 16 May 2026 and may have changed. Check current GPU pricing → for live rates.

Compliance and Data Provenance

The EU AI Act Article 10 requires that training data for high-risk AI systems be "subject to appropriate data governance and management practices." Synthetic data does not exempt you from this requirement. If your model falls under a high-risk category (medical, legal, financial, HR), you need to document your synthetic data pipeline with the same rigor as licensed datasets.

What to log for compliance audits:

| Field | What to Record |
| --- | --- |
| Generator model | Hugging Face model ID + git commit hash of weights (or SHA256 of model files) |
| Prompt templates | Version-controlled prompt files with hashes |
| Seed data | Source, license, and any filtering applied to seed instructions |
| Filter thresholds | Exact values used for reward score cutoff, PPL cutoff, dedup threshold |
| Output row hashes | SHA256 of every training row (enables row-level provenance) |
| Eval decontamination | Which eval sets were checked, threshold used, number of rows removed |

Data cards: Hugging Face's data card format provides a standard schema for documenting training datasets. For synthetic corpora, the most important fields are curation_rationale, source_data (seeds and generator model), and annotations (judge model and scoring methodology).

Lineage tracking at row level: Assign each generated row a UUID at generation time and log generator model, prompt template version, seed instruction ID, and reward score. Store this metadata in a separate provenance Parquet file alongside the training Parquet. When you later add new rows or remove contaminated rows, log those changes with timestamps. The goal is an audit trail that lets you reconstruct the exact state of the training dataset at any point in time.
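
A minimal sketch of such a provenance record, written as a Parquet file next to the training data. The field names are suggestions rather than a standard schema, the template version is a placeholder, and pandas needs pyarrow installed to write Parquet:

python
import hashlib, uuid
from datetime import datetime, timezone
import pandas as pd

def provenance_record(row: dict) -> dict:
    text = row["instruction"] + "\n" + row["response"]
    return {
        "row_id": str(uuid.uuid4()),
        "row_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "generator_model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "prompt_template_version": "v1.2",  # placeholder: tag of the version-controlled template
        "seed_instruction_id": row.get("seed_id", "unknown"),
        "reward_score": row.get("reward_score", float("nan")),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

records = [provenance_record(row) for row in final_dataset]
pd.DataFrame(records).to_parquet("provenance.parquet", index=False)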

LLM-generated content disclosure: Some jurisdictions require disclosure when training data contains LLM-generated content. Track the fraction of synthetic vs. human-sourced rows in your dataset metadata and include this in any model card for the downstream model.

Production Checklist

Before passing a synthetic dataset to any fine-tuning run, verify these checks:

| Check | Tool/Method | Pass Criteria |
| --- | --- | --- |
| Schema validation | Pydantic model on every row | Zero validation errors |
| Deduplication | MinHash LSH, threshold 0.8 | Less than 1% duplicates remaining |
| Eval contamination | 13-gram overlap vs MMLU, GSM8K, HumanEval | Jaccard below 0.1 for all pairs |
| Reward score floor | ArmoRM-Llama3-8B or Nemotron-4-340B-Reward | Bottom 20th percentile removed |
| PPL filter | GPT-2 perplexity | No row above 3x median |
| Format sanity | Load into tokenizer, count malformed rows | Less than 0.01% malformed |
| Fine-tune sanity | 100-step warmup run, check loss curve | Loss decreasing, no NaN |
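
The eval-contamination row is the only check in this list without code earlier in the guide. A sketch using the same datasketch MinHash approach as the dedup step, with 13-gram shingles and the 0.1 threshold; the eval_questions list is a placeholder for the benchmark test splits you load yourself:

python
from datasketch import MinHash, MinHashLSH

def minhash_13gram(text: str, num_perm: int = 128) -> MinHash:
    tokens = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(tokens) - 12, 0)):
        m.update(" ".join(tokens[i:i + 13]).encode("utf-8"))
    return m

# Placeholder: fill with MMLU / GSM8K / HumanEval test questions you load yourself
eval_questions: list[str] = []

lsh = MinHashLSH(threshold=0.1, num_perm=128)
for j, q in enumerate(eval_questions):
    lsh.insert(f"eval_{j}", minhash_13gram(q))

clean = [
    row for row in final_dataset
    if not lsh.query(minhash_13gram(row["instruction"] + " " + row["response"]))
]
print(f"Removed {len(final_dataset) - len(clean)} contaminated rows")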

Once your dataset passes these checks, the next step depends on your training objective. For reasoning tasks using verifiable rewards, the GRPO fine-tuning guide covers how to train reasoning models where your synthetic dataset provides the instruction prompts and your reward function validates the generated reasoning chains. For standard instruction following, the full LLM fine-tuning workflow covers LoRA, QLoRA, and full fine-tuning configurations with Axolotl and Unsloth. For picking between LoRA variants (DoRA, GaLore, PiSSA, VERA), the PEFT methods 2026 guide compares each approach for synthetic-data fine-tuning jobs.


Synthetic data generation runs are bursty by nature: you need 8 GPUs for 12 hours to produce a dataset, then nothing until the next iteration. Spheron's on-demand GPU billing with no minimum commitments is built for exactly this pattern. Rent H100 GPUs for your next dataset run or compare B200 pricing for Nemotron-4 340B generation.

Start generating on Spheron →


Quick Setup Guide

  1. Provision an H100 node on Spheron

    Log in to app.spheron.ai, select an 8x H100 SXM5 instance, choose on-demand billing, and SSH in. Verify CUDA 12.4 and Python 3.11 are present on the image, then install the vLLM package.

  2. Install Distilabel and configure a pipeline

    Run `pip install distilabel[vllm]`. Define a Pipeline with a TextGeneration step backed by a vLLM LLM pointing to your local model server.

  3. Generate the base dataset

    Pass a seed instruction list (FLAN tasks or a custom seed set) into the pipeline and run `pipeline.run()`. Distilabel writes Parquet files to the output directory automatically.

  4. Apply quality filtering

    Run MinHash deduplication via datasketch, score outputs with a reward model (e.g., ArmoRM-Llama3), and discard responses below the 20th percentile reward score.

  5. Check for eval contamination

    Run 13-gram MinHash overlap detection against MMLU, GSM8K, and HumanEval test splits. Remove any training examples with Jaccard similarity above 0.1.

  6. Fine-tune on the cleaned dataset

    Load the cleaned Parquet files with Axolotl or Unsloth, apply LoRA (r=64, alpha=128) on your target model, and run a sanity-test forward pass before the full training run.


Frequently Asked Questions

What is Distilabel?

Distilabel is an open-source framework by Argilla that chains LLM generators and judges into typed data pipelines. It handles task definition, prompt templating, generator calls, quality scoring, and Argilla export in a single pipeline object. For synthetic data generation it is more reproducible and auditable than ad-hoc LLM-as-a-Judge scripts.

What hardware does Nemotron-4 340B need?

Nemotron-4 340B in BF16 requires roughly 680 GB of VRAM. That needs at least 8x H100 80GB (640 GB total, so FP8 quantization is required) or 4x B200 192GB (768 GB total, enough for BF16 at comfortable headroom). For serving with vLLM, use tensor parallelism tp=8 on H100 nodes or tp=4 on B200 nodes.

How much does it cost to generate a 1M-row instruction dataset?

Cost depends on generation length and the model used. A 1M-row instruction dataset averaging 512 output tokens at Llama-4-Scout generation rates typically runs for 8-16 hours on a single 8x H100 node. At Spheron's on-demand H100 pricing, the total compute cost is typically $250-500 for most instruction-tuning scales. Check current GPU pricing for exact rates.

When should I use Augmentoolkit instead of Distilabel?

Augmentoolkit targets a specific use case: converting raw text documents (technical docs, books, code) into question-answer pairs for domain-specific fine-tuning. Distilabel is a general-purpose pipeline framework for any synthetic data pattern. Use Augmentoolkit when your source material is unstructured documents; use Distilabel when you are implementing instruction generation, evol-instruct, or constitutional AI pipelines from scratch.

How do I check synthetic data for eval set contamination?

Run n-gram overlap detection between your generated dataset and every eval benchmark you plan to use (MMLU, GSM8K, HumanEval, etc.). MinHash with 13-gram shingles and a Jaccard threshold of 0.1 is the standard approach. Dedup your training set against eval sets before any fine-tuning run, not after.
