Engineering

NVIDIA Parabricks on GPU Cloud: 50x Faster Genomics Pipelines on H100 and B200 (2026 Guide)

Written by Mitrasish, Co-founder · May 9, 2026

CPU-based GATK on a 30x whole-genome sequencing sample takes 24-30 hours on a 32-core server. Parabricks on a single H100 SXM5 rental on Spheron does the same job in under 45 minutes. For diagnostic labs where turnaround time directly affects clinical decisions, and for pharma genomics teams processing hundreds of samples weekly, that difference is not academic. It changes what pipelines are feasible at what cost. See the GPU cost optimization playbook for the general framework on when GPU compute pays off versus CPU alternatives.

This guide covers what Parabricks actually accelerates (and what it does not), concrete deployment steps on Spheron H100 and B200 instances, hardware sizing for different batch scales, and the cost-per-genome math compared to AWS HealthOmics and on-prem clusters.

What Parabricks Accelerates in 2026

Parabricks 4.x accelerates specific pipeline stages, not the entire genomics workflow. Understanding which stages run on GPU versus CPU is critical for accurate performance expectations.

GPU-accelerated stages:

  • BWA-MEM alignment (fq2bam)
  • GATK HaplotypeCaller (germline SNP/indel calling)
  • DeepVariant (CNN-based variant calling)
  • STAR RNA-seq alignment
  • Mutect2 (somatic variant calling)
  • BQSR (Base Quality Score Recalibration, partial)

Not GPU-accelerated (runs on CPU through standard GATK):

  • VQSR (Variant Quality Score Recalibration)
  • GATK GenotypeGVCFs at scale (multi-sample joint calling)
  • Most tertiary analysis steps (annotation, filtering)
  • GATK CNV workflows

Parabricks does not replace your full pipeline. It accelerates the compute-heavy alignment and variant calling stages. Downstream annotation with Ensembl VEP, ANNOVAR, or filtering with GATK FilterVariantTranches still runs on CPU. Keep that in mind when projecting end-to-end turnaround times to clinical stakeholders.

| Pipeline Stage | Tool | Parabricks Accelerated? | CPU Baseline (30x WGS) | GPU Time (H100) |
|---|---|---|---|---|
| Alignment | BWA-MEM | Yes | 8-12 hr | 8-12 min |
| Sorting | Samtools sort | Partial | 1-2 hr | 15-20 min |
| Mark Duplicates | GATK MarkDuplicates | Yes | 1-2 hr | 5-10 min |
| BQSR | GATK BQSR | Partial | 2-4 hr | 10-15 min |
| Variant Calling | GATK HaplotypeCaller | Yes | 8-14 hr | 8-15 min |
| CNN Variant Calling | DeepVariant | Yes | 4-6 hr | 3-5 min |
| Joint Genotyping | GATK GenotypeGVCFs | No | varies | CPU-bound |
| Annotation | VEP/ANNOVAR | No | 30-60 min | CPU-bound |

Pipeline Walkthrough: FASTQ to VCF on a 30x WGS Sample

The pbrun germline command runs the full FASTQ-to-VCF pipeline in a single call, handling alignment, sorting, deduplication, BQSR, and HaplotypeCaller internally:

```bash
pbrun germline \
  --ref /mnt/nvme/hg38/Homo_sapiens_assembly38.fasta \
  --in-fq /mnt/nvme/input/sample_R1.fastq.gz \
          /mnt/nvme/input/sample_R2.fastq.gz \
  --out-bam /mnt/nvme/output/sample.bam \
  --out-variants /mnt/nvme/output/sample.vcf \
  --num-gpus 1 \
  --logfile /mnt/nvme/logs/sample_pbrun.log
```
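If you prefer DeepVariant's CNN caller over HaplotypeCaller, Parabricks exposes it as a separate subcommand that operates on an already-aligned BAM. A minimal sketch, assuming `pbrun deepvariant` accepts the flags below in your release; the paths reuse the walkthrough above, and the `command -v` guard makes the script a no-op on machines without Parabricks installed:

```shell
#!/usr/bin/env bash
# Sketch: run DeepVariant on the BAM produced by the germline run above.
# Flag names are from the Parabricks 4.x docs -- verify against your release.
set -euo pipefail

REF=/mnt/nvme/hg38/Homo_sapiens_assembly38.fasta
BAM=/mnt/nvme/output/sample.bam
VCF=/mnt/nvme/output/sample.deepvariant.vcf

# Build the command first so it can be logged before execution.
CMD="pbrun deepvariant --ref $REF --in-bam $BAM --out-variants $VCF --num-gpus 1"
echo "$CMD"

# Execute only where Parabricks is actually installed.
if command -v pbrun >/dev/null 2>&1; then
  $CMD
fi
```

Running both callers on the same BAM and intersecting the VCFs is a common QC pattern for clinical pipelines, since the alignment cost is paid only once.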

Timing breakdown per stage on H100 SXM5, H200, and B200, based on published Parabricks benchmarks for a 30x WGS sample:

| Stage | H100 SXM5 (80GB) | H200 SXM5 (141GB) | B200 SXM6 (192GB) |
|---|---|---|---|
| BWA-MEM alignment | ~10 min | ~8 min | ~6 min |
| Sort + MarkDups + BQSR | ~15 min | ~12 min | ~9 min |
| HaplotypeCaller | ~10 min | ~9 min | ~7 min |
| Total FASTQ to VCF | ~35-45 min | ~29-35 min | ~22-28 min |
| CPU baseline (32-core) | ~24 hr | - | - |

The B200's advantage here comes from its higher HBM bandwidth (8 TB/s versus 3.35 TB/s on H100), which directly accelerates BWA-MEM seed extension and HaplotypeCaller local assembly. For a deep dive on how memory bandwidth drives this speedup, the NVIDIA H100 vs H200 comparison covers the same memory-bound performance dynamic in detail.

Hardware Sizing: VRAM, CPU Ratio, and NVMe Requirements

VRAM requirements per concurrent sample:

  • 30x WGS: ~35-40GB peak VRAM per pipeline instance
  • H100 SXM5 (80GB): 1 concurrent sample with headroom; 2 concurrent samples only if each pipeline stays below ~38GB peak (CUDA context overhead consumes 1-2GB, leaving ~78GB usable). Use H200 for guaranteed two-sample headroom.
  • H200 SXM5 (141GB): 3 concurrent samples
  • B200 SXM6 (192GB): 4 concurrent samples

CPU-to-GPU ratio: Parabricks uses CPU threads for I/O, sorting, and non-GPU stages. Target at least 16 CPU threads per active GPU. For a single H100 node, 16-32 CPU cores is sufficient. More cores don't meaningfully help once the GPU becomes the bottleneck.

NVMe staging requirements: A 30x WGS sample generates roughly 50-80GB compressed FASTQ input, 120-150GB intermediate BAM, and 2-4GB final VCF. Running from NFS or object storage bucket mounts adds 20-40% to runtime due to I/O wait. Always stage to local NVMe first.
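The per-sample figures above translate directly into a sizing formula. A back-of-envelope sketch using the upper-bound numbers (80GB FASTQ, 150GB intermediate BAM, 4GB VCF per in-flight sample, plus ~11GB for the reference staged once per node):

```shell
#!/usr/bin/env bash
# NVMe sizing for N concurrent 30x WGS samples, worst-case figures from above.
set -eu

concurrent=${1:-2}                 # default: 2 concurrent samples per H100
per_sample=$((80 + 150 + 4))       # GB per in-flight sample (FASTQ + BAM + VCF)
ref=11                             # GB, hg38 reference + BWA index, once per node
total=$((concurrent * per_sample + ref))
echo "NVMe needed for $concurrent concurrent samples: ~${total}GB"
```

For two concurrent samples this lands just under 500GB, which is where the small-batch row in the table below comes from.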

| Scale | GPU Choice | Concurrent Samples | NVMe Needed | CPU Threads |
|---|---|---|---|---|
| Single clinical sample | H100 SXM5 | 1-2 | 300GB | 16-32 |
| Small batch (10-20 samples/day) | H100 SXM5 | 2 | 500GB | 32 |
| Medium batch (50-100 samples/day) | 4x H100 | 8 | 2TB | 128 |
| High-throughput (500+ samples/day) | 8x B200 | 32 | 8TB | 256 |

Deploying Parabricks on Spheron GPU Cloud

Step 1: Provision the instance

For standard 30x WGS workloads, an H100 SXM5 node covers 1-2 concurrent samples. For high-throughput clinical batch pipelines processing multiple samples simultaneously, a B200 SXM6 node handles up to 4 concurrent 30x samples with room for overhead. Choose on-demand for time-sensitive work where spot interruptions would affect turnaround guarantees; use spot instances for overnight batch runs where you can implement stage-level checkpointing.

Step 2: Pull the Parabricks container

NVIDIA Parabricks is available on NGC. For academic use, no license key is required:

```bash
docker pull nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1
```

Parabricks requires CUDA 12.x drivers. Spheron H100 and B200 nodes ship with CUDA 12.x, but verify the minimum supported CUDA version for your specific Parabricks release before pulling. The nvidia-smi output on the instance will show the installed driver version.
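That driver check can be scripted so it runs before the pull. A sketch, assuming `nvidia-smi` is on PATH on the node; the 12.0 floor is an illustrative placeholder, so substitute the exact minimum from your Parabricks release notes:

```shell
#!/usr/bin/env bash
# Compare the driver's supported CUDA version against a required minimum.
set -eu

cuda_ge() {  # usage: cuda_ge INSTALLED REQUIRED -> true if INSTALLED >= REQUIRED
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

REQUIRED=12.0   # placeholder -- use the minimum from your release notes
if nvidia-smi >/dev/null 2>&1; then
  # nvidia-smi prints the max CUDA version the driver supports in its header.
  INSTALLED=$(nvidia-smi | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
else
  INSTALLED=12.4   # placeholder for dry runs off-node
fi

if cuda_ge "$INSTALLED" "$REQUIRED"; then
  echo "driver OK for CUDA >= $REQUIRED"
else
  echo "driver too old for CUDA $REQUIRED" >&2
fi
```

Wiring this into the instance bootstrap script catches driver mismatches before you burn GPU-hours on a container that refuses to start.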

For commercial deployment, configure your NVIDIA license server endpoint per the Parabricks licensing documentation before running the container. Do not assume the academic free tier applies to clinical diagnostic or commercial pharmaceutical workflows.

Step 3: Set up NVMe staging

Mount the NVMe volume and organize your directory structure:

```
/mnt/nvme/
  ref/          # hg38 reference + BWA index (~11GB)
  input/        # Input FASTQs (staged from object storage)
  output/       # BAM, VCF outputs (copy to persistent storage after run)
  logs/
```

Download the hg38 reference genome and GATK resource bundle files from the Broad Institute's public GCS bucket. The full reference with BWA index files is approximately 11GB. Stage input FASTQs from your object storage bucket to the NVMe volume before launching the pipeline.
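A staging sketch for the reference files. The GCS path below is my best understanding of the Broad public reference bucket, so verify it against the GATK resource bundle documentation before relying on it; the download only fires where `gsutil` and the NVMe mount actually exist:

```shell
#!/usr/bin/env bash
# Stage the hg38 reference + index sidecars from the Broad public GCS bucket.
set -eu

REF_DIR=${REF_DIR:-/mnt/nvme/ref}
SRC=gs://gcp-public-data--broad-references/hg38/v0   # verify before use

n=0
for f in Homo_sapiens_assembly38.fasta \
         Homo_sapiens_assembly38.fasta.fai \
         Homo_sapiens_assembly38.dict; do
  echo "stage: $SRC/$f -> $REF_DIR/"
  if command -v gsutil >/dev/null 2>&1 && [ -d "$REF_DIR" ]; then
    gsutil -q cp "$SRC/$f" "$REF_DIR/"
  fi
  n=$((n + 1))
done
# The BWA index sidecars (.amb, .ann, .bwt, .pac, .sa, .alt) live alongside
# the FASTA in the same bucket and are needed for fq2bam alignment.
```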

Step 4: Run the pipeline

Run pbrun germline as shown in the pipeline walkthrough above. For two concurrent samples on a single H100, launch two separate pbrun germline processes simultaneously, each with --num-gpus 1. This works when each pipeline stays below ~38GB peak VRAM; CUDA context overhead consumes 1-2GB, leaving roughly 78GB usable on the 80GB device. Samples at the upper end of the VRAM range or with higher-than-average read depths may push past this limit. For reliable two-sample concurrency with headroom, use an H200 SXM5 rental (141GB).
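The two-sample launch pattern is just two backgrounded processes and a `wait`. A sketch with placeholder sample names (`sampleA`, `sampleB`); the `command -v` guard keeps it a no-op off-node:

```shell
#!/usr/bin/env bash
# Launch two pbrun germline pipelines concurrently on one GPU.
set -eu

REF=/mnt/nvme/hg38/Homo_sapiens_assembly38.fasta

run_sample() {
  s=$1
  pbrun germline \
    --ref "$REF" \
    --in-fq "/mnt/nvme/input/${s}_R1.fastq.gz" "/mnt/nvme/input/${s}_R2.fastq.gz" \
    --out-bam "/mnt/nvme/output/${s}.bam" \
    --out-variants "/mnt/nvme/output/${s}.vcf" \
    --num-gpus 1 \
    --logfile "/mnt/nvme/logs/${s}_pbrun.log"
}

if command -v pbrun >/dev/null 2>&1; then
  run_sample sampleA &   # each process claims the GPU with --num-gpus 1
  run_sample sampleB &
  wait                   # block until both pipelines finish
fi
```

If either process aborts with an out-of-memory error in the log, drop back to one sample per H100 or move the pair to an H200.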

Step 5: Copy outputs and release

After pbrun germline completes, copy the VCF and BAM files to persistent object storage or a Spheron persistent volume. For spot instances, automate this copy in a post-run script that fires on pipeline exit. The persistent volume stays attached across spot instance reclamations, so subsequent instances pick up from the last completed stage.
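A trap-based sketch of that post-run copy. The temp directories stand in for the NVMe output path and the persistent destination, which are placeholders here; swap the `cp` for your object-store CLI:

```shell
#!/usr/bin/env bash
# Copy-on-exit for spot instances: the trap fires on a clean exit and on the
# SIGTERM most providers send ahead of reclamation.
set -eu

OUT=$(mktemp -d)    # stands in for /mnt/nvme/output
DEST=$(mktemp -d)   # stands in for a bucket or persistent volume path

persist_outputs() {
  cp -r "$OUT"/. "$DEST"/   # swap for e.g.: aws s3 cp --recursive "$OUT" s3://...
}
trap persist_outputs EXIT TERM

# ... pipeline would run here; simulate a completed-stage output:
echo "placeholder VCF content" > "$OUT/sample.vcf"

persist_outputs   # normal-path copy; the trap covers the interrupted paths
```

Because each completed stage's output lands in `$OUT` before the next stage starts, a reclaimed instance loses at most the in-flight stage, not the whole run.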

Scaling: Batch Processing 1,000 Genomes

For high-throughput batch workloads, Nextflow and Snakemake both support GPU resource declarations that map well to Parabricks pipelines.

Nextflow pattern:

The accelerator directive's type value is executor-specific. For AWS Batch use type: 'gpu'; for Google Batch use the GPU model name (e.g., type: 'nvidia-tesla-a100'). Substitute the correct value for your target executor.

```nextflow
process PARABRICKS_GERMLINE {
    container 'nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1'
    accelerator 1, type: 'gpu'   // AWS Batch: 'gpu'; Google Batch: GPU model name e.g. 'nvidia-tesla-a100'

    input:
    tuple val(sample_id), path(fastq_r1), path(fastq_r2)
    path ref

    output:
    tuple val(sample_id), path("${sample_id}.vcf"), path("${sample_id}.bam")

    script:
    """
    pbrun germline \
      --ref ${ref}/hg38.fasta \
      --in-fq ${fastq_r1} ${fastq_r2} \
      --out-bam ${sample_id}.bam \
      --out-variants ${sample_id}.vcf \
      --num-gpus 1
    """
}
```

Snakemake pattern:

```python
rule parabricks_germline:
    input:
        r1="input/{sample}_R1.fastq.gz",
        r2="input/{sample}_R2.fastq.gz",
        ref="ref/hg38.fasta"
    output:
        bam="output/{sample}.bam",
        vcf="output/{sample}.vcf"
    resources:
        nvidia_gpu=1
    shell:
        """
        pbrun germline \
          --ref {input.ref} \
          --in-fq {input.r1} {input.r2} \
          --out-bam {output.bam} \
          --out-variants {output.vcf} \
          --num-gpus 1
        """
```

Batch throughput math for 1,000 samples:

With 2 concurrent samples per H100 node at 45 minutes each, a single node processes approximately 64 samples per day. For 1,000 samples in one day, you need 16 concurrent H100 nodes. Stage the reference genome once per node at launch and reuse it across all samples on that node. Use spot instances with stage-level checkpointing: each pbrun call writes its output (BAM, GVCF, VCF) to NVMe before the next stage starts. If a spot instance is reclaimed mid-HaplotypeCaller, the BAM from the completed alignment step is already written, and the next instance continues from there.
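The node-count arithmetic above, written out so it can be re-run for other batch sizes or runtimes:

```shell
#!/usr/bin/env bash
# Fleet sizing: nodes needed to clear a batch in one day.
set -eu

samples=1000
concurrent=2    # samples per H100 node
run_min=45      # minutes per sample

per_node_per_day=$(( (24 * 60 / run_min) * concurrent ))            # 32 runs * 2 = 64
nodes=$(( (samples + per_node_per_day - 1) / per_node_per_day ))    # ceiling division
echo "$per_node_per_day samples/node/day -> $nodes nodes for $samples samples"
```

Re-running with B200 parameters (`concurrent=4`, `run_min=25`) shows why the high-throughput row in the sizing table needs far fewer nodes per thousand samples.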

Cost Per Genome: Spheron vs AWS HealthOmics vs On-Prem

Live Spheron pricing as of 09 May 2026:

| Option | Cost per 30x WGS Sample | Notes |
|---|---|---|
| Spheron H100 SXM5 on-demand | ~$1.58 | $4.21/hr, 2 concurrent samples, 45 min run |
| Spheron H100 SXM5 spot | ~$0.30 | $0.80/hr spot, same 2-concurrent setup |
| Spheron B200 SXM6 on-demand | ~$0.73 | $7.00/hr, 4 concurrent samples, 25 min run |
| Spheron B200 SXM6 spot | ~$0.18 | $1.71/hr spot, 4 concurrent samples, 25 min run |
| AWS HealthOmics | ~$5-8 | Managed GATK workflow, no GPU control |
| On-prem cluster (32-core) | ~$12-32 | 24-hr runtime, amortized TCO |

The math for the H100 on-demand case: $4.21/hr * 0.75 hr / 2 concurrent samples = $1.58 per sample. For spot, the same calculation at $0.80/hr gives $0.30 per sample. For 1,000 samples, that is $1,580 on-demand or $300 on spot, versus $5,000-8,000 on AWS HealthOmics.
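The same formula, scripted, so the table can be regenerated when rates move:

```shell
#!/usr/bin/env bash
# Per-sample cost = hourly rate * runtime in hours / concurrent samples.
# Rates are the 09 May 2026 snapshot quoted above.
set -eu

per_sample() {  # usage: per_sample RATE_PER_HR RUN_HOURS CONCURRENT
  awk -v r="$1" -v h="$2" -v c="$3" 'BEGIN { printf "%.2f\n", r * h / c }'
}

per_sample 4.21 0.75 2       # H100 on-demand -> 1.58
per_sample 0.80 0.75 2       # H100 spot      -> 0.30
per_sample 7.00 0.4167 4     # B200 on-demand, 25 min run -> 0.73
```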

Pricing fluctuates based on GPU availability. The prices above are based on 09 May 2026 and may have changed. Check current GPU pricing → for live rates.

The on-prem comparison is rougher. A 32-core server at a cloud-equivalent rate of $0.50-1.00/hr for the whole machine, with 24 hours of runtime per sample, works out to $12-24 in compute-equivalent cost; hardware amortization, power, cooling, and staff time push the real figure toward the table's upper bound. For labs running 10+ samples per week, the Spheron spot math often wins even against owned hardware.

For the general framework on when spot GPU compute pays off versus dedicated capacity, the GPU cost optimization playbook covers the reservation and spot decision logic in detail.

When CPU Is Still Cheaper

Parabricks is not always the right tool. Being honest about the break-even math is more useful than overselling the GPU case.

CPU wins when you're running targeted panels. For 50-500 gene panels or WES (whole exome sequencing), total CPU runtime drops to 30-60 minutes on a modern server. GPU setup overhead and per-hour cost stop making sense when the CPU job is nearly as fast.

Low-coverage sequencing. At 5x WGS depth, a 32-core CPU finishes in 2-4 hours. The H100 saves maybe 2 hours. Whether that's worth the hourly rate depends on how many samples you're running.

Exome-only workflows. A full WES pipeline on CPU takes 4-6 hours versus 10-15 minutes on H100. The break-even is roughly 5 exomes per week. Below that threshold, a cloud CPU VM is likely cheaper in total.

Single samples with no urgency. One 30x WGS sample on a spot CPU instance runs GATK in 24 hours for around $1-2 total. On-demand H100 for 45 minutes costs about $3.16 ($4.21/hr × 0.75 hr). If results can wait a day, CPU spot is the cheaper path.

The rule of thumb: GPU acceleration becomes clearly cost-positive when you're running 10 or more 30x WGS samples per batch, or when the 24-hour CPU turnaround time is clinically unacceptable and turnaround directly affects patient care decisions.

Clinical genomics labs processing batches of 30x WGS samples can cut per-genome compute cost to under $0.30 on Spheron H100 spot instances, compared to $5-8 on AWS HealthOmics managed workflows.

Rent H100 → | Rent B200 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.