Engineering

NVIDIA Parabricks on GPU Cloud: 50x Faster Genomics Pipelines on H100 and B200 (2026 Guide)

Written by Mitrasish, Co-founder · May 9, 2026

CPU-based GATK on a 30x whole-genome sequencing sample takes 24-30 hours on a 32-core server. Parabricks on a single H100 SXM5 rental on Spheron does the same job in under 45 minutes. For diagnostic labs where turnaround time directly affects clinical decisions, and for pharma genomics teams processing hundreds of samples weekly, that difference is not academic. It changes what pipelines are feasible at what cost. See the GPU cost optimization playbook for the general framework on when GPU compute pays off versus CPU alternatives.

This guide covers what Parabricks actually accelerates (and what it does not), concrete deployment steps on Spheron H100 and B200 instances, hardware sizing for different batch scales, and the cost-per-genome math compared to AWS HealthOmics and on-prem clusters.

What Parabricks Accelerates in 2026

Parabricks 4.x accelerates specific pipeline stages, not the entire genomics workflow. Understanding which stages run on GPU versus CPU is critical for accurate performance expectations.

GPU-accelerated stages:

  • BWA-MEM alignment (fq2bam)
  • GATK HaplotypeCaller (germline SNP/indel calling)
  • DeepVariant (CNN-based variant calling)
  • STAR RNA-seq alignment
  • Mutect2 (somatic variant calling)
  • BQSR (Base Quality Score Recalibration, partial)

Not GPU-accelerated (runs on CPU through standard GATK):

  • VQSR (Variant Quality Score Recalibration)
  • GATK GenotypeGVCFs at scale (multi-sample joint calling)
  • Most tertiary analysis steps (annotation, filtering)
  • GATK CNV workflows

Parabricks does not replace your full pipeline. It accelerates the compute-heavy alignment and variant calling stages. Downstream annotation with Ensembl VEP, ANNOVAR, or filtering with GATK FilterVariantTranches still runs on CPU. Keep that in mind when projecting end-to-end turnaround times to clinical stakeholders.

| Pipeline Stage | Tool | Parabricks Accelerated? | CPU Baseline (30x WGS) | GPU Time (H100) |
|---|---|---|---|---|
| Alignment | BWA-MEM | Yes | 8-12 hr | 8-12 min |
| Sorting | Samtools sort | Partial | 1-2 hr | 15-20 min |
| Mark Duplicates | GATK MarkDuplicates | Yes | 1-2 hr | 5-10 min |
| BQSR | GATK BQSR | Partial | 2-4 hr | 10-15 min |
| Variant Calling | GATK HaplotypeCaller | Yes | 8-14 hr | 8-15 min |
| CNN Variant Calling | DeepVariant | Yes | 4-6 hr | 3-5 min |
| Joint Genotyping | GATK GenotypeGVCFs | No | varies | CPU-bound |
| Annotation | VEP/ANNOVAR | No | 30-60 min | CPU-bound |

Pipeline Walkthrough: FASTQ to VCF on a 30x WGS Sample

The pbrun germline command runs the full FASTQ-to-VCF pipeline in a single call, handling alignment, sorting, deduplication, BQSR, and HaplotypeCaller internally:

```bash
pbrun germline \
  --ref /mnt/nvme/hg38/Homo_sapiens_assembly38.fasta \
  --in-fq /mnt/nvme/input/sample_R1.fastq.gz \
          /mnt/nvme/input/sample_R2.fastq.gz \
  --out-bam /mnt/nvme/output/sample.bam \
  --out-variants /mnt/nvme/output/sample.vcf \
  --num-gpus 1 \
  --logfile /mnt/nvme/logs/sample_pbrun.log
```
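If you prefer DeepVariant's CNN caller over HaplotypeCaller, Parabricks exposes it as a separate subcommand that operates on an already-aligned BAM. A minimal sketch, assuming `pbrun deepvariant` accepts the flags below in your release; the paths reuse the walkthrough above, and the `command -v` guard makes the script a no-op on machines without Parabricks installed:

```shell
#!/usr/bin/env bash
# Sketch: run DeepVariant on the BAM produced by the germline run above.
# Flag names are from the Parabricks 4.x docs -- verify against your release.
set -euo pipefail

REF=/mnt/nvme/hg38/Homo_sapiens_assembly38.fasta
BAM=/mnt/nvme/output/sample.bam
VCF=/mnt/nvme/output/sample.deepvariant.vcf

# Build the command first so it can be logged before execution.
CMD="pbrun deepvariant --ref $REF --in-bam $BAM --out-variants $VCF --num-gpus 1"
echo "$CMD"

# Execute only where Parabricks is actually installed.
if command -v pbrun >/dev/null 2>&1; then
  $CMD
fi
```

Running both callers on the same BAM and intersecting the VCFs is a common QC pattern for clinical pipelines, since the alignment cost is paid only once.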

Timing breakdown per stage on H100 SXM5, H200, and B200, based on published Parabricks benchmarks for a 30x WGS sample:

| Stage | H100 SXM5 (80GB) | H200 SXM5 (141GB) | B200 SXM6 (192GB) |
|---|---|---|---|
| BWA-MEM alignment | ~10 min | ~8 min | ~6 min |
| Sort + MarkDups + BQSR | ~15 min | ~12 min | ~9 min |
| HaplotypeCaller | ~10 min | ~9 min | ~7 min |
| Total FASTQ to VCF | ~35-45 min | ~29-35 min | ~22-28 min |
| CPU baseline (32-core) | ~24 hr | - | - |

The B200's advantage here comes from its higher HBM bandwidth (8 TB/s versus 3.35 TB/s on H100), which directly accelerates BWA-MEM seed extension and HaplotypeCaller local assembly. For a deep dive on how memory bandwidth drives this speedup, the NVIDIA H100 vs H200 comparison covers the same memory-bound performance dynamic in detail.

Hardware Sizing: VRAM, CPU Ratio, and NVMe Requirements

VRAM requirements per concurrent sample:

  • 30x WGS: ~35-40GB peak VRAM per pipeline instance
  • H100 SXM5 (80GB): 1 concurrent sample with headroom; 2 concurrent samples only if each pipeline stays below ~38GB peak (CUDA context overhead consumes 1-2GB, leaving ~78GB usable). Use H200 for guaranteed two-sample headroom.
  • H200 SXM5 (141GB): 3 concurrent samples
  • B200 SXM6 (192GB): 4 concurrent samples

CPU-to-GPU ratio: Parabricks uses CPU threads for I/O, sorting, and non-GPU stages. Target at least 16 CPU threads per active GPU. For a single H100 node, 16-32 CPU cores is sufficient. More cores don't meaningfully help once the GPU becomes the bottleneck.

NVMe staging requirements: A 30x WGS sample generates roughly 50-80GB compressed FASTQ input, 120-150GB intermediate BAM, and 2-4GB final VCF. Running from NFS or object storage bucket mounts adds 20-40% to runtime due to I/O wait. Always stage to local NVMe first.
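The per-sample figures above translate directly into a sizing formula. A back-of-envelope sketch using the upper-bound numbers (80GB FASTQ, 150GB intermediate BAM, 4GB VCF per in-flight sample, plus ~11GB for the reference staged once per node):

```shell
#!/usr/bin/env bash
# NVMe sizing for N concurrent 30x WGS samples, worst-case figures from above.
set -eu

concurrent=${1:-2}                 # default: 2 concurrent samples per H100
per_sample=$((80 + 150 + 4))       # GB per in-flight sample (FASTQ + BAM + VCF)
ref=11                             # GB, hg38 reference + BWA index, once per node
total=$((concurrent * per_sample + ref))
echo "NVMe needed for $concurrent concurrent samples: ~${total}GB"
```

For two concurrent samples this lands just under 500GB, which is where the small-batch row in the table below comes from.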

| Scale | GPU Choice | Concurrent Samples | NVMe Needed | CPU Threads |
|---|---|---|---|---|
| Single clinical sample | H100 SXM5 | 1-2 | 300GB | 16-32 |
| Small batch (10-20 samples/day) | H100 SXM5 | 2 | 500GB | 32 |
| Medium batch (50-100 samples/day) | 4x H100 | 8 | 2TB | 128 |
| High-throughput (500+ samples/day) | 8x B200 | 32 | 8TB | 256 |

Deploying Parabricks on Spheron GPU Cloud

Step 1: Provision the instance

For standard 30x WGS workloads, an H100 SXM5 node covers 1-2 concurrent samples. For high-throughput clinical batch pipelines processing multiple samples simultaneously, a B200 SXM6 node handles up to 4 concurrent 30x samples with room for overhead. Choose on-demand for time-sensitive work where spot interruptions would affect turnaround guarantees; use spot instances for overnight batch runs where you can implement stage-level checkpointing.

Step 2: Pull the Parabricks container

NVIDIA Parabricks is available on NGC. For academic use, no license key is required:

```bash
docker pull nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1
```

Parabricks requires CUDA 12.x drivers. Spheron H100 and B200 nodes ship with CUDA 12.x, but verify the minimum supported CUDA version for your specific Parabricks release before pulling. The nvidia-smi output on the instance will show the installed driver version.
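That driver check can be scripted so it runs before the pull. A sketch, assuming `nvidia-smi` is on PATH on the node; the 12.0 floor is an illustrative placeholder, so substitute the exact minimum from your Parabricks release notes:

```shell
#!/usr/bin/env bash
# Compare the driver's supported CUDA version against a required minimum.
set -eu

cuda_ge() {  # usage: cuda_ge INSTALLED REQUIRED -> true if INSTALLED >= REQUIRED
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

REQUIRED=12.0   # placeholder -- use the minimum from your release notes
if nvidia-smi >/dev/null 2>&1; then
  # nvidia-smi prints the max CUDA version the driver supports in its header.
  INSTALLED=$(nvidia-smi | grep -o 'CUDA Version: [0-9.]*' | awk '{print $3}')
else
  INSTALLED=12.4   # placeholder for dry runs off-node
fi

if cuda_ge "$INSTALLED" "$REQUIRED"; then
  echo "driver OK for CUDA >= $REQUIRED"
else
  echo "driver too old for CUDA $REQUIRED" >&2
fi
```

Wiring this into the instance bootstrap script catches driver mismatches before you burn GPU-hours on a container that refuses to start.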

For commercial deployment, configure your NVIDIA license server endpoint per the Parabricks licensing documentation before running the container. Do not assume the academic free tier applies to clinical diagnostic or commercial pharmaceutical workflows.

Step 3: Set up NVMe staging

Mount the NVMe volume and organize your directory structure:

```
/mnt/nvme/
  ref/          # hg38 reference + BWA index (~11GB)
  input/        # Input FASTQs (staged from object storage)
  output/       # BAM, VCF outputs (copy to persistent storage after run)
  logs/
```

Download the hg38 reference genome and GATK resource bundle files from the Broad Institute's public GCS bucket. The full reference with BWA index files is approximately 11GB. Stage input FASTQs from your object storage bucket to the NVMe volume before launching the pipeline.
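A staging sketch for the reference files. The GCS path below is my best understanding of the Broad public reference bucket, so verify it against the GATK resource bundle documentation before relying on it; the download only fires where `gsutil` and the NVMe mount actually exist:

```shell
#!/usr/bin/env bash
# Stage the hg38 reference + index sidecars from the Broad public GCS bucket.
set -eu

REF_DIR=${REF_DIR:-/mnt/nvme/ref}
SRC=gs://gcp-public-data--broad-references/hg38/v0   # verify before use

n=0
for f in Homo_sapiens_assembly38.fasta \
         Homo_sapiens_assembly38.fasta.fai \
         Homo_sapiens_assembly38.dict; do
  echo "stage: $SRC/$f -> $REF_DIR/"
  if command -v gsutil >/dev/null 2>&1 && [ -d "$REF_DIR" ]; then
    gsutil -q cp "$SRC/$f" "$REF_DIR/"
  fi
  n=$((n + 1))
done
# The BWA index sidecars (.amb, .ann, .bwt, .pac, .sa, .alt) live alongside
# the FASTA in the same bucket and are needed for fq2bam alignment.
```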

Step 4: Run the pipeline

Run pbrun germline as shown in the pipeline walkthrough above. For two concurrent samples on a single H100, launch two separate pbrun germline processes simultaneously, each with --num-gpus 1. This works when each pipeline stays below ~38GB peak VRAM; CUDA context overhead consumes 1-2GB, leaving roughly 78GB usable on the 80GB device. Samples at the upper end of the VRAM range or with higher-than-average read depths may push past this limit. For reliable two-sample concurrency with headroom, use an H200 SXM5 rental (141GB).
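The two-sample launch pattern is just two backgrounded processes and a `wait`. A sketch with placeholder sample names (`sampleA`, `sampleB`); the `command -v` guard keeps it a no-op off-node:

```shell
#!/usr/bin/env bash
# Launch two pbrun germline pipelines concurrently on one GPU.
set -eu

REF=/mnt/nvme/hg38/Homo_sapiens_assembly38.fasta

run_sample() {
  s=$1
  pbrun germline \
    --ref "$REF" \
    --in-fq "/mnt/nvme/input/${s}_R1.fastq.gz" "/mnt/nvme/input/${s}_R2.fastq.gz" \
    --out-bam "/mnt/nvme/output/${s}.bam" \
    --out-variants "/mnt/nvme/output/${s}.vcf" \
    --num-gpus 1 \
    --logfile "/mnt/nvme/logs/${s}_pbrun.log"
}

if command -v pbrun >/dev/null 2>&1; then
  run_sample sampleA &   # each process claims the GPU with --num-gpus 1
  run_sample sampleB &
  wait                   # block until both pipelines finish
fi
```

If either process aborts with an out-of-memory error in the log, drop back to one sample per H100 or move the pair to an H200.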

Step 5: Copy outputs and release

After pbrun germline completes, copy the VCF and BAM files to persistent object storage or a Spheron persistent volume. For spot instances, automate this copy in a post-run script that fires on pipeline exit. The persistent volume stays attached across spot instance reclamations, so subsequent instances pick up from the last completed stage.
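A trap-based sketch of that post-run copy. The temp directories stand in for the NVMe output path and the persistent destination, which are placeholders here; swap the `cp` for your object-store CLI:

```shell
#!/usr/bin/env bash
# Copy-on-exit for spot instances: the trap fires on a clean exit and on the
# SIGTERM most providers send ahead of reclamation.
set -eu

OUT=$(mktemp -d)    # stands in for /mnt/nvme/output
DEST=$(mktemp -d)   # stands in for a bucket or persistent volume path

persist_outputs() {
  cp -r "$OUT"/. "$DEST"/   # swap for e.g.: aws s3 cp --recursive "$OUT" s3://...
}
trap persist_outputs EXIT TERM

# ... pipeline would run here; simulate a completed-stage output:
echo "placeholder VCF content" > "$OUT/sample.vcf"

persist_outputs   # normal-path copy; the trap covers the interrupted paths
```

Because each completed stage's output lands in `$OUT` before the next stage starts, a reclaimed instance loses at most the in-flight stage, not the whole run.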

Scaling: Batch Processing 1,000 Genomes

For high-throughput batch workloads, Nextflow and Snakemake both support GPU resource declarations that map well to Parabricks pipelines.

Nextflow pattern:

The accelerator directive's type value is executor-specific. For AWS Batch use type: 'gpu'; for Google Batch use the GPU model name (e.g., type: 'nvidia-tesla-a100'). Substitute the correct value for your target executor.

```nextflow
process PARABRICKS_GERMLINE {
    container 'nvcr.io/nvidia/clara/clara-parabricks:4.3.1-1'
    accelerator 1, type: 'gpu'   // AWS Batch: 'gpu'; Google Batch: GPU model name e.g. 'nvidia-tesla-a100'

    input:
    tuple val(sample_id), path(fastq_r1), path(fastq_r2)
    path ref

    output:
    tuple val(sample_id), path("${sample_id}.vcf"), path("${sample_id}.bam")

    script:
    """
    pbrun germline \
      --ref ${ref}/hg38.fasta \
      --in-fq ${fastq_r1} ${fastq_r2} \
      --out-bam ${sample_id}.bam \
      --out-variants ${sample_id}.vcf \
      --num-gpus 1
    """
}
```

Snakemake pattern:

```python
rule parabricks_germline:
    input:
        r1="input/{sample}_R1.fastq.gz",
        r2="input/{sample}_R2.fastq.gz",
        ref="ref/hg38.fasta"
    output:
        bam="output/{sample}.bam",
        vcf="output/{sample}.vcf"
    resources:
        nvidia_gpu=1
    shell:
        """
        pbrun germline \
          --ref {input.ref} \
          --in-fq {input.r1} {input.r2} \
          --out-bam {output.bam} \
          --out-variants {output.vcf} \
          --num-gpus 1
        """
```

Batch throughput math for 1,000 samples:

With 2 concurrent samples per H100 node at 45 minutes each, a single node processes approximately 64 samples per day. For 1,000 samples in one day, you need 16 concurrent H100 nodes. Stage the reference genome once per node at launch and reuse it across all samples on that node. Use spot instances with stage-level checkpointing: each pbrun call writes its output (BAM, GVCF, VCF) to NVMe before the next stage starts. If a spot instance is reclaimed mid-HaplotypeCaller, the BAM from the completed alignment step is already written, and the next instance continues from there.
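The node-count arithmetic above, written out so it can be re-run for other batch sizes or runtimes:

```shell
#!/usr/bin/env bash
# Fleet sizing: nodes needed to clear a batch in one day.
set -eu

samples=1000
concurrent=2    # samples per H100 node
run_min=45      # minutes per sample

per_node_per_day=$(( (24 * 60 / run_min) * concurrent ))            # 32 runs * 2 = 64
nodes=$(( (samples + per_node_per_day - 1) / per_node_per_day ))    # ceiling division
echo "$per_node_per_day samples/node/day -> $nodes nodes for $samples samples"
```

Re-running with B200 parameters (`concurrent=4`, `run_min=25`) shows why the high-throughput row in the sizing table needs far fewer nodes per thousand samples.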

Cost Per Genome: Spheron vs AWS HealthOmics vs On-Prem

Live Spheron pricing as of 09 May 2026:

| Option | Cost per 30x WGS Sample | Notes |
|---|---|---|
| Spheron H100 SXM5 on-demand | ~$1.58 | $4.21/hr, 2 concurrent samples, 45 min run |
| Spheron H100 SXM5 spot | ~$0.30 | $0.80/hr spot, same 2-concurrent setup |
| Spheron B200 SXM6 on-demand | ~$0.73 | $7.00/hr, 4 concurrent samples, 25 min run |
| Spheron B200 SXM6 spot | ~$0.18 | $1.71/hr spot, 4 concurrent samples, 25 min run |
| AWS HealthOmics | ~$5-8 | Managed GATK workflow, no GPU control |
| On-prem cluster (32-core) | ~$12-32 | 24-hr runtime, amortized TCO |

The math for the H100 on-demand case: $4.21/hr * 0.75 hr / 2 concurrent samples = $1.58 per sample. For spot, the same calculation at $0.80/hr gives $0.30 per sample. For 1,000 samples, that is $1,580 on-demand or $300 on spot, versus $5,000-8,000 on AWS HealthOmics.
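The same formula, scripted, so the table can be regenerated when rates move:

```shell
#!/usr/bin/env bash
# Per-sample cost = hourly rate * runtime in hours / concurrent samples.
# Rates are the 09 May 2026 snapshot quoted above.
set -eu

per_sample() {  # usage: per_sample RATE_PER_HR RUN_HOURS CONCURRENT
  awk -v r="$1" -v h="$2" -v c="$3" 'BEGIN { printf "%.2f\n", r * h / c }'
}

per_sample 4.21 0.75 2       # H100 on-demand -> 1.58
per_sample 0.80 0.75 2       # H100 spot      -> 0.30
per_sample 7.00 0.4167 4     # B200 on-demand, 25 min run -> 0.73
```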

Pricing fluctuates based on GPU availability. The prices above are based on 09 May 2026 and may have changed. Check current GPU pricing → for live rates.

The on-prem comparison is rougher. A 32-core server at a cloud-equivalent rate of $0.50-1.00/hr for the whole machine, with 24 hours of runtime per sample, works out to $12-24 in compute-equivalent cost; hardware amortization, power, cooling, and staff time push the real figure toward the table's upper bound. For labs running 10+ samples per week, the Spheron spot math often wins even against owned hardware.

For the general framework on when spot GPU compute pays off versus dedicated capacity, the GPU cost optimization playbook covers the reservation and spot decision logic in detail.

When CPU Is Still Cheaper

Parabricks is not always the right tool. Being honest about the break-even math is more useful than overselling the GPU case.

CPU wins when you're running targeted panels. For 50-500 gene panels or WES (whole exome sequencing), total CPU runtime drops to 30-60 minutes on a modern server. GPU setup overhead and per-hour cost stop making sense when the CPU job is nearly as fast.

Low-coverage sequencing. At 5x WGS depth, a 32-core CPU finishes in 2-4 hours. The H100 saves maybe 2 hours. Whether that's worth the hourly rate depends on how many samples you're running.

Exome-only workflows. A full WES pipeline on CPU takes 4-6 hours versus 10-15 minutes on H100. The break-even is roughly 5 exomes per week. Below that threshold, a cloud CPU VM is likely cheaper in total.

Single samples with no urgency. One 30x WGS sample on a spot CPU instance runs GATK in 24 hours for around $1-2 total. On-demand H100 for 45 minutes costs about $3.16 ($4.21/hr × 0.75 hr). If results can wait a day, CPU spot is the cheaper path.

The rule of thumb: GPU acceleration becomes clearly cost-positive when you're running 10 or more 30x WGS samples per batch, or when the 24-hour CPU turnaround time is clinically unacceptable and turnaround directly affects patient care decisions.

Clinical genomics labs processing batches of 30x WGS samples can cut per-genome compute cost to under $0.30 on Spheron H100 spot instances, compared to $5-8 on AWS HealthOmics managed workflows.

Rent H100 → | Rent B200 → | View all GPU pricing →

Build what's next.

The most cost-effective platform for building, training, and scaling machine learning models, ready when you are.