Engineering

GPU Cloud for AI Drug Discovery: Deploy AlphaFold 3, Boltz-2, and RoseTTAFold All-Atom (2026)

Written by Mitrasish, Co-founder · Apr 28, 2026

Biotech startups scaling from exploratory predictions to high-throughput virtual screening hit a cost wall fast. At $0.50 per prediction from the Isomorphic Labs API, a 100,000-compound screening campaign costs $50,000 before you've validated a single hit. The math gets worse at 1M compounds. That cost shock, combined with open weight releases for AlphaFold 3 in 2024-2025 and Boltz-2 going fully open source, drove a shift toward self-hosted structure prediction pipelines in 2026. See the GPU cost optimization playbook for the general framework on when the API-to-self-host switch makes economic sense.

Why Biotech Moved Off the API in 2026

The API is convenient for one-off predictions. It breaks at scale for several reasons.

API economics don't survive screening. The Isomorphic Labs API runs roughly $0.50 per AlphaFold 3 prediction. ColabFold's public server enforces rate limits that cap throughput below 500 predictions per day. A 1M-compound virtual screen at these rates is not practical, either economically or logistically.

Proprietary compound confidentiality. Every compound SMILES string or novel protein sequence sent to a third-party API leaves your firewall. For discovery-stage programs with unreleased scaffold chemistry, that's an IP exposure problem. In a competitive drug discovery program, the sequence of a novel target variant is itself a trade secret.

Throughput ceiling. Self-hosted pipelines scale horizontally. Add more A100 spot instances and your throughput scales linearly. API quotas don't.

Data residency. EU biotech labs handling clinical-phase compound data face GDPR obligations around where patient-linked molecular data is processed. In-region bare-metal compute with no shared tenancy sidesteps that compliance complexity entirely.

The 2026 Protein Structure Model Landscape

Five models cover most drug discovery workflows. Each has different open-source status, capabilities, and VRAM baselines.

AlphaFold 3 (DeepMind / Isomorphic Labs)

AlphaFold 3 predicts the structure of proteins, DNA, RNA, small molecules, and ions, including their interactions. Weights were released by DeepMind in late 2024 under a custom license that requires accepting terms at the official repository before downloading. The weights are not freely redistributable; you must obtain them directly from DeepMind's release. VRAM requirements: 40GB for monomers under ~1000 residues, 80GB for large multimeric complexes and nucleic acid targets.

Boltz-2 (MIT / Recursion)

Boltz-2 is the second-generation model from the MIT and Recursion Boltz authors, distinct from the original Boltz (v1) release. It supports all biomolecule types including proteins, DNA, RNA, small molecules, and covalent modifications. It is fully open source under the MIT license with no access restrictions. The inference pipeline is faster than AlphaFold 3 for most use cases, the community is active, and the weights (~4GB) download automatically to ~/.boltz/ on the first prediction. VRAM baseline: 40GB for standard monomers and protein-ligand complexes.

RoseTTAFold All-Atom (Baker Lab, UW)

RoseTTAFold All-Atom (RFAA) achieves all-atom accuracy across proteins and small molecules, including covalent ligand interactions. Open weights from the Baker Lab. VRAM minimum: 48GB, making an A100 80GB the practical entry point.

Chai-1 (Chai Discovery)

Chai-1 handles multi-chain complexes with high accuracy. Open weights are available from Chai Discovery. Suited for antibody-antigen and protein-protein docking scenarios.

ESMFold (Meta / EvolutionaryScale)

ESMFold is the fastest option. It runs single-sequence prediction (no MSA required) and completes in seconds per structure. VRAM: 15-20GB, so an L40S or even an A40 works. The tradeoff is accuracy: ESMFold is calibrated for well-characterized protein families and misses some structural details that MSA-based models catch. Use it as a pre-screening step to eliminate unfoldable or disordered sequences before running full AF3 or Boltz-2 on shortlisted candidates.
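A minimal pre-screen sketch using the fair-esm package is shown below. The esm.pretrained.esmfold_v1 loader and infer_pdb call come from that library; the pLDDT cutoff is an assumption you should tune for your target class.

python
import torch
import esm  # pip install fair-esm

# ESMFold at half precision fits in the 15-20GB range cited above
model = esm.pretrained.esmfold_v1().eval().cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical target sequence

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # single-sequence, no MSA

with open("candidate.pdb", "w") as fh:
    fh.write(pdb_string)

# ESMFold writes per-residue pLDDT into the PDB B-factor column.
# Dropping sequences whose mean pLDDT falls below ~50 (assumed cutoff)
# filters likely-disordered candidates before any A100 time is spent.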

Workload Types and GPU Requirements

| Workload | Description | Min VRAM | Recommended GPU |
|---|---|---|---|
| Monomer pre-screen | Single chain, ESMFold no-MSA | 16 GB | L40S, A40 |
| Monomer standard | Single chain, 500-1500 aa | 40 GB | A100 80GB |
| Protein-ligand complex | Protein + small molecule | 40-60 GB | A100 80GB |
| Protein-protein dimer | Two chains | 60-80 GB | A100 80GB, H100 |
| Antibody-antigen | Full IgG + antigen | 80 GB | A100 80GB, H100 |
| Large multimeric assembly | 4+ chains or 2000+ total aa | 80-141 GB | H100, H200 |
| Nucleic acid complexes | Protein + DNA/RNA | 80 GB | H100 |

The A100 80GB covers the bulk of drug discovery workloads: standard monomers, protein-ligand docking, and antibody-antigen pairs. H100 and H200 become necessary for very large assemblies above 1500 residues or RNA-protein complexes with long nucleotide chains.

GPU Memory Math for Protein Structure Prediction

The VRAM equation for protein structure models has three components: fixed model weights, activation memory that scales with sequence length, and MSA representation memory that scales with MSA depth.

VRAM = model_weights + (seq_len^1.5 * recycling_steps * dtype_bytes) + msa_depth_overhead

Model weights are fixed per model: Boltz-2 is roughly 2-4GB, RFAA is similar, AlphaFold 3 is approximately 8GB. The variable cost comes from the attention mechanisms processing the sequence and MSA.

| Model | 256 aa | 512 aa | 1024 aa |
|---|---|---|---|
| Boltz-2 | ~8 GB | ~15 GB | ~38 GB |
| AlphaFold 3 | ~12 GB | ~22 GB | ~55 GB |

These figures are at BF16 with default recycling steps (3 for AF3, 1-3 for Boltz-2). For sequences above 1024 residues or MSA depths above 512 sequences, add 20-40% headroom.
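As a rough planning aid, this heuristic fits in a few lines of Python. The coefficients below are illustrative values loosely fitted to the table above, not published numbers; treat the output as a sizing estimate only.

python
# Illustrative VRAM estimator. Weight sizes come from the article;
# the activation and MSA coefficients are assumed curve fits.
WEIGHTS_GB = {"boltz2": 3.0, "alphafold3": 8.0}
ACT_COEFF_GB = {"boltz2": 0.0010, "alphafold3": 0.0012}  # per residue^1.5

def estimate_vram_gb(model: str, seq_len: int, msa_depth: int = 256) -> float:
    activations = ACT_COEFF_GB[model] * seq_len ** 1.5
    msa_overhead = 0.004 * msa_depth  # small linear term for MSA rows
    headroom = 1.3 if (seq_len > 1024 or msa_depth > 512) else 1.0
    return headroom * (WEIGHTS_GB[model] + activations + msa_overhead)

print(estimate_vram_gb("boltz2", 512))  # ~15.6, near the table's ~15 GB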

The scaling logic mirrors what happens with LLM KV caches: memory grows super-linearly with sequence length (or context length). See GPU memory requirements for LLMs for the parallel math on the LLM side, where the same residue-length versus context-length analogy holds.

Throughput vs Latency: Choosing the Right Instance

Two distinct workflows drive different instance choices.

Interactive or exploratory work. A structural biologist submitting one or two targets and waiting for results needs low latency and guaranteed availability. Use on-demand A100 instances or on-demand H100 for large complexes. On-demand gives you the instance immediately and holds it as long as you need. Expect 10-60 minutes per prediction depending on sequence length and recycling steps.

Batch virtual screening. A 10,000-compound or 1M-compound screen is asynchronous by definition. You submit the queue and check results hours or days later. Spot instances are the right choice here. Spot A100 80GB on Spheron runs at $0.45/hr versus $1.04/hr on-demand, a 57% saving. With a properly checkpointed MSA cache (see below), a spot interruption costs at most one prediction's worth of work, not the full queue. See serverless vs on-demand vs reserved billing explained for the billing model decision framework, and batch LLM inference on GPU cloud for checkpoint-and-resume patterns that apply equally to structure prediction queues.

Deploy Boltz-2 on Spheron

Step 1: Provision an instance. Rent an A100 80GB on Spheron for standard protein-ligand targets, or an H100 for large multimeric assemblies. Choose on-demand if you need the result within the hour, or spot if you're queuing a batch run.

Step 2: Set up the environment.

bash
conda create -n boltz python=3.10
conda activate boltz
pip install --upgrade pip
# No separate CUDA toolkit install is needed: the PyTorch wheels pulled
# in by Boltz bundle their own CUDA runtime libraries.

Step 3: Install Boltz-2.

bash
git clone https://github.com/jwohlwend/boltz.git
cd boltz
pip install -e .
# Weights (~4GB) and the CCD dictionary download automatically to
# ~/.boltz/ on the first prediction.

Step 4: Configure persistent MSA cache.

Mount a persistent volume to your instance from the Spheron dashboard before launching. Set the cache directory:

bash
export BOLTZ_CACHE=/mnt/boltz-cache
mkdir -p $BOLTZ_CACHE

The MSA computation (a deep sequence database search) is the single most time-consuming step per target, often 30-90 minutes on a single CPU. Boltz accepts precomputed MSAs, or can fetch them from the public ColabFold MMseqs2 server via --use_msa_server, though that sends your sequence off-box and defeats the confidentiality argument above. Caching MSA results to a persistent volume means subsequent predictions on the same target sequence skip this step entirely. On a spot cluster, if an instance is reclaimed mid-prediction, the next instance picks up from the cached MSA.
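One way to make the cache spot-safe is to key it by a hash of the target sequence, so any replacement instance can locate prior work deterministically. A sketch follows; the directory layout and .a3m naming are assumptions, not a Boltz convention.

python
import hashlib
from pathlib import Path

MSA_CACHE = Path("/mnt/boltz-cache/msa")  # persistent volume from Step 4

def msa_cache_path(sequence: str) -> Path:
    """Deterministic cache location for one target sequence."""
    key = hashlib.sha256(sequence.encode()).hexdigest()[:16]
    return MSA_CACHE / f"{key}.a3m"

def has_cached_msa(sequence: str) -> bool:
    path = msa_cache_path(sequence)
    return path.exists() and path.stat().st_size > 0

# Workers call has_cached_msa() before starting a database search, so a
# reclaimed spot instance loses at most the prediction in flight.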

Step 5: Run your first prediction.

bash
boltz predict input.fasta \
  --out_dir ./results \
  --cache /mnt/boltz-cache \
  --recycling_steps 3 \
  --diffusion_samples 5

Output files are in CIF format by default. Use --output_format pdb if your downstream tools require PDB.

Step 6: Scale to batch mode.

bash
for fasta in ./targets/*.fasta; do
  boltz predict "$fasta" \
    --out_dir ./results \
    --cache /mnt/boltz-cache \
    --num_workers 4
done

For parallel GPU use across a cluster, distribute individual FASTA files to separate instances. Each instance writes its output to a shared NFS or object storage path. A job scheduler (Celery, Redis queue, or Ray) tracks which targets are complete.
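A resume-safe variant of the loop above checks for a completion marker before each prediction, so a restarted spot instance skips finished targets. This is a sketch: the .done marker convention is an assumption, not Boltz behavior.

python
import subprocess
from pathlib import Path

RESULTS = Path("./results")

for fasta in sorted(Path("./targets").glob("*.fasta")):
    marker = RESULTS / f"{fasta.stem}.done"
    if marker.exists():
        continue  # finished before the last interruption
    subprocess.run(
        ["boltz", "predict", str(fasta),
         "--out_dir", str(RESULTS),
         "--cache", "/mnt/boltz-cache"],
        check=True,  # raise on failure so we never mark a bad run
    )
    marker.touch()  # record completion only after success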

Deploy AlphaFold 3 on Spheron with MSA Caching

AlphaFold 3 is more complex to self-host than Boltz-2 because of the weight license requirement and the MSA database size.

Step 1: Download AF3 weights. Go to the official DeepMind AlphaFold 3 repository and accept the license agreement. The weights are not freely redistributable; you must download them under your own account. Do not share weight files publicly or include them in Docker images.

Step 2: Provision storage for MSA databases. AlphaFold 3's genetic search databases (UniRef90, MGnify, BFD, UniProt, PDB seqres, and the RNA databases) total approximately 252 GB compressed and ~630 GB unzipped. Mount a persistent volume of at least 1 TB from the Spheron dashboard before running the database download scripts.

Step 3: Clone the repo and build the Docker image.

AlphaFold 3 does not publish a pre-built image to any public registry. You must build it locally. The weights are not bundled into the image; they are mounted at runtime.

bash
git clone https://github.com/google-deepmind/alphafold3.git
cd alphafold3
docker build -t alphafold3 -f docker/Dockerfile .

Step 4: Run a prediction.

bash
docker run --gpus all \
  -v /mnt/af3-databases:/databases \
  -v /mnt/af3-cache:/cache \
  -v /mnt/af3-models:/models \
  -v ./input:/input \
  -v ./output:/output \
  alphafold3 \
  python run_alphafold.py \
    --db_dir=/databases \
    --model_dir=/models \
    --input_dir=/input \
    --output_dir=/output

MSA caching. AF3 writes MSA results to the output directory per target. Move completed MSA files to /mnt/af3-cache and symlink or copy them back before re-running a target. This avoids re-running the 30-90 minute database search on subsequent recycling or diffusion step adjustments.
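A small helper can automate that stash-and-restore step. This sketch follows the caching convention described above; it is not part of the AF3 codebase, and the per-target directory naming is an assumption.

python
import shutil
from pathlib import Path

AF3_CACHE = Path("/mnt/af3-cache")
OUTPUT = Path("./output")

def stash_msas(target: str) -> None:
    """Copy a finished target's MSA artifacts to the persistent cache."""
    src, dst = OUTPUT / target, AF3_CACHE / target
    if src.exists() and not dst.exists():
        shutil.copytree(src, dst)

def restore_msas(target: str) -> None:
    """Restore cached MSAs before re-running with new sampling settings."""
    src, dst = AF3_CACHE / target, OUTPUT / target
    if src.exists() and not dst.exists():
        shutil.copytree(src, dst)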

For large antibody-antigen assemblies above 1500 total residues, use Spheron H100 instances rather than A100. The extra headroom in the H100 SXM5's 80GB HBM3 (3.35 TB/s bandwidth, versus the A100's ~2 TB/s HBM2e) handles the peak activation memory during the diffusion phase.

Cost-Per-Prediction Analysis

| Option | GPU | Price/hr | Predictions/hr | Cost/Prediction |
|---|---|---|---|---|
| Spheron A100 80GB spot | A100 SXM4 | $0.45 | 4-8 | ~$0.06-$0.11 |
| Spheron A100 80GB on-demand | A100 SXM4 | $1.04 | 4-8 | ~$0.13-$0.26 |
| Spheron H100 SXM5 spot | H100 SXM5 | $0.80 | 6-12 | ~$0.07-$0.13 |
| Spheron H100 SXM5 on-demand | H100 SXM5 | $2.90 | 6-12 | ~$0.24-$0.48 |
| Spheron H200 SXM5 spot | H200 SXM5 | $1.19 | 8-16 | ~$0.07-$0.15 |
| Spheron H200 SXM5 on-demand | H200 SXM5 | $3.96 | 8-16 | ~$0.25-$0.50 |
| Isomorphic Labs API | N/A | N/A | N/A | ~$0.50 |
| AWS HealthOmics (AF2) | N/A | N/A | N/A | ~$0.80+ |

Predictions per hour assume standard monomers (512-1024 aa) with 3 recycling steps and 5 diffusion samples. Large complexes run slower; pre-screens with ESMFold run faster.

The spot A100 case is the default choice for batch virtual screening. At $0.06-$0.11 per prediction, a naive 1M-compound screen (all compounds through GPU prediction, no pre-filtering) would cost $60,000-$110,000 in GPU compute, versus $500,000 at Isomorphic API pricing. In practice, the three-stage pipeline described in the reference architecture below uses CPU-based RDKit filtering in Stage 1 to eliminate 60-80% of compounds before any GPU work, leaving 200,000-400,000 candidates for Stage 2. That reduces actual GPU spend to roughly $12,000-$45,000 for a typical 1M-compound screen. The H100 and H200 spot tiers are cost-competitive for large assemblies that would OOM on A100.
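That arithmetic reduces to a few lines, handy for re-running the numbers as spot prices move. The inputs are the assumptions from the table above.

python
def screen_cost(n_compounds: int, pass_rate: float,
                price_per_hr: float, preds_per_hr: float) -> float:
    """GPU cost of predicting the post-filter fraction of a library."""
    gpu_hours = n_compounds * pass_rate / preds_per_hr
    return gpu_hours * price_per_hr

# 1M library, 20-40% surviving Stage 1, spot A100 at $0.45/hr, 4-8 preds/hr:
print(screen_cost(1_000_000, 0.20, 0.45, 8))  # $11,250 (best case)
print(screen_cost(1_000_000, 0.40, 0.45, 4))  # $45,000 (worst case)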

Pricing fluctuates based on GPU availability. The prices above were captured on Apr 28, 2026 and may have changed since. Check current GPU pricing for live rates.

For large complex workloads specifically requiring 141GB VRAM, check H200 GPU rental pricing for current on-demand and spot availability.

Compliance: Data Residency and Proprietary Compounds

Sending novel compound SMILES strings or protein sequences to a third-party API is a legal and competitive risk for discovery-stage programs. A scaffold or binding mode that hasn't been published is effectively disclosed to whoever processes it. Most API providers' terms promise not to reuse your data, but the practical risk of a data breach, subpoena, or employee departure at the API provider remains real.

The more acute issue for EU labs is GDPR intersection with clinical molecular data. Phase 1-2 programs often tie compound structures to patient biomarker data, which brings them under GDPR's definition of indirectly identifiable patient information. Processing that data outside the EU without appropriate safeguards (Standard Contractual Clauses, adequacy decisions) creates compliance exposure. Running structure prediction on bare-metal Spheron instances in an EU region, with no shared tenancy and no data leaving your controlled environment, satisfies both the legal requirement and the internal data governance policy.

Reference Architecture: 1M Compounds/Day Virtual Screening Pipeline

A production screening pipeline at this scale runs three tiers in series, each filtering down the candidate set.

Stage 1: SMILES ligand filter on CPU. Apply Lipinski's Rule of Five, ADMET property filters, and synthetic accessibility scoring to all 1M compound SMILES using RDKit. No GPU required: RDKit evaluates tens of thousands of compounds per second per CPU core, so a multi-core node clears 1M compounds in minutes. This eliminates 60-80% of inputs that fail basic drug-likeness criteria, leaving 200,000-400,000 candidates for Stage 2. If the protein target structure is not already known from crystallography or cryo-EM, fold it once with ESMFold on a spot L40S (seconds, 15-20GB VRAM) before the docking stages.
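A minimal Stage 1 filter looks like the sketch below. The cutoffs are the standard Rule of Five; ADMET scoring, synthetic accessibility, and multiprocessing are omitted for brevity, and library.smi is a placeholder path.

python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

def passes_rule_of_five(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # unparseable SMILES: reject outright
        return False
    return (Descriptors.MolWt(mol) <= 500
            and Crippen.MolLogP(mol) <= 5
            and Lipinski.NumHDonors(mol) <= 5
            and Lipinski.NumHAcceptors(mol) <= 10)

with open("library.smi") as fh:
    survivors = [line.strip() for line in fh
                 if line.strip() and passes_rule_of_five(line.strip())]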

Stage 2: Boltz-2 protein-ligand prediction on spot A100 cluster. Pass surviving candidates (typically 20-40% of the original set after Stage 1 filtering) through full Boltz-2 protein-ligand prediction. Each A100 80GB handles 4-8 predictions per hour. At 8 predictions/GPU/hr across 52 A100 spot instances = 416 predictions/hr total, 200,000 compounds complete in ~481 hours (~20 days). To hit a 24-hour turnaround you need approximately 1,040 A100 spot instances.

Stage 3: AlphaFold 3 on top-N hits using on-demand H100. The top 1,000-5,000 candidates from Stage 2 scoring (binding energy, confidence scores) go through full AlphaFold 3 prediction on on-demand H100 for maximum fidelity. This tier is small enough that on-demand is the right call: you want these results fast and without interruption risk.

MSA cache. The MSA search output for each unique target sequence is cached to a shared NFS volume mounted across all instances. If a target appears in multiple screening batches, the MSA runs once. Spot interruptions cost only the current prediction, not the MSA re-run.

Job queue. A Redis-backed queue or Ray cluster distributes work. Each worker pulls a FASTA file, checks the MSA cache, runs Boltz-2, and writes the result to shared object storage. A lightweight coordinator marks completion and tracks failed jobs for retry.
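A worker in this scheme stays compact. The sketch below uses redis-py list operations; the queue names, Redis host, and shared paths are assumptions for illustration.

python
import subprocess
from pathlib import Path

import redis  # pip install redis

r = redis.Redis(host="queue.internal", port=6379)
RESULTS = Path("/mnt/shared/results")

while True:
    item = r.brpop("fasta_queue", timeout=30)  # blocking pop from queue
    if item is None:
        break  # queue drained, worker exits
    fasta = item[1].decode()
    try:
        subprocess.run(
            ["boltz", "predict", fasta,
             "--out_dir", str(RESULTS),
             "--cache", "/mnt/boltz-cache"],
            check=True,
        )
        r.sadd("completed", fasta)   # coordinator sees progress here
    except subprocess.CalledProcessError:
        r.lpush("failed", fasta)     # queued for retry by coordinator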

Input: 1M compound SMILES
       |
Stage 1: RDKit SMILES filter (1M compounds, minutes on CPU)
       |
Stage 2: Boltz-2 on A100 spot x~1040 (200K predictions, 24 hrs)
       |
Stage 3: AlphaFold 3 on H100 on-demand (5K predictions, 8-12 hrs)
       |
Output: ranked hit list with confidence scores


For Ray or Kubernetes cluster provisioning at this scale, configure each node with the shared NFS volume mounted and the same Redis or queue endpoint so workers can coordinate without a central scheduler process.


Biotech startups running high-throughput virtual screening face a clear choice: pay $0.50+ per prediction to an API, or own your pipeline at a fraction of the cost. Spheron's on-demand and spot A100 and H100 instances give you bare-metal GPU access with no sharing, no rate limits, and predictable cost per prediction.

Rent A100 80GB | Rent H100 | View all GPU pricing

Start deploying on Spheron
