SubQ 1M-Preview launched May 5, 2026 as the first commercially available subquadratic LLM, accessible via private-beta API with a 1M-token context window. The model weights are not publicly available; the 1M-Preview API is the commercial offering, while a 12M-token configuration is gated to enterprise research partners. This post covers what SubQ's subquadratic architecture means for hardware sizing, inference costs, and how to plan your GPU stack for when open-weights subquadratic models ship.
What SubQ 1M-Preview Is
Standard Transformer self-attention is O(n²) in both compute and memory. Double the context length and the attention cost quadruples. At 1M tokens, the KV cache alone requires hundreds of gigabytes and the attention FLOPs become the dominant cost per forward pass.
SubQ 1M-Preview replaces the attention kernel with a subquadratic algorithm. Depending on the variant, compute grows O(n log n) or O(n) with context length rather than O(n²). The result: at 1M tokens, the SubQ attention step uses approximately 1/52nd the compute of a comparable Transformer. The KV cache grows linearly with context, not quadratically, which is why the 1M-token context window fits on a single H200 node instead of requiring a multi-node tensor parallelism setup.
This puts SubQ in a different category from other non-Transformer architectures:
| Architecture | Context Scaling | Quality at Long Context | Relative KV-Cache Size |
|---|---|---|---|
| Transformer (standard) | O(n²) compute, O(n) cache | Exact attention, full recall | Baseline (quadratic growth in prefill cost) |
| Mamba / SSM | O(n) fixed state | Good throughput, approximate recall | No KV cache (fixed recurrent state) |
| Linear attention (e.g., xLSTM, RWKV-7) | O(n) fixed state | Approximate softmax attention | No KV cache (compressed state) |
| SubQ 1M-Preview | O(n log n) to O(n) | Near-exact or exact attention | Linear growth (not quadratic) |
The key difference from Mamba and linear-attention models: SubQ maintains near-exact or exact attention over the full context window. SSMs and linear-attention models compress sequence history into a fixed-size state, which improves throughput but loses precise recall on specific tokens far back in the context. SubQ keeps the recall quality closer to a standard Transformer while eliminating the quadratic cost. For a detailed breakdown of VRAM allocation across model sizes and context lengths, see the GPU memory requirements guide for LLMs.
Hardware Requirements: VRAM Sizing for Long Context
The VRAM equation for SubQ differs from Transformers. Prefill KV-cache memory grows linearly with context length (not quadratically), so you can predict hardware requirements more cleanly.
Approximate VRAM requirements for SubQ 1M-Preview (based on paper-reported linear scaling):
| Context Length | SubQ KV-Cache VRAM | Total VRAM (weights + KV cache) | Recommended Config | Comparable Transformer Total |
|---|---|---|---|---|
| 128K | ~4 GB | ~30 GB | 1x H200 SXM5 (141GB) | ~40 GB (manageable) |
| 512K | ~16 GB | ~42 GB | 1x H200 SXM5 | ~160 GB (needs 2xH200) |
| 1M | ~32 GB | ~58 GB | 1x H200 SXM5 | ~640 GB (needs 8xH100) |
| 4M | ~128 GB | ~154 GB | 2x H200 SXM5 | Multi-node required |
| 12M (research-gated) | ~384 GB | ~410 GB | 4-6x H200 SXM5 or 2-3x B200 SXM6 | 40+ GPUs required |
At 1M tokens, SubQ runs on a single H200 SXM5. The equivalent Transformer workload at the same context would need 8x H100 or more. At 12M tokens (the research-gated configuration), SubQ fits on a 4-6 GPU H200 node; a Transformer at that context is not practically deployable.
For 1M-4M token context, provision on-demand H200 SXM5 for long-context inference on Spheron. On-demand rates start at $4.54/hr per GPU; spot pricing goes as low as $1.77/hr per GPU for non-latency-sensitive batch workloads.
For 4M-12M context or high-throughput production serving, Spheron B200 SXM6 instances give more headroom: 192GB HBM3e per GPU and 8 TB/s memory bandwidth. On-demand rates start at $9.30/hr per GPU.
Pricing fluctuates based on GPU availability. The prices above are based on 08 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
Inference Stack Planning: What to Prepare When Weights Ship
SubQ 1M-Preview is API-only as of June 2026. The weights are not publicly available. This section covers how to plan your GPU infrastructure so you can move fast once an open-weights subquadratic model becomes available.
Instance Sizing
For 1M-token context, a single H200 SXM5 instance covers the projected VRAM budget based on SubQ's reported linear KV-cache scaling. For the 12M context research configuration, plan for 4-6x H200 SXM5 or 2-3x B200 SXM6. Launch instances on Spheron via the H200 GPU rental page - on-demand instances are live in under 2 minutes.
Deploy with an Ubuntu 22.04 image and the NVIDIA Docker runtime. Verify with nvidia-smi before running any inference workload.
vLLM Integration Path
vLLM is the most likely integration target for SubQ once weights are released, given its support for custom attention kernels and its track record with non-standard model architectures. There is no verified SubQ support in any vLLM release as of June 2026 - the weights are not public and no inference engine could have integrated the custom attention kernel yet. Track the vLLM model support matrix for SubQ support once weights ship.
When SubQ support lands in vLLM, the expected launch pattern for a single H200 node at 1M-token context would follow:
# Forward-looking reference only - SubQ weights are not yet publicly available
vllm serve <subq-model-path> \
--max-model-len 1000000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--served-model-name subq-1m-previewKey flags to plan for:
--max-model-len 1000000: enables the 1M native context window--tensor-parallel-size <N>: set to your GPU count--enable-prefix-caching: SubQ's linear KV cache makes prefix reuse significantly cheaper than on a Transformer because the cache doesn't expand quadratically with context on every new turn--gpu-memory-utilization 0.90: leaves 10% headroom for activation memory and framework buffers
The server exposes an OpenAI-compatible endpoint at http://localhost:8000/v1.
SGLang as an Alternative
SGLang's radix attention cache design would be a strong fit for SubQ once weights are available, particularly for workloads with very high concurrency and tight TTFT requirements at long context. SGLang's efficient KV cache management aligns well with SubQ's linear KV structure. No SGLang support for SubQ exists as of June 2026. Monitor the SGLang project once public weights ship to evaluate whether it has integrated SubQ's attention kernel.
Long-Context Use Cases That Pay Off
Whole-Codebase Analysis
A typical production monorepo runs 500K-5M tokens depending on file count and comment density. With a Transformer, ingesting 500K tokens in a single prompt costs GPU compute quadratically and often requires multi-GPU tensor parallelism just for the prefill step. With SubQ 1M-Preview, the same 500K-token prompt processes in linear time.
Example: ingest a 500K-token codebase as a single context window, then ask SubQ to identify circular dependencies, trace data flows across service boundaries, or generate an accurate dependency graph. The model can answer across the full codebase context without chunking or retrieval.
Multi-Document RAG Without Retrieval
Vector retrieval is a workaround for Transformer context limits. At 1M context, SubQ can ingest dozens of full research papers as a single prompt and answer questions without a retrieval step. You skip the embedding pipeline, the retrieval index, and the context-assembly logic. The tradeoff: you pay the linear prefill cost upfront per query rather than amortizing it across a retrieval index.
This is worth it when recall accuracy matters more than cost and when the document set changes frequently enough that re-indexing for retrieval is expensive. Drug discovery literature reviews, legal discovery, and financial filings analysis are the clearest cases.
Agent Memory Replay
Multi-turn agent systems accumulate large interaction histories. Standard approaches summarize or truncate the history to fit a Transformer's context window, which causes the agent to lose precise recall of earlier steps. With SubQ at 1M context, an agent can pass a large interaction history as context, enabling exact recall of every prior action and decision.
The cost comparison vs a Transformer: at 2M tokens of agent history, a standard Transformer's attention FLOPs make per-turn inference expensive enough that most teams truncate to 128K-256K. SubQ's linear scaling keeps per-turn costs economical at longer context windows.
Benchmarks: Throughput, TTFT, and Cost vs. Alternatives
The following figures are based on the SubQ 1M-Preview paper's reported results and architecture-derived scaling estimates. Costs are calculated using Spheron's H200 on-demand rate ($4.54/hr per GPU, 8-GPU cluster = $36.32/hr). All figures are estimated or paper-reported; independent benchmarks on Spheron infrastructure are not yet available. Treat these as directional rather than production-verified.
| Model | Context Limit | Tokens/sec (8x H200) | TTFT at 1M Tokens | Cost/M Tokens (8x H200, estimated) |
|---|---|---|---|---|
| SubQ 1M-Preview | 12M (research-gated) / 1M (API) | ~80-120 (paper-reported) | ~8-12s (linear prefill) | ~$84-126 (estimated) |
| Llama 4 Scout 109B | 10M tokens | ~20-40 at 1M | ~60-120s (quadratic prefill) | ~$252-504 (estimated) |
| Qwen3-Max 671B | 128K tokens | N/A at 1M (context limit) | N/A | N/A |
| DeepSeek V3 671B MoE | 128K tokens | N/A at 1M (context limit) | N/A | N/A |
The TTFT gap at 1M tokens is the most striking difference. Transformer-based models pay O(n²) prefill compute, so TTFT at 1M tokens is measured in minutes on even an 8x H200 cluster. SubQ's linear prefill keeps TTFT under 12 seconds on the same hardware. For TTFT SLO targets and how to instrument latency in production, see the LLM inference SLO engineering guide.
Production Checklist
- Test stability at extreme context lengths before production. Run 1M+ token prompts in staging and verify the model returns coherent output without OOM or decode failures. SubQ 1M-Preview's context limit claims are paper-reported; validate against your specific GPU memory configuration once weights are available.
- Plan prefix caching from day one. SubQ's linear KV cache makes prefix caching highly effective. Any workload with repeated prompt prefixes (system prompts, document headers, few-shot examples) benefits significantly. Enable
--enable-prefix-cachingin vLLM once SubQ support lands.
- Set a KV cache eviction policy under memory pressure. At very long context with multiple concurrent sessions, KV cache can fill the GPU. Plan to configure
--max-num-seqsin your inference server to limit concurrent sessions to what fits in VRAM before the eviction policy kicks in.
- Adjust autoscaling rules for linear memory. Standard Transformer autoscaling rules (scale when GPU memory hits 80%) will underestimate SubQ's actual capacity. Because KV cache grows linearly rather than quadratically, you can serve more concurrent long-context sessions per GPU than a Transformer config would predict. Benchmark at your target context length and batch size before setting production scaling thresholds.
- Batch sizing at long context. At 1M+ token context, smaller batch sizes preserve TTFT SLO. A batch size of 1-4 at 1M context is typical; batch size 8+ starts introducing queue latency that pushes TTFT above 10s. Benchmark at your target input length to find the right tradeoff.
- Monitor TTFT, ITL, and KV cache utilization separately. SubQ's KV cache utilization metric will differ from Transformer deployments because it grows linearly. The patterns from Transformer serving don't map directly. See the KV cache optimization guide for memory management and utilization instrumentation techniques.
Pricing and Cost Comparison vs. Hyperscalers
Live on-demand pricing for H200 SXM5 and B200 SXM6 on Spheron:
| GPU | VRAM | On-Demand (per GPU/hr) | Spot (per GPU/hr) |
|---|---|---|---|
| H200 SXM5 | 141 GB HBM3e | $4.54 | $1.77 |
| B200 SXM6 | 192 GB HBM3e | $9.30 | $2.74 |
Cost-per-million-tokens at 50 tokens/sec decode throughput (single H200, 1M-token context, batch size 1):
- On-demand: $4.54/hr / (50 tokens/sec × 3600 sec/hr / 1,000,000) = approximately $25.22/M output tokens
- At 1M input tokens + 512 output tokens per request (prefill ~80s on a single H200 + decode ~10s at 50 tok/sec = ~90s total GPU time): total cost per request ≈ $0.114
Compare this to hyperscaler GPU instance pricing for the same H200 hardware tier: AWS p5en.48xlarge (8x H200) lists at approximately $63.30/hr on-demand, or $7.91/GPU/hr, approximately 1.7x higher than Spheron on-demand pricing. Google Cloud A3 Ultra (8x H200) similarly lists above $8/GPU/hr.
Pricing fluctuates based on GPU availability. The prices above are based on 08 Jun 2026 and may have changed. Check current GPU pricing → for live rates.
SubQ 1M-Preview's linear KV-cache growth makes H200 and B200 nodes the practical fit for 1M+ context inference: enough VRAM headroom to avoid multi-node sharding at the context lengths where Transformers require it. Once public weights ship, Spheron's on-demand H200 instances give you the fastest path to production.
H200 SXM5 on Spheron → | B200 SXM6 on Spheron → | View all GPU pricing →
Frequently Asked Questions
SubQ 1M-Preview is a subquadratic attention model that avoids the O(n^2) compute cost of standard self-attention. At 1M+ token context, its attention step runs approximately 52x faster than a comparable Transformer because memory and compute grow linearly with context length instead of quadratically. It launched May 5, 2026 as the first commercial subquadratic model, available via private-beta API. The 1M-Preview API is the commercially available surface; a 12M-token configuration is currently gated to enterprise research partners.
SubQ 1M-Preview eliminates the quadratic KV-cache growth of Transformers. At 1M context it needs approximately 32 GB for the KV cache versus roughly 640 GB total (with KV cache alone accounting for the majority) for an equivalent Transformer, approximately 20x less KV-cache memory at that context length. A single H200 SXM5 (141 GB) handles the full 1M-token context window where a comparable Transformer at that context length would need 8 or more H100 GPUs.
Not yet. As of June 2026, SubQ 1M-Preview is closed-weights and only accessible via the private-beta API at subq.ai with OpenAI-compatible endpoints. No HuggingFace repository or downloadable weights are publicly available. When weights are released, vLLM and SGLang are the expected integration targets given their support for custom attention kernels.
The three highest-ROI use cases are: (1) whole-codebase analysis, where you ingest an entire monorepo as a single prompt, (2) multi-document RAG, where retrieval is replaced by full-document ingestion, and (3) agent memory replay, where an agent's full interaction history is passed as context instead of compressing or summarizing it. At these context lengths, Transformer costs become prohibitive; SubQ's linear scaling makes them economical.
At 1M tokens of context, a standard Transformer spends quadratically more GPU compute on attention than SubQ. The SubQ paper reports approximately 1/5 the attention compute cost at 1M tokens vs a comparably-sized Transformer. In practice this means fewer GPUs for the same throughput target, translating to roughly 70-80% lower per-million-token cost at long context versus an equivalent Transformer serving stack.
