Deploy Step-3.7-Flash on GPU Cloud: Self-Host StepFun's 198B MoE Agentic Vision Model with vLLM (2026 Setup Guide)

Step-3.7-Flash has 198B total parameters, of which 11B are active per forward pass. The GPU math is the same as for any large MoE: the weights for all 198B parameters must sit in VRAM, but per-token compute is close to that of an 11B dense model. That gap between storage cost and compute cost is what makes this model practical to self-host.

The vision encoder is a fixed 1.8B ViT component on top of the 196B language backbone. It handles native image and document input, so agents can pass screenshots and scanned pages directly to the model. StepFun designed Step-3.7-Flash for concurrent agent workflows where multiple workers need fast, short generation cycles with periodic vision inputs.

This guide covers VRAM sizing for the BF16, FP8, and NVFP4 checkpoints, vLLM deployment on Spheron, agentic serving configuration, and the full cost breakdown.

What Is Step-3.7-Flash

Step-3.7-Flash is StepFun's mid-2026 vision-language model. It targets agentic use cases where low per-call latency and native visual understanding matter more than single-call benchmark scores.

Spec	Value
Total Parameters	198B (196B language backbone + 1.8B vision encoder)
Active Parameters	11B per forward pass
Architecture	Mixture of Experts (MoE)
Context Window	256K tokens
Attention	Hybrid sliding-window (512-token window) + global at 3:1 ratio
Vision	1.8B ViT encoder, native image/document input
Official Checkpoints	BF16, FP8, NVFP4
Framework support	vLLM, SGLang, NVIDIA NIM
Released	Late May 2026 (StepFun)
Throughput (reported)	~400 tok/s

The hybrid sliding-window (512-token window) plus global attention design is what makes a 256K context practical. Most attention is local, through the sliding window; global attention layers at a 3:1 ratio handle long-range dependencies. This keeps attention compute and KV cache size well below what full self-attention at every layer would require. The same hybrid layout is why --disable-cascade-attn is a required flag for this model in vLLM: the hybrid SWA/GA schedule is not compatible with cascade attention. MTP-3 (3-way Multi-Token Prediction) is a separate speculative decoding feature that predicts multiple output tokens in parallel; it is not part of the attention mechanism.
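To get a rough feel for why the hybrid layout helps at long context, here is a minimal sketch. It assumes three sliding-window layers for every global layer (one reading of the 3:1 ratio, not confirmed here) and counts cached token positions per layer rather than bytes. `kv_positions_per_layer` is an illustrative helper, not part of vLLM or the model code.

```python
def kv_positions_per_layer(context_len: int, window: int = 512,
                           swa_per_global: int = 3) -> float:
    """Average number of cached token positions per layer.

    Assumption: layers repeat in groups of `swa_per_global` sliding-window
    layers followed by one global layer. A sliding-window layer keeps only
    the most recent `window` positions; a global layer keeps the full context.
    """
    swa_positions = min(context_len, window)
    group_total = swa_per_global * swa_positions + context_len
    return group_total / (swa_per_global + 1)


full = 256 * 1024  # 256K-token context
hybrid = kv_positions_per_layer(full)
print(f"full attention: {full} positions per layer")
print(f"hybrid SWA/GA : {hybrid:.0f} positions per layer "
      f"({hybrid / full:.1%} of full)")
```

Under these assumptions the cache at 256K context is roughly a quarter of what full attention at every layer would hold. The real saving depends on the actual layer schedule and head configuration.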

The 11B active parameter count means each individual agentic call is cheap in compute terms. That matters for multi-step loops where the model is making rapid decisions rather than generating long outputs. It does not affect VRAM requirements: the weights for all 198B parameters must still load into GPU memory before any inference happens.

For broader context on how MoE models work at inference time, see the MoE inference optimization guide. For the VLM-specific architecture patterns, including ViT encoder sizing and visual token inflation, see the vision language models deployment guide.

GPU VRAM Requirements

The critical distinction for Step-3.7-Flash is the same as for every MoE VLM: "11B active" means per-forward-pass compute. VRAM footprint is determined by total parameter count (198B), not active count.

The 1.8B ViT encoder adds a fixed weight overhead on top of the language backbone. At BF16 that is roughly 3.6 GB; at FP8 it is about 1.8 GB. This overhead is already included in the totals below, but it is worth stating explicitly because text-only MoE models of similar size do not carry it.

Precision	Weight VRAM	Formula	Minimum GPU Config
BF16	~396 GB	198B x 2 bytes	Needs 4x H200 SXM5 (564 GB total)
FP8	~198 GB	198B x 1 byte	Needs 4x H100 SXM5 or 2x H200 SXM5
NVFP4	~99 GB	198B x 0.5 bytes	Fits 1x B200 SXM6; Blackwell only (unsupported on H100/H200)

For GPU selection, always add 15% framework overhead on top of the weight figures before comparing against VRAM capacity. BF16 weights at 396 GB become roughly 455 GB with overhead, which is beyond a 4x H100 SXM5 node (4 × 80 GB = 320 GB) and requires 4x H200 SXM5 at minimum.
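The sizing arithmetic above can be sketched in a few lines. This is a back-of-the-envelope calculator, not a measurement: it uses the nominal bytes-per-parameter figures from the table and the 15% rule of thumb, and the helper names are illustrative.

```python
PARAMS_BILLION = 198  # total parameters, including the 1.8B ViT encoder
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}
FRAMEWORK_OVERHEAD = 0.15  # the 15% rule of thumb from the text


def weight_gb(precision: str) -> float:
    """Weight footprint in GB: billions of parameters x bytes per parameter."""
    return PARAMS_BILLION * BYTES_PER_PARAM[precision]


def sizing_gb(precision: str) -> float:
    """Weights plus the 15% framework overhead used for GPU selection."""
    return weight_gb(precision) * (1 + FRAMEWORK_OVERHEAD)


for precision in BYTES_PER_PARAM:
    print(f"{precision:6s} weights {weight_gb(precision):6.1f} GB, "
          f"with overhead {sizing_gb(precision):6.1f} GB")
```

Compare the "with overhead" figure, not the raw weight figure, against the total VRAM of a candidate node.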

Here are the recommended configurations on Spheron with current pricing:

GPU Config	VRAM	Quantization	Max Context	On-Demand	Spot	Notes
4x H200 SXM5	564 GB	BF16	256K	$4.54/hr per GPU	$3.31/hr per GPU	Recommended for full-quality; ~168 GB KV after weights
2x H200 SXM5	282 GB	FP8	128K	$4.54/hr per GPU	$3.31/hr per GPU	Production sweet spot; ~84 GB for KV cache
4x H100 SXM5	320 GB	FP8	128K	$3.17/hr per GPU	$2.91/hr per GPU	Budget FP8; ~122 GB for KV cache
1x B200 SXM6	192 GB	NVFP4	128K	$3.70/hr	$5.34/hr (above on-demand)	Single-GPU option (Blackwell only); ~93 GB for KV cache
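The KV figures in the Notes column are total VRAM minus weights. The budget vLLM actually works with is smaller, because --gpu-memory-utilization caps how much of each GPU it will claim. The sketch below shows both views. The 0.90 value for the BF16 row is an assumption (this guide gives no BF16 launch command), and activation and framework memory are not modeled.

```python
def kv_headroom_gb(total_vram_gb: float, weights_gb: float,
                   gpu_memory_utilization: float = 1.0) -> float:
    """VRAM left for KV cache once weights are loaded.

    With utilization 1.0 this is the plain "VRAM minus weights" figure.
    A lower value models vLLM's --gpu-memory-utilization cap. Activation
    and framework memory are NOT subtracted here.
    """
    return total_vram_gb * gpu_memory_utilization - weights_gb


# (label, total VRAM GB, weight GB, --gpu-memory-utilization)
configs = [
    ("4x H200 SXM5, BF16", 564, 396, 0.90),  # utilization assumed
    ("2x H200 SXM5, FP8", 282, 198, 0.90),
    ("4x H100 SXM5, FP8", 320, 198, 0.90),
    ("1x B200 SXM6, NVFP4", 192, 99, 0.92),
]
for label, vram, weights, util in configs:
    raw = kv_headroom_gb(vram, weights)
    capped = kv_headroom_gb(vram, weights, util)
    print(f"{label:22s} raw {raw:6.1f} GB, at {util:.2f} utilization {capped:6.1f} GB")
```

Treat the capped number as an upper bound when choosing --max-model-len and --max-num-seqs.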

What does not work:

1x H100 SXM5 (80 GB): Too small even for NVFP4. The NVFP4 weight footprint is approximately 99 GB, which already exceeds 80 GB before any framework overhead or KV cache.
H100/H200 + NVFP4: The official NVFP4 checkpoint targets Blackwell hardware (B200/B300). Hopper-generation GPUs (H100, H200) are not supported for NVFP4. Use FP8 on Hopper.
1x B200 SXM6 + FP8: FP8 weights are roughly 198 GB and the B200 SXM6 has 192 GB VRAM. Weights alone exceed VRAM capacity before any framework overhead or KV cache. Use NVFP4 for single-GPU B200 deployment.
A100 80G: Ampere Tensor Cores have no native FP8 support, which the official FP8 checkpoint requires. NVFP4 serving on Ampere has not been validated by StepFun. Skip A100 for Step-3.7-Flash.

For the 2x H200 FP8 config, the H200 SXM5 on Spheron is the recommended entry point if you want 128K context in production. For the 4x H100 FP8 option, H100 SXM5 rental on Spheron provides the same FP8 quality at lower per-GPU cost.

Pricing fluctuates based on GPU availability. The prices above were captured on 22 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Deploy Step-3.7-Flash with vLLM on Spheron

Step 1: Choose GPU and quantization

Pick the configuration that fits your VRAM and budget:

2x H200 SXM5 + FP8: Recommended production default. 128K context, ~84 GB for KV cache, NVLink interconnect for efficient tensor parallelism.
4x H100 SXM5 + FP8: Same FP8 quality at lower per-GPU cost. More KV headroom than the 2x H200 setup for the same context window. H100 spot pricing is roughly 8% cheaper than on-demand right now.
1x B200 SXM6 + NVFP4: Best single-GPU option (Blackwell only). NVFP4 weights (~99 GB) fit comfortably in 192 GB with about 93 GB for KV cache and 128K context. NVFP4 is not supported on Hopper (H100, H200).

Step 2: Provision a Spheron GPU instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select your GPU tier and deploy with the PyTorch 2.5 / CUDA 12.4 base image. Attach at least 250 GB of persistent storage for model weights. Step-3.7-Flash FP8 weights are roughly 198 GB, so 250+ GB storage prevents re-downloading on instance restarts.

See docs.spheron.ai for SSH setup and instance configuration guidance.

Step 3: Install vLLM and download weights

bash

pip install 'vllm>=0.9.0'
pip install huggingface_hub hf_transfer

export HF_TOKEN=your_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

# Download the pre-quantized FP8 checkpoint (recommended for H100/H200 deployments)
huggingface-cli download stepfun-ai/Step-3.7-Flash-FP8

# Or for single-GPU B200 SXM6 NVFP4 deployment (Blackwell only)
huggingface-cli download stepfun-ai/Step-3.7-Flash-NVFP4

If vLLM does not natively recognize the model class, add --trust-remote-code to the serve command. For models released after the last stable vLLM build, check docs.vllm.ai for supported model status.

Step 4: Launch vLLM with hybrid attention configuration

2x H200 SXM5, FP8 (recommended production):

bash

vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --disable-cascade-attn \
  --tool-call-parser step3p5 \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' \
  --host 0.0.0.0 \
  --port 8000

1x B200 SXM6, NVFP4 (single-GPU option, Blackwell only):

bash

vllm serve stepfun-ai/Step-3.7-Flash-NVFP4 \
  --quantization modelopt \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.92 \
  --enable-chunked-prefill \
  --disable-cascade-attn \
  --host 0.0.0.0 \
  --port 8000

4x H100 SXM5, FP8 (budget option):

bash

vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 131072 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --disable-cascade-attn \
  --host 0.0.0.0 \
  --port 8000

Why --disable-cascade-attn: Step-3.7-Flash's hybrid sliding-window (512-token window) plus global attention at a 3:1 ratio is not compatible with cascade attention, so this flag is required for the model to run correctly. --enable-chunked-prefill is included as a general improvement for long-context prefilling: it splits long input sequences into fixed-size chunks, which bounds peak prefill memory and keeps decode latency steady when agentic loops pass large documents.

Why the agentic flags (recommended command only): --tool-call-parser step3p5 and --reasoning-parser step3p5 enable correct parsing of tool calls and reasoning outputs. The step3p5 identifier is vLLM's registered parser name for the full Step 3.x model family. Despite the "3.5" in the name, vLLM intentionally reuses this parser for both Step-3.5-Flash and Step-3.7-Flash (confirmed in the official vLLM recipes documentation for Step-3.7-Flash). --enable-auto-tool-choice allows the model to select tools automatically. --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' activates MTP-3 (3-way Multi-Token Prediction) for speculative decoding, which can meaningfully improve throughput on short agentic calls.

Note on NVFP4 hardware: The official StepFun NVFP4 checkpoint targets Blackwell hardware (B200/B300). NVFP4 is not supported on Hopper-generation GPUs (H100, H200). For single-GPU NVFP4 deployment, use B200 SXM6.
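The three launch commands above differ only in a few flags, so it can help to generate them from one place. The following is a minimal sketch of such a helper; `build_serve_command` and `COMMON_FLAGS` are hypothetical names, and the flags are exactly the ones used in the commands above.

```python
import shlex
from typing import List, Optional

# Flags shared by every configuration shown above.
COMMON_FLAGS = [
    "--max-model-len", "131072",
    "--kv-cache-dtype", "fp8_e5m2",
    "--enable-chunked-prefill",
    "--disable-cascade-attn",  # required for the hybrid SWA/GA layout
    "--host", "0.0.0.0",
    "--port", "8000",
]


def build_serve_command(model: str, tensor_parallel: int = 1,
                        gpu_memory_utilization: float = 0.90,
                        extra: Optional[List[str]] = None) -> List[str]:
    """Assemble a `vllm serve` argv list for one of the configs above."""
    cmd = ["vllm", "serve", model]
    if tensor_parallel > 1:
        # Multi-GPU configs above pair tensor parallelism with expert parallelism.
        cmd += ["--tensor-parallel-size", str(tensor_parallel),
                "--enable-expert-parallel"]
    cmd += ["--gpu-memory-utilization", f"{gpu_memory_utilization:.2f}"]
    cmd += COMMON_FLAGS
    cmd += extra or []
    return cmd


h200 = build_serve_command("stepfun-ai/Step-3.7-Flash-FP8", tensor_parallel=2)
b200 = build_serve_command("stepfun-ai/Step-3.7-Flash-NVFP4",
                           gpu_memory_utilization=0.92,
                           extra=["--quantization", "modelopt"])
print(shlex.join(h200))
print(shlex.join(b200))
```

Pass the agentic flags (tool-call parser, reasoning parser, speculative config) through `extra` when you need them, and hand the resulting list to `subprocess.run` or print it for a startup script.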

Step 5: Test the endpoint

Text completion (baseline check):

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepfun-ai/Step-3.7-Flash-FP8",
    "messages": [
      {
        "role": "user",
        "content": "Summarize the key differences between FP8 and NVFP4 quantization for LLM inference."
      }
    ],
    "max_tokens": 256,
    "temperature": 0.2
  }'

Multimodal vision request:

bash

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stepfun-ai/Step-3.7-Flash-FP8",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "Describe what you see in this image and extract any visible text."
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/240px-PNG_transparency_demonstration_1.png"
            }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'

A correct multimodal response returns a choices[0].message.content string describing the image. This confirms the ViT encoder is loaded and routing correctly through the language backbone.
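For agent code that calls the endpoint programmatically, the same multimodal request can be built with the Python standard library. This is a sketch: `build_vision_payload` and `post_chat` are illustrative helpers, the image URL is a placeholder, and the optional bearer token only matters if the server was started with --api-key.

```python
import json
import urllib.request
from typing import Optional


def build_vision_payload(model: str, prompt: str, image_url: str,
                         max_tokens: int = 256) -> dict:
    """Build the same text + image_url message structure as the curl example."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }


def post_chat(base_url: str, payload: dict, api_key: Optional[str] = None,
              timeout_s: float = 120) -> str:
    """POST to /v1/chat/completions and return choices[0].message.content."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_vision_payload(
    "stepfun-ai/Step-3.7-Flash-FP8",
    "Describe what you see in this image and extract any visible text.",
    "https://example.com/screenshot.png",  # placeholder URL
)
print(json.dumps(payload, indent=2))
# With a running server: print(post_chat("http://localhost:8000", payload))
```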

Step 6: Cloud-init startup script

For reproducible instance bootstrapping, use this startup script in your Spheron deployment configuration:

bash

#!/bin/bash
set -e

echo "--- Setting Up Environment ---"

sudo apt-get update -y
sudo apt-get install -y python3-venv

sudo python3 -m venv /opt/step_flash_venv
source /opt/step_flash_venv/bin/activate

pip install --upgrade pip
pip install 'vllm>=0.9.0' huggingface_hub hf_transfer

echo "--- Downloading Step-3.7-Flash FP8 weights ---"

export HF_TOKEN=your_hf_token_here
export HF_HUB_ENABLE_HF_TRANSFER=1

huggingface-cli download stepfun-ai/Step-3.7-Flash-FP8

echo "--- Launching vLLM Server ---"

nohup vllm serve stepfun-ai/Step-3.7-Flash-FP8 \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --max-model-len 131072 \
    --kv-cache-dtype fp8_e5m2 \
    --gpu-memory-utilization 0.90 \
    --enable-chunked-prefill \
    --disable-cascade-attn \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key "your_api_key_here" > /var/log/vllm.log 2>&1 &

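echo "--- Waiting for server to initialize (weights are already downloaded; this wait covers model load) ---"

# Poll /health, not /v1/models: with --api-key set, vLLM rejects unauthenticated
# requests to /v1/* routes, so an unauthenticated probe there would never succeed.
for i in {1..3600}; do
  if curl -sf "http://localhost:8000/health" > /dev/null; then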
    echo "vLLM server is ready!"
    break
  fi
  if [ $((i % 30)) -eq 0 ]; then
    echo "Still waiting... ($i seconds elapsed)"
  fi
  sleep 1
done

curl -sf http://localhost:8000/health > /dev/null || { echo 'ERROR: vLLM failed to start'; exit 1; }
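An agent controller can apply the same readiness gate before it sends traffic. The sketch below mirrors the polling loop in Python. `health_probe` and `wait_until_ready` are illustrative names, and the probe is injected so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 3600,
                     interval_s: float = 5,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        sleep(interval_s)
    return False


# Typical use in a controller:
#   if not wait_until_ready(health_probe):
#       raise RuntimeError("vLLM failed to start")
```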

Agentic Workload Patterns

Step-3.7-Flash's design targets three agentic use cases specifically.

Concurrent coding agents

Multiple agent workers each make rapid, short calls. The 11B active parameter count means per-call compute is close to that of an 11B dense model, even though the weights for all 198B parameters sit in VRAM. At the reported ~400 tok/s with short 50-200 token outputs, you can serve many concurrent agent workers from a single GPU node.

Tune --max-num-seqs based on your target concurrency. Start at 16 and increase while monitoring GPU memory utilization. For coding agents that submit patches in parallel, 32-64 concurrent sequences on a 2x H200 FP8 setup is a reasonable ceiling before KV cache starts to saturate.

Multi-step search and document parsing loops

The 256K context window lets agentic loops ingest large documents without chunking. Each image passed to the ViT encoder produces visual tokens that occupy KV cache space. For loops that pass a new screenshot or document scan on each iteration, KV cache fills faster than in text-only deployments at the same max-model-len.

Reduce --max-num-seqs compared to text-only serving at the same VRAM budget. The visual token inflation formula from the VLM deployment guide referenced above applies directly: each 1024px image adds roughly 1024 visual tokens to the per-request KV budget.
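A rough way to see how images eat into concurrency is to count tokens per request. In the sketch below, the 1024-tokens-per-image figure is the rule of thumb above, while the one-million-token KV budget and the 4,000-token text prompt are purely hypothetical placeholders. Read the real budget from vLLM's startup log on your own node.

```python
TOKENS_PER_IMAGE = 1024  # rule of thumb from the text for a 1024px image


def request_tokens(text_tokens: int, images: int) -> int:
    """KV-cache token positions one request occupies (text plus visual tokens)."""
    return text_tokens + images * TOKENS_PER_IMAGE


def max_concurrent_requests(kv_token_budget: int, text_tokens: int,
                            images: int) -> int:
    """How many such requests fit in the KV cache at the same time."""
    return kv_token_budget // request_tokens(text_tokens, images)


KV_TOKEN_BUDGET = 1_000_000  # hypothetical; depends on GPU, dtype, and config
for images in (0, 1, 4):
    fit = max_concurrent_requests(KV_TOKEN_BUDGET, text_tokens=4000, images=images)
    print(f"{images} image(s) per request -> up to {fit} concurrent requests")
```

This is a simplification that treats every cached position as equally expensive, but it gives a starting point for --max-num-seqs before you measure real utilization.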

For multi-step workflows that need durable retry logic across agent calls, see the AI agent workflow orchestration guide for how Temporal and Inngest handle failure recovery without re-running expensive inference steps.

Vision-first agentic perception

Screenshots, UI parsing, chart understanding, and document OCR serve as primary inputs in an agentic loop. Step-3.7-Flash's native 1.8B ViT encoder handles this without a separate vision model. In agentic RAG pipelines (see the agentic RAG guide), where Step-3.7-Flash processes retrieved document images alongside text chunks, this eliminates the need to run a separate document parsing model alongside the LLM.

Cost per vision request comparison: at $3.70/hr for a single B200 SXM6 NVFP4 serving at 300 tok/s with 200-token outputs, each request occupies about 0.67 s of GPU time and costs approximately $0.0007. GPT-4o Vision API charges per image token in addition to text tokens; at typical screenshot resolution, API cost per request runs $0.05-0.15 depending on image size. Self-hosting Step-3.7-Flash becomes cost-effective at moderate agentic request volumes.
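The per-request figure follows from simple arithmetic, sketched here. It assumes the instance serves one request at a time and ignores prefill and ViT encoding time, so treat it as a rough estimate rather than a measured cost.

```python
def cost_per_request(hourly_rate_usd: float, tokens_per_second: float,
                     output_tokens: int) -> float:
    """GPU cost of generating `output_tokens` at a steady decode rate."""
    seconds = output_tokens / tokens_per_second
    return hourly_rate_usd / 3600 * seconds


cost = cost_per_request(hourly_rate_usd=3.70, tokens_per_second=300,
                        output_tokens=200)
print(f"${cost:.4f} per request")
```

Batching several requests on the same GPU lowers the effective per-request cost further, since the hourly rate is shared.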

See the Spheron docs for SSH setup and multi-instance configuration guidance.

Throughput, Cost Per Million Tokens, and Benchmarks

Throughput figures for Step-3.7-Flash are not yet independently benchmarked across all configurations. The estimates below are derived from the reported ~400 tok/s figure and comparable 198B MoE architectures. Validate on your Spheron instance before making capacity decisions.

Cost per million tokens formula: (node $/hr / 3600) / (tok_s / 1_000_000), where node $/hr is the per-GPU rate multiplied by the number of GPUs in the configuration.

GPU Config	Context	Throughput (tok/s, est.)	TTFT (ms, est.)	$/M tokens (est.)
2x H200 SXM5, FP8	8K	~350-450	~100-200	~$6.30 (at $4.54/hr per GPU)
4x H100 SXM5, FP8	8K	~300-400	~130-250	~$10.06 (at $3.17/hr per GPU)
1x B200 SXM6, NVFP4	8K	~400-600	~90-180	~$2.06 (at $3.70/hr)

The $/M tokens estimates assume the midpoint of the throughput range. Vision requests with image inputs will reduce effective throughput because ViT encoding adds latency before the first text token is generated.
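The table values can be reproduced with the formula and the midpoint throughput for each row. A minimal sketch, with `cost_per_million_tokens` as an illustrative helper name:

```python
def cost_per_million_tokens(per_gpu_rate_usd: float, gpus: int,
                            tokens_per_second: float) -> float:
    """(node $/hr / 3600) / (tok_s / 1_000_000)."""
    node_rate = per_gpu_rate_usd * gpus
    return (node_rate / 3600) / (tokens_per_second / 1_000_000)


# (label, per-GPU on-demand rate, GPU count, midpoint of estimated tok/s range)
rows = [
    ("2x H200 SXM5, FP8", 4.54, 2, 400),
    ("4x H100 SXM5, FP8", 3.17, 4, 350),
    ("1x B200 SXM6, NVFP4", 3.70, 1, 500),
]
for label, rate, gpus, tok_s in rows:
    print(f"{label:22s} ~${cost_per_million_tokens(rate, gpus, tok_s):.2f} per M tokens")
```

The results agree with the table to within a cent of rounding. Swap in spot rates or your own measured throughput to update the estimates.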

For reference: GPT-4o Vision API pricing is $5-15/M tokens depending on image size and output length. At 2x H200 FP8 on-demand pricing, Step-3.7-Flash runs at an estimated $6/M tokens, before optimizations such as spot pricing and batching.

Spot vs On-Demand for Agentic Workloads

Agentic calls against Step-3.7-Flash are short, typically 50-300 output tokens per step, and the 11B active parameter design keeps each one fast. When a spot instance is reclaimed mid-generation, only the current call's output is lost, not the full agent session. An agent controller with retry logic (re-submitting the interrupted call) recovers in seconds rather than minutes, provided another replica or a replacement instance is already serving.
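The retry logic on the controller side can be as small as the sketch below. `call_with_retry` is a hypothetical helper. The exception types it retries on are an assumption, so adjust them to whatever your HTTP client raises when a connection drops.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T], attempts: int = 3,
                    base_delay_s: float = 1.0,
                    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Re-submit an interrupted call with exponential backoff.

    Only the failed call is repeated; agent session state lives in the
    controller and is untouched by a dropped request.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")
```

Wrap each model call, for example `call_with_retry(lambda: send_step(prompt))` where `send_step` is your own request function, so that a reclaimed instance costs one repeated step rather than the whole session.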

Short, retryable calls make Step-3.7-Flash a good candidate for spot pricing on development workloads and batch agentic jobs. For a 4x H100 SXM5 cluster running FP8:

Mode	Per-GPU Rate	4x GPU Total	Daily Cost (8 hr)
On-demand	$3.17/hr	$12.68/hr	~$101.44
Spot	$2.91/hr	$11.64/hr	~$93.12

For H100 SXM5, spot at $2.91/hr is roughly 8% cheaper than on-demand at $3.17/hr, making spot worthwhile for workloads that can tolerate brief interruptions. For H200 SXM5, spot at $3.31/hr vs on-demand at $4.54/hr is a roughly 27% savings, making spot the right choice for development and batch inference jobs where brief interruptions are acceptable.
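The percentages and daily totals above come from the following arithmetic, shown as a short sketch so you can plug in current rates. The helper names are illustrative.

```python
def spot_savings(on_demand_rate: float, spot_rate: float) -> float:
    """Fractional saving of spot relative to on-demand."""
    return (on_demand_rate - spot_rate) / on_demand_rate


def daily_cost(per_gpu_rate: float, gpus: int, hours: float = 8) -> float:
    """Cost of running `gpus` GPUs for `hours` hours."""
    return per_gpu_rate * gpus * hours


print(f"H100 SXM5 spot saving: {spot_savings(3.17, 2.91):.1%}")
print(f"H200 SXM5 spot saving: {spot_savings(4.54, 3.31):.1%}")
print(f"4x H100 on-demand, 8 hr: ${daily_cost(3.17, 4):.2f}")
print(f"4x H100 spot, 8 hr:      ${daily_cost(2.91, 4):.2f}")
```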

Production Checklist

Before routing real agent traffic to Step-3.7-Flash:

Enable streaming. Agent frameworks expect stream: true responses. Without streaming, the agent controller waits for the full completion before acting on output, which breaks the responsiveness of multi-step loops.
Cap concurrent sequences. Set --max-num-seqs 16 or lower for vision-heavy agentic workloads. Each image in a request inflates KV cache significantly; more concurrent sequences multiplies this effect.
Mount persistent storage for weights. FP8 weights are roughly 198 GB. Re-downloading on every instance restart adds 20-30 minutes of downtime. Mount a persistent volume at the HuggingFace cache path.
Monitor KV cache utilization. vLLM exposes Prometheus metrics at /metrics on the serving port; scrape that endpoint and alert when vllm:gpu_cache_usage_perc rises above 85% (the gauge is reported as a 0-1 fraction, so 0.85). Vision requests fill the KV cache faster than text-only requests at the same sequence length.
Pin your vLLM version. Step-3.7-Flash was released in late May 2026. If native model class support is not yet in the stable release, pin the version that works and document it in your startup script.
Use the correct repo slugs. The confirmed HuggingFace slugs are stepfun-ai/Step-3.7-Flash-FP8 and stepfun-ai/Step-3.7-Flash-NVFP4 for the pre-quantized checkpoints, and stepfun-ai/Step-3.7-Flash for the BF16 base. Verify at huggingface.co/stepfun-ai before deploying.
License review. Confirm the license terms for Step-3.7-Flash at the official HuggingFace model page before commercial production deployment.
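For the KV cache alert in the checklist above, a full Prometheus stack is not strictly needed to get started. The sketch below parses the text returned by the /metrics endpoint and flags high usage. The sample text is illustrative rather than captured from a real server, and the parser assumes metric lines carry no trailing timestamp.

```python
from typing import List

KV_METRIC = "vllm:gpu_cache_usage_perc"
ALERT_THRESHOLD = 0.85  # 85%, expressed as a 0-1 fraction


def kv_cache_usage(metrics_text: str, metric: str = KV_METRIC) -> List[float]:
    """Extract every sample of `metric` from Prometheus text-format output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            values.append(float(line.rsplit(" ", 1)[1]))
    return values


# Illustrative sample of what GET /metrics might return.
sample = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="stepfun-ai/Step-3.7-Flash-FP8"} 0.91
vllm:num_requests_running{model_name="stepfun-ai/Step-3.7-Flash-FP8"} 12.0
"""
usage = kv_cache_usage(sample)
alerts = [value for value in usage if value > ALERT_THRESHOLD]
print(f"KV cache usage: {usage}, above threshold: {alerts}")
```

In production, the same threshold belongs in a Prometheus alert rule; this script is only a quick check you can run from a cron job or a notebook.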

Step-3.7-Flash's 11B active-parameter design makes it one of the more cost-efficient ways to run a 256K-context vision-language model for agentic pipelines. FP8 on a 2x H200 or 4x H100 cluster covers most production footprints, and a single B200 SXM6 with NVFP4 is the recommended path for single-GPU development and lower-concurrency workloads.

Quick Setup Guide

Calculate VRAM and choose a quantization checkpoint
Decide on your quantization level based on available GPU VRAM. BF16 needs approximately 396 GB of GPU memory (weight only), requiring at least 4x H200 SXM5. FP8 needs roughly 198 GB, fitting on 4x H100 SXM5 or 2x H200 SXM5. NVFP4 needs about 99 GB and fits on a single B200 SXM6 (Blackwell only; not supported on H100/H200). For production, FP8 on a 2x H200 setup is the recommended balance of quality, cost, and VRAM headroom.
Provision a GPU instance on Spheron
Log in at app.spheron.ai, navigate to GPU Cloud, and select your GPU tier. For 2x H200 FP8, select a 2-GPU H200 SXM5 bundle. For 4x H100 FP8, select a 4-GPU H100 SXM5 bundle. Attach at least 250 GB of persistent storage for model weights. Enable spot pricing where available for hourly savings. See docs.spheron.ai for SSH setup and instance configuration guidance.
Install vLLM and download Step-3.7-Flash weights
Run pip install 'vllm>=0.9.0' huggingface_hub hf_transfer. Export your HuggingFace token as HF_TOKEN and enable fast transfers with HF_HUB_ENABLE_HF_TRANSFER=1. Download the pre-quantized FP8 checkpoint: huggingface-cli download stepfun-ai/Step-3.7-Flash-FP8. For single-GPU B200 NVFP4 deployment use: huggingface-cli download stepfun-ai/Step-3.7-Flash-NVFP4.
Launch the vLLM inference server with hybrid attention config
For 2x H200 SXM5 with FP8: vllm serve stepfun-ai/Step-3.7-Flash-FP8 --tensor-parallel-size 2 --enable-expert-parallel --max-model-len 131072 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --enable-chunked-prefill --disable-cascade-attn --host 0.0.0.0 --port 8000. The --disable-cascade-attn flag is required because Step-3.7-Flash's hybrid sliding-window (512-token window) plus global attention at a 3:1 ratio is not compatible with cascade attention. --enable-chunked-prefill handles long-context prefilling efficiently for agentic loops.
Test the OpenAI-compatible endpoint with a vision request
Send a multimodal POST request to http://localhost:8000/v1/chat/completions with a messages array that includes a content field of type array containing both a text item and an image_url item. Verify the model returns a text response describing the image content. Also test a plain text completion to confirm basic serving is working before routing vision traffic.
Configure for agentic workloads
Set --max-num-seqs lower than you would for text-only serving. Each agentic call that includes an image adds visual tokens to the KV cache, which fills faster than text-only workloads at the same max-model-len. Start with --max-num-seqs 16 and increase only after measuring actual KV cache utilization with vLLM's Prometheus metrics endpoint (/metrics). For concurrent coding agents making rapid short calls, a higher --max-num-seqs with a lower --max-model-len often yields better overall throughput.

Frequently Asked Questions

How much VRAM does Step-3.7-Flash need?

Step-3.7-Flash has 198B total parameters (196B language backbone plus a 1.8B ViT encoder), all of which must reside in VRAM regardless of the 11B active parameter count. At BF16, weights alone consume roughly 396 GB, requiring at least 4x H200 SXM5. At FP8 the weight footprint drops to around 198 GB, fitting on 4x H100 SXM5 (320 GB) or 2x H200 SXM5 (282 GB). The NVFP4 checkpoint brings weights down to about 99 GB, which fits on a single B200 SXM6 (192 GB, Blackwell only) with roughly 93 GB left for KV cache.

Can I run Step-3.7-Flash on a single GPU?

Yes, on a B200 SXM6 with the NVFP4 checkpoint. NVFP4 weights are approximately 99 GB, leaving about 93 GB of the B200's 192 GB for KV cache. The official NVFP4 checkpoint targets Blackwell hardware (B200/B300) and is not supported on Hopper-generation GPUs (H100, H200). For single-GPU deployment, the 1x B200 SXM6 NVFP4 config with --max-model-len 128K is the recommended path.

What is the difference between Step-3.7-Flash's FP8 and NVFP4 checkpoints?

FP8 uses one byte per parameter, cutting the 198B weight footprint to about 198 GB with minimal quality loss over BF16. FP8 runs on Hopper (H100/H200) and Blackwell (B200/B300) Tensor Cores. NVFP4 uses half a byte per parameter (roughly 99 GB), enabling single-GPU B200 deployment with about 93 GB left for KV cache. The official NVFP4 checkpoint targets Blackwell hardware only; H100 and H200 (Hopper generation) do not support it. StepFun ships dedicated pre-quantized checkpoints: stepfun-ai/Step-3.7-Flash-FP8 and stepfun-ai/Step-3.7-Flash-NVFP4.

Does vLLM support Step-3.7-Flash's hybrid sliding-window attention?

Step-3.7-Flash uses hybrid sliding-window (512-token window) plus global attention at a 3:1 ratio. vLLM requires --disable-cascade-attn because the hybrid SWA/GA schedule is incompatible with cascade attention. --enable-chunked-prefill is also recommended for efficient long-context prefilling. Native model class support depends on the vLLM release; check docs.vllm.ai for the current supported model list and add --trust-remote-code as a fallback if the model class is not yet integrated.

Is Step-3.7-Flash suitable for agentic pipelines?

Yes. Step-3.7-Flash is specifically designed for agentic use cases: its 11B active parameters mean each individual LLM call is fast despite the 198B total weight footprint, and its native vision encoder handles document parsing, screenshot analysis, and chart understanding without a separate model. The 256K context window supports long conversation histories and multi-document ingestion in multi-step loops. For agentic workloads with repeated image inputs, plan for visual token inflation in the KV cache and reduce --max-num-seqs compared to text-only serving at the same VRAM budget.