Classical forecasting at scale is expensive and brittle. Maintaining separate ARIMA or Prophet models for thousands of SKUs requires constant retraining pipelines and per-series tuning. Time series foundation models change this: Chronos, Moirai, TimesFM, and Lag-Llama deliver zero-shot forecasts on series they have never seen, with no task-specific training. As of 2026, this approach is production-viable. Model accuracy on standard benchmarks (GIFT-Eval, LOTSA) now matches or exceeds domain-specific models, tooling has matured, and GPU cloud pricing makes batch forecasting jobs genuinely economical. This guide covers GPU setup end-to-end: hardware requirements, serving configs, throughput benchmarks, and a cost comparison against AWS Forecast. For batch serving patterns that apply equally here, see batch LLM inference on GPU cloud.
What Are Time Series Foundation Models?
Classical approaches each have a tax. ARIMA requires per-series stationarity analysis and manual lag selection. Prophet handles seasonality well but needs holiday and regressor configuration for every series. XGBoost and LightGBM can generalize across series but require significant feature engineering: lag features, rolling statistics, calendar encodings, and series metadata.
Time series foundation models take a different path. They are pretrained on large, diverse collections of real-world and synthetic time series, from household electricity consumption to hospital admissions to financial markets. During pretraining, the model learns general patterns: seasonality at multiple frequencies, trend changes, noise levels, and distributional shapes. At inference time, you pass in a raw context window and request a forecast. No training required. No feature engineering. No stationarity tests.
Why 2026 is the inflection point:
- Benchmark quality: GIFT-Eval and LOTSA benchmarks show zero-shot foundation models matching or beating per-series statistical models across most standard categories.
- Tooling maturity: Hugging Face integration, BentoML support, and stable Python APIs make deployment straightforward.
- GPU cloud economics: L40S on-demand instances at $0.72/hr make overnight batch forecasting jobs cheaper per million predictions than any managed forecasting API.
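The economics claim is easy to sanity-check with back-of-envelope arithmetic. A small helper using the L40S figures from the benchmark table later in this guide ($0.72/hr on-demand, ~420 forecasts/sec for Chronos-T5-Large):

```python
def cost_per_million_forecasts(price_per_hour: float, forecasts_per_sec: float) -> float:
    """Dollar cost to produce one million forecasts at a given sustained throughput."""
    hours_needed = 1_000_000 / forecasts_per_sec / 3600
    return price_per_hour * hours_needed

# L40S on-demand: $0.72/hr at ~420 forecasts/sec
print(round(cost_per_million_forecasts(0.72, 420), 2))  # ~0.48
```

At roughly $0.48 per million forecasts, an overnight batch job costs orders of magnitude less than per-series managed-API pricing.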
Model Comparison: Chronos, Moirai, TimesFM, Lag-Llama
| Model | Org | Architecture | Parameters | Context Length | Output | License |
|---|---|---|---|---|---|---|
| Chronos-T5 | Amazon | T5 encoder-decoder | 46M / 200M / 710M | Up to 512 time steps | Quantile distribution | Apache 2.0 |
| Chronos-Bolt | Amazon | Distilled T5 | 9M / 21M / 48M / 205M | Up to 512 time steps | Quantile distribution | Apache 2.0 |
| Moirai | Salesforce | Encoder-decoder (Universal TS Transformer) | 14M / 91M / 311M | Up to 5000 time steps | Distributional (Normal, NegBinomial, StudentT) | CC BY-NC 4.0 |
| TimesFM | Google DeepMind | Decoder-only Transformer | 200M / 500M | Up to 512 time points (1.0) / 2048 time points (2.0) | Point + quantile | Apache 2.0 |
| Lag-Llama | Various | Llama decoder | 2.5M | Configurable (lag features) | Probabilistic (Student-T) | Apache 2.0 |
Chronos-T5 uses a token vocabulary approach: it discretizes time series values into bins and treats forecasting as a sequence-to-sequence language task. This gives it broad zero-shot accuracy across diverse domains and clean probabilistic output via Monte Carlo sampling. Chronos-Bolt is a distilled version that is up to 250x faster and 20x more memory-efficient than Chronos-T5, making it the better default for latency-sensitive or resource-constrained deployments.
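The scale-then-bin idea can be illustrated in a few lines of NumPy. This is a toy sketch of the concept only, not Chronos's actual tokenizer (which uses its own scaling scheme and a learned vocabulary):

```python
import numpy as np

def tokenize(series: np.ndarray, n_bins: int = 4096, clip: float = 15.0) -> np.ndarray:
    """Scale a series by its mean absolute value, then bucket values into uniform bins."""
    scale = np.abs(series).mean() or 1.0   # avoid division by zero on all-zero series
    scaled = np.clip(series / scale, -clip, clip)
    # Map [-clip, clip] onto integer token IDs in [0, n_bins - 1]
    edges = np.linspace(-clip, clip, n_bins + 1)
    return np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)

tokens = tokenize(np.array([1.0, 2.0, 3.0, 100.0]))
# Forecasting then becomes next-token prediction over this vocabulary;
# sampled tokens are mapped back to real values by inverting the scaling.
```

Because larger values land in higher bins, the token sequence preserves the ordering and shape of the original series while fitting a fixed vocabulary.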
Moirai from Salesforce uses a patch-based universal encoder that handles multivariate series and irregular timestamps natively. It is the right choice when you have covariates (promotions, weather, economic indicators) or when your data has variable sampling rates like IoT sensor streams.
TimesFM from Google DeepMind uses a decoder-only transformer, which means autoregressive generation with no encoder pass. This makes it the fastest architecture for online inference where latency matters. TimesFM-2.0 extends the context length to 2048 time points and improves accuracy on longer horizons.
Lag-Llama uses the Llama backbone with lag features as input representations. Because the architecture is identical to LLMs you likely already fine-tune, LoRA and PEFT tooling applies directly. If you have 10,000+ proprietary series and need domain adaptation, Lag-Llama is the easiest to fine-tune.
GPU Hardware Requirements
VRAM estimates use the fp16 rule of thumb (2 bytes per parameter) with a 1.5x overhead multiplier for KV cache and activations.
| Model Variant | Parameters | Min VRAM (fp16) | Recommended GPU | Max Batch Size (recommended GPU) |
|---|---|---|---|---|
| Chronos-Bolt-Small | 9M | ~0.1 GB | Any GPU | 512 |
| Chronos-T5-Small | 46M | ~0.5 GB | Any GPU with 4 GB+ | 512 |
| Chronos-T5-Base | 200M | ~1 GB | Any GPU with 8 GB+ | 256 |
| Chronos-T5-Large | 710M | ~2.5 GB | L40S / A100 | 128 |
| Moirai-Small | 14M | ~0.3 GB | Any GPU | 512 |
| Moirai-Base | 91M | ~0.8 GB | Any GPU with 4 GB+ | 256 |
| Moirai-Large | 311M | ~1.5 GB | L40S / A100 | 128 |
| TimesFM-200M | 200M | ~1 GB | Any GPU with 4 GB+ | 256 |
| TimesFM-500M | 500M | ~2 GB | L40S / A100 | 128 |
| Lag-Llama-2.5M | 2.5M | <0.1 GB | Any GPU with 4 GB+ | 512 |
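The fp16 rule of thumb behind the table can be written as a one-line estimator. It gives a lower bound; the table's figures round up further to leave safety margin for fixed framework overheads:

```python
def min_vram_gb(params: float, bytes_per_param: int = 2, overhead: float = 1.5) -> float:
    """fp16 rule of thumb: 2 bytes per parameter with a 1.5x multiplier
    for KV cache and activations."""
    return params * bytes_per_param * overhead / 1e9

# Chronos-T5-Large (710M params): ~2.13 GB by the rule; the table lists ~2.5 GB
print(round(min_vram_gb(710e6), 2))
```

The takeaway: even the largest variants in the table are tiny by LLM standards, so VRAM is rarely the binding constraint — batch size and context length are.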
The L40S 48GB and A100 80GB are the sweet spot for production time series workloads. With 48 GB and 80 GB of VRAM respectively, you can run all Chronos sizes simultaneously (Small + Base + Large totals ~4 GB) from a single instance, which matters for tiered SLA designs where you want to route simple forecasts to a smaller model and complex requests to Chronos-T5-Large. The A100 80GB also gives you headroom for Moirai-Large with large batch sizes and long context windows. See L40S vs A100 for a deeper GPU comparison including bandwidth and compute tradeoffs.
Time series foundation models are also a natural fit for GPU cloud requirements planning: Chronos-T5-Large (710M) tops out at ~2.5 GB VRAM, meaning even the largest Chronos variant sits comfortably in the L40S tier without requiring H100-class hardware.
Production Deployment: BentoML and Triton
BentoML Serving (Recommended for Most Teams)
BentoML wraps your model in a production-grade API with built-in batching, health checks, and Prometheus metrics. For Chronos:
```python
import asyncio
import torch
import bentoml
from chronos import ChronosPipeline
from typing import List


@bentoml.service(
    resources={"gpu": 1},
    traffic={"timeout": 30},
)
class ChronosService:
    def __init__(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = ChronosPipeline.from_pretrained(
            "amazon/chronos-t5-large",
            device_map=device,
            torch_dtype=torch.bfloat16,
        )

    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=5000)
    async def predict(
        self,
        context: List[List[float]],
        prediction_length: int = 24,
        num_samples: int = 100,
    ) -> dict:
        tensors = [torch.tensor(s, dtype=torch.float32) for s in context]
        # Run the blocking predict call off the event loop thread
        forecast = await asyncio.to_thread(
            self.pipeline.predict,
            tensors,
            prediction_length=prediction_length,
            num_samples=num_samples,
        )
        return {
            "p10": forecast.quantile(0.1, dim=1).tolist(),
            "p50": forecast.quantile(0.5, dim=1).tolist(),
            "p90": forecast.quantile(0.9, dim=1).tolist(),
        }
```

Build and serve:
```bash
pip install chronos-forecasting bentoml torch

# Start dev server
bentoml serve service:ChronosService

# Build container for deployment
bentoml build
bentoml containerize chronos_service:latest
```

The service exposes a POST endpoint at `/predict`. BentoML collates incoming requests into batches of up to `max_batch_size=64`, or until `max_latency_ms=5000` elapses, whichever comes first.
NVIDIA Triton Inference Server (For Maximum Throughput)
Triton with dynamic batching is the better choice when you need sustained high-QPS serving and want fine-grained control over batching latency. Export Chronos to ONNX with Optimum, which captures the full encoder-decoder generation loop. A plain `torch.onnx.export` on `model.forward()` traces only a single decoder step: the exported graph returns logits for one token, not a complete forecast. Generating a 24-step forecast requires running the decoder 24 times autoregressively, appending the argmax token to `decoder_input_ids` at each step, and that loop is absent from a traced graph. Use Optimum instead:
```python
# pip install optimum[onnxruntime-gpu]
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Exports encoder, decoder, and decoder_with_past as separate ONNX graphs,
# preserving the full autoregressive generation loop.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "amazon/chronos-t5-large",
    export=True,
)
model.save_pretrained("chronos_large_onnx/")
# Produces:
#   chronos_large_onnx/encoder_model.onnx
#   chronos_large_onnx/decoder_model.onnx
#   chronos_large_onnx/decoder_with_past_model.onnx
```

Seq2seq models require a Triton ensemble that orchestrates the three ONNX graphs (encoder, decoder, decoder_with_past). The `config.pbtxt` for the encoder model:
```protobuf
name: "chronos_encoder"
backend: "onnxruntime"
max_batch_size: 128
dynamic_batching {
  max_queue_delay_microseconds: 1000
  preferred_batch_size: [32, 64, 128]
}
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [-1] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [-1] }
]
output [
  { name: "last_hidden_state", data_type: TYPE_FP32, dims: [-1, -1] }
]
```

The decoder and autoregressive loop are orchestrated by a Triton BLS (Business Logic Scripting) Python model that calls `encoder_model`, then iterates `decoder_with_past_model` for each forecast step. For teams without Triton BLS experience, BentoML is the simpler path: it calls `pipeline.predict` directly, which handles the full generation loop in Python before returning results.
Use Triton over BentoML when you are sustaining more than 500 requests/second and need dynamic batching overhead below 1ms. BentoML wins for teams that want simpler operations, built-in Python service logic, and faster iteration.
Batched Probabilistic Forecasting
Probabilistic output is what separates foundation models from point forecasters. Instead of a single predicted value at each horizon step, Chronos and Moirai return a distribution: you get P10, P50, and P90 quantiles, which map to "optimistic", "median", and "pessimistic" scenarios for planning purposes.
For Chronos, num_samples controls the number of Monte Carlo draws used to estimate quantiles. Higher values give more accurate quantile estimates at the cost of compute:
```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

# Batch of 32 series, 168 historical steps each
context = [torch.randn(168) for _ in range(32)]

forecast = pipeline.predict(
    context,
    prediction_length=24,
    num_samples=100,
)

p10 = forecast.quantile(0.1, dim=1)  # (32, 24) - lower bound
p50 = forecast.quantile(0.5, dim=1)  # (32, 24) - median
p90 = forecast.quantile(0.9, dim=1)  # (32, 24) - upper bound

# Compute quantile (pinball) loss on a held-out validation set
def quantile_loss(q, y_true, y_pred_q):
    errors = y_true - y_pred_q
    return torch.mean(torch.max(q * errors, (q - 1) * errors))

val_actuals = torch.randn(32, 24)  # (32, 24) ground truth values
ql_p50 = quantile_loss(0.5, val_actuals, p50)
ql_p90 = quantile_loss(0.9, val_actuals, p90)
```

For Moirai, the output distribution type depends on the data domain. Moirai automatically selects between Normal, NegativeBinomial (for count data like sales), and StudentT (for heavy-tailed distributions).
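Before trusting quantile forecasts for planning, it is worth checking empirical coverage: the fraction of held-out actuals falling inside the P10-P90 band, which should sit near 0.8 for a well-calibrated model. A minimal sketch with NumPy arrays standing in for real quantile forecasts:

```python
import numpy as np

def interval_coverage(actuals, p10, p90) -> float:
    """Fraction of actual values that fall inside the [P10, P90] band."""
    inside = (actuals >= p10) & (actuals <= p90)
    return float(inside.mean())

rng = np.random.default_rng(0)
actuals = rng.normal(size=(32, 24))
# Perfectly calibrated N(0, 1) quantiles for illustration: z_0.1 ≈ -1.2816
cov = interval_coverage(actuals, np.full((32, 24), -1.2816), np.full((32, 24), 1.2816))
# cov should land near 0.8; large deviations indicate miscalibration
```

Coverage well below 0.8 means the intervals are too narrow (overconfident); well above means they are too wide to be useful for inventory or capacity decisions.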
Latency and Throughput Benchmarks
Approximate throughput for Chronos-T5-Large at batch_size=32, prediction_length=24, num_samples=100 on Spheron GPU instances.
| GPU | On-demand price/hr | Spot price/hr | Forecasts/sec (approx.) | Cost per 1M forecasts |
|---|---|---|---|---|
| L40S 48GB | $0.72 | N/A | ~420 | ~$0.48 (on-demand) |
| A100 80GB | $1.04 | N/A (limited availability) | ~380 | ~$0.76 (on-demand) |
| H100 80GB PCIe | $7.91 | N/A | ~650 | ~$3.38 (on-demand) |
Throughput figures are approximate estimates based on parameter-count scaling and published benchmarks. Actual numbers vary with context length, num_samples, and driver versions. Profile with your own series lengths before capacity planning. A100 80GB spot capacity is currently constrained on Spheron; on-demand is the recommended tier for A100 workloads.
Pricing fluctuates with GPU availability. The prices above were checked on 11 May 2026 and may have changed; check current GPU pricing for live rates.
The L40S is the most cost-effective tier for batch time series forecasting. Its 48 GB VRAM fits all Chronos and TimesFM variants with room for batch_size=128+. For mixed workloads where you run both LLM inference and time series forecasting on the same instance, the A100 80GB gives more headroom for concurrency. The H100 delivers highest throughput per GPU but at a price point that only makes sense for latency-sensitive real-time serving.
For context on how these GPU rates compare across providers, see the GPU cloud pricing comparison 2026.
Real-World Use Cases
Retail demand planning. Run Chronos-T5-Large overnight across 50,000 SKUs, forecasting 28-day demand for inventory positioning. At ~420 forecasts/sec on an L40S, the full 50K-SKU batch completes in roughly 2 minutes. AWS Forecast at $0.60 per 1,000 forecasted time series would cost around $30 for the same 50K series. The GPU job costs under $0.03.
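The 50K-SKU figures fall directly out of the throughput and pricing numbers above; a quick sketch of the arithmetic:

```python
SKUS = 50_000
THROUGHPUT = 420          # forecasts/sec, Chronos-T5-Large on L40S
PRICE_PER_HOUR = 0.72     # L40S on-demand, $/hr
AWS_PER_1K = 0.60         # AWS Forecast, $ per 1K forecasted series

runtime_s = SKUS / THROUGHPUT                 # ~119 s, roughly 2 minutes
gpu_cost = runtime_s / 3600 * PRICE_PER_HOUR  # ~$0.024
aws_cost = SKUS / 1000 * AWS_PER_1K           # $30 for the same batch

print(f"{runtime_s:.0f}s runtime, ${gpu_cost:.3f} GPU vs ${aws_cost:.2f} managed")
```

The same template lets you plug in your own SKU count and measured throughput before committing to a nightly schedule.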
Cloud capacity forecasting. Feed CPU and memory utilization metrics into TimesFM to predict utilization spikes 24-48 hours ahead. Autoscaling systems can pre-warm instances before traffic arrives instead of reacting after the spike. TimesFM's decoder-only architecture makes it fast enough for sub-minute refresh cycles on a single GPU.
IoT sensor anomaly detection. Moirai handles irregularly sampled sensor telemetry natively through its patch encoder. Run Moirai-Large over a fleet of industrial sensors, flag series where the P90 forecast deviates from observed values by more than 3 sigma, and route anomalous series to a human review queue. Moirai's multivariate support lets you include temperature, pressure, and vibration readings as covariates for better anomaly context.
Ad bidding and revenue forecasting. Generate 15-minute-resolution CTR and revenue forecasts for budget pacing in programmatic advertising. Low latency requirements (sub-second per series) suit TimesFM's decoder-only architecture. At 15-minute granularity with a 4-hour prediction horizon, you get 16 steps per series, and TimesFM processes batches of 1000 series in under a second on an L40S.
Fine-Tuning vs Zero-Shot: When Each Wins
| Scenario | Recommendation |
|---|---|
| Standard retail seasonality, 100+ history points | Zero-shot (Chronos or TimesFM) |
| Irregular sampling rates, multivariate covariates | Zero-shot (Moirai) |
| Proprietary domain, 10K+ training series available | Fine-tune Lag-Llama with LoRA |
| Very short series (fewer than 50 points) | Classical (ARIMA, ETS) - foundation models need context |
| Real-time streaming with sub-second latency | TimesFM zero-shot (fastest architecture) |
| Latency-tolerant batch, want smallest possible model | Chronos-Bolt (up to 250x faster than T5, 20x more memory-efficient, similar accuracy) |
For fine-tuning Lag-Llama with LoRA: the Llama backbone means standard PEFT tooling applies directly. You can fine-tune on a single L40S for domain adaptation, then serve the base model plus adapter weights from a single instance. See the LoRA multi-adapter serving guide for serving multiple fine-tuned adapters on one GPU.
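Why LoRA fine-tuning is cheap enough for a single L40S: instead of updating a full weight matrix W, you train two small matrices whose product is the update. A plain-NumPy illustration of the parameter savings (the dimensions are illustrative, and this is the underlying math, not the Lag-Llama training loop, which standard PEFT tooling handles):

```python
import numpy as np

d, r = 512, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

W_eff = W + B @ A                    # effective weight during fine-tuning

full_params = W.size                 # 262,144 if trained directly
lora_params = A.size + B.size        # 8,192 -> ~3% of the full matrix
```

Because B starts at zero, `W_eff` equals the pretrained weight at step zero, so fine-tuning begins exactly from the zero-shot model and only the small A and B matrices accumulate gradients.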
Cost Comparison: GPU Cloud vs Managed Forecasting Services
| Service | Pricing model | Cost per 1K time series | Notes |
|---|---|---|---|
| AWS Forecast | Per forecasted time series | $0.60 | Minimum 1K time series, no spot pricing |
| GCP Vertex AI Forecasting | Per 1,000 rows at prediction | ~$0.60 | AutoML pricing |
| SageMaker Canvas (time series) | Session + inference | Variable | Session-based billing adds overhead |
| Chronos on Spheron L40S | GPU cost / forecasts | ~$0.0005 | At scale batch jobs (~420 forecasts/sec, $0.72/hr) |
Managed services win for low-volume, occasional forecasting jobs where you do not want to manage infrastructure. Below roughly 100K forecasted time steps per month, the setup cost and operational overhead of self-hosting outweighs the per-step savings.
Self-hosting wins decisively above ~1 million forecasts per month. At that scale, the cost difference is over three orders of magnitude. Chronos-T5-Large on a single L40S processes millions of forecasted steps per hour at a fraction of managed API cost.
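The break-even point can be sketched from the table's per-1K prices plus whatever you assign as fixed monthly self-hosting overhead. The overhead figure below is a placeholder assumption for engineer time and monitoring, not a benchmark:

```python
MANAGED_PER_1K = 0.60       # AWS Forecast, $ per 1K series
SELF_HOSTED_PER_1K = 0.0005 # Chronos on L40S, $ per 1K series
MONTHLY_OVERHEAD = 500.0    # placeholder: ops time, monitoring, etc.

def monthly_cost(series_per_month: float, per_1k: float, fixed: float = 0.0) -> float:
    """Total monthly spend: per-unit cost plus any fixed overhead."""
    return series_per_month / 1000 * per_1k + fixed

for volume in (100_000, 1_000_000, 10_000_000):
    managed = monthly_cost(volume, MANAGED_PER_1K)
    selfhost = monthly_cost(volume, SELF_HOSTED_PER_1K, MONTHLY_OVERHEAD)
    print(f"{volume:>10,}: managed ${managed:,.0f} vs self-hosted ${selfhost:,.2f}")
```

Under these assumptions, managed wins at 100K series/month ($60 vs ~$500) and self-hosting wins from roughly 1M series/month onward, matching the thresholds above.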
For the broader make-vs-buy analysis framework, see LLM inference on-premise vs GPU cloud.
Deploy Time Series Models on Spheron: Step-by-Step
- Log in to app.spheron.ai. Navigate to GPU Marketplace and rent an L40S on Spheron for batch jobs or an A100 on Spheron for larger models or mixed workloads. Choose spot pricing (when available) for overnight batch runs and on-demand for sustained serving.
- SSH into the instance once provisioned. Verify CUDA with `nvidia-smi`; you should see the L40S or A100 listed with full VRAM available. For SSH key setup and instance configuration, see docs.spheron.ai.
- Install dependencies:

```bash
pip install chronos-forecasting bentoml torch torchvision torchaudio
# or for Moirai
pip install uni2ts bentoml torch
# or for TimesFM
pip install "timesfm[torch]" bentoml
```

- Sanity check the model on synthetic data:
```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

context = torch.randn(1, 168)
forecast = pipeline.predict(context, prediction_length=24, num_samples=100)
print(f"P50 forecast shape: {forecast.quantile(0.5, dim=1).shape}")  # should be (1, 24)
```

- For batch production serving, run the BentoML service from the earlier section:
```bash
bentoml serve service:ChronosService --port 8080 --workers 1
```

- Keep the service alive across sessions using `screen` or `tmux`:

```bash
screen -S forecaster
bentoml serve service:ChronosService --port 8080
# Detach with Ctrl+A, D
```

- For overnight batch jobs on spot instances, checkpoint progress to disk after every N series so an interrupted job can resume. Write completed series indices to a file and skip them on restart. Use an atomic write (temp file + `os.replace`) so a preemption between `open()` and `json.dump()` does not leave an empty or partial checkpoint that crashes the next run:
```python
import json, os

CHECKPOINT = "completed_series.json"

if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        completed = set(json.load(f))
else:
    completed = set()

for idx, series in enumerate(all_series):
    if idx in completed:
        continue
    result = pipeline.predict(series, prediction_length=24)
    save_result(idx, result)
    completed.add(idx)

    # Atomic checkpoint: write to a temp file, then swap into place
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(list(completed), f)
    os.replace(tmp, CHECKPOINT)  # atomic swap, safe on preemption
```

Pitfalls and Known Limitations
- Covariate handling: Chronos and Lag-Llama are univariate-only. Each series is predicted independently with no external regressors. Moirai supports covariates via its patch encoder. If promotions, holidays, or external signals are important for your forecast accuracy, use Moirai or pre-encode the covariates into the time series context before passing it to Chronos.
- Irregular timestamps: Standard Chronos assumes regular frequency. Pass irregularly sampled data to Chronos and it will misinterpret the gaps as part of the signal. Use Moirai for series with gaps or variable sampling rates.
- Long-horizon degradation: All four models degrade beyond their trained maximum prediction length. Chronos-T5 at `prediction_length > 64` shows increasing uncertainty spread. TimesFM is more stable out to 512 steps. Validate on your held-out horizon before committing to production.
- Cold-start on very short series: Models need at least 50-100 historical observations for reliable zero-shot output. Below that threshold, statistical methods (Holt-Winters, ARIMA) typically outperform foundation models. Do not use zero-shot foundation models on series with fewer than 50 data points without validating on held-out data.
- Memory pressure with large batches and long context: Moirai with `context_length=5000` and `batch_size=64` will OOM on a 48 GB card. In practice, context above 1,000 steps with `batch_size > 16` is risky. Profile memory with `torch.cuda.memory_summary(device="cuda", abbreviated=True)` before setting production batch sizes, and reduce `batch_size` if you see memory pressure warnings.
- Chronos-Bolt vs Chronos-T5: Chronos-Bolt (`amazon/chronos-bolt-small`, `amazon/chronos-bolt-base`, etc.) is the distilled family. It is up to 250x faster and 20x more memory-efficient than Chronos-T5, at a modest accuracy tradeoff. For most production batch jobs, Chronos-Bolt-Base is the better default: faster, cheaper, and accurate enough for standard seasonality patterns.
- Moirai commercial license restriction: Moirai is licensed under CC BY-NC 4.0, which prohibits commercial use without a commercial license from Salesforce. Do not deploy Moirai in revenue-generating production pipelines without confirming your licensing compliance. Chronos, TimesFM, and Lag-Llama are Apache 2.0 and commercially unrestricted.
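Two of the pitfalls above — irregular timestamps and very short series — can be guarded against with a small preprocessing gate before anything reaches the model. A sketch using pandas; the one-hour frequency and 50-point threshold are illustrative choices, not fixed requirements:

```python
import pandas as pd

MIN_HISTORY = 50  # below this, fall back to a classical method

def prepare_for_chronos(ts: pd.Series):
    """Regularize an irregularly sampled series; return None if too short."""
    # Resample to a fixed frequency so gaps are not misread as signal,
    # then fill the holes by time-weighted interpolation.
    regular = ts.resample("1h").mean().interpolate(method="time")
    if len(regular) < MIN_HISTORY:
        return None  # route to Holt-Winters / ARIMA instead
    return regular

# Irregularly spaced observations over ~2 days
idx = pd.to_datetime(["2026-01-01 00:00", "2026-01-01 03:30",
                      "2026-01-02 12:00", "2026-01-03 01:15"])
raw = pd.Series([1.0, 2.0, 5.0, 3.0], index=idx)
clean = prepare_for_chronos(raw)
```

Series that come back `None` go to the classical fallback path; everything else is now on a regular grid and safe to batch into Chronos.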
Time series foundation models shift batch forecasting from a managed-service line item to a self-hosted GPU workload. For overnight demand planning runs across thousands of SKUs, a single L40S on-demand instance on Spheron delivers cost-per-million-forecasts well below AWS Forecast. Rent an A100 80GB for larger models or mixed workloads.
