GPU spend is now the #1 FinOps concern for AI-first organizations, surpassing general cloud costs for the first time in the FinOps Foundation's 2026 State of FinOps report. The problem isn't just the size of the bill - it's that nobody knows which team owns which slice of it. This guide covers the four cost-allocation models, the five-dimension tagging schema, a Prometheus/DCGM/Grafana monitoring stack, and the chargeback math that actually works for enterprise AI teams. For the tactical side of reducing that bill once you can see it, see the GPU cost optimization playbook and AI inference cost economics.
Why Traditional Cloud FinOps Breaks for GPU Workloads
AWS Cost Explorer, GCP Billing Console, and Azure Cost Management all operate at the instance level. They can tell you that your p5.48xlarge cost roughly $39,600 last month (AWS us-east-1 list price, June 2026). They cannot tell you that Team NLP consumed 40% of it and Team CV consumed 25%.
Three structural problems make hyperscaler FinOps tooling inadequate for GPU workloads:
Tag propagation stops at the instance. A team=nlp label on a Kubernetes pod does not appear in DCGM Exporter GPU metrics by default. Tags on compute instances don't flow into GPU utilization counters or VRAM usage time-series. You have to configure that join explicitly in Prometheus relabeling rules, and most teams don't.
Multi-tenant inference hides attribution. One vLLM pod serving five teams shows up as a single line item in your billing console. The $23,000 monthly inference bill is attributed to the infrastructure team that provisioned it, not the product teams consuming it. For a detailed look at the billing structure that causes this, see how to avoid unexpected AWS costs.
Reserved instance credit pools obscure per-team cost. If your org uses AWS Savings Plans or committed-use discounts, the credits get applied across the account before individual cost centers see their spend. A team that budgeted $8,000/month might see $5,200 after credits are applied - but those credits were earned by a different team's reserved workload. Chargeback math becomes an accounting exercise, not an engineering one.
GPU cloud egress compounds the problem further. The GPU cloud egress costs guide documents how egress can add 15-25% on top of raw compute costs - costs that appear in a separate line item and are frequently missed in team-level attribution.
The Four Cost-Allocation Models for AI Infrastructure
There's no single right approach. The correct model depends on how your teams are structured and what granularity finance needs for chargeback. Most organizations end up using two models simultaneously.
Per-Namespace (Kubernetes Teams)
One Kubernetes namespace per team, with DCGM metrics relabeled using the namespace label. This is the simplest implementation and works well for orgs with 3-8 teams that each own distinct namespaces.
Attribution formula:
team_cost = (namespace_gpu_seconds / total_gpu_seconds) * cluster_monthly_cost
Limitations: shared infrastructure in kube-system, gpu-operator, or monitoring namespaces is unattributed overhead. You'll need to decide whether to distribute that overhead proportionally or absorb it into a platform budget.
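As a minimal sketch of the attribution formula above, the function below splits a monthly cluster bill by GPU-seconds and, optionally, spreads overhead namespaces across teams in proportion to their own usage. The function name, the input dictionary, and the sample figures are illustrative assumptions, not part of any tool's API.

```python
def allocate_by_namespace(gpu_seconds, cluster_monthly_cost, overhead_namespaces=()):
    """Split a cluster bill by each namespace's share of GPU-seconds.

    Namespaces listed in overhead_namespaces are treated as shared overhead.
    Dividing by the teams' own total (rather than the cluster total) spreads
    that overhead across teams in proportion to their usage.
    """
    team_seconds = {
        ns: seconds
        for ns, seconds in gpu_seconds.items()
        if ns not in overhead_namespaces
    }
    billable = sum(team_seconds.values())
    if billable == 0:
        raise ValueError("no team GPU-seconds to allocate against")
    return {
        ns: cluster_monthly_cost * seconds / billable
        for ns, seconds in team_seconds.items()
    }


# Hypothetical month: two team namespaces plus gpu-operator overhead.
usage = {"nlp": 600_000, "cv": 300_000, "gpu-operator": 100_000}
costs = allocate_by_namespace(usage, 22_579.20, overhead_namespaces={"gpu-operator"})
```

Leaving `overhead_namespaces` empty gives the plain formula, with overhead namespaces showing up as their own line items for a platform budget to absorb.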
Per-Tenant (Multi-Tenant Inference)
Multiple teams share one or more inference pods. This requires a proxy layer - LiteLLM, Kong, or nginx - to inject tenant headers before requests reach vLLM. Attribution is either token-weighted (proportional to token volume) or request-weighted (proportional to request count).
Token weighting is almost always more accurate. A team running large 4K-context requests consumes far more GPU time per request than a team running short classification tasks.
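To see how far the two weightings can diverge, here is a small sketch with made-up numbers: one team sends few long-context requests, the other sends many short ones. The team names and volumes are hypothetical.

```python
def weighted_shares(usage, key):
    """Return each team's fraction of the total for the given usage key."""
    total = sum(team_usage[key] for team_usage in usage.values())
    return {team: team_usage[key] / total for team, team_usage in usage.items()}


# Hypothetical hourly usage on one shared inference pod.
usage = {
    "search": {"requests": 1_000, "tokens": 4_000_000},  # long-context requests
    "triage": {"requests": 9_000, "tokens": 1_000_000},  # short classification calls
}

by_request = weighted_shares(usage, "requests")  # search gets 10% of the bill
by_token = weighted_shares(usage, "tokens")      # search gets 80% of the bill
```

Under request weighting the long-context team would pay a tenth of the bill while generating four fifths of the token volume.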
Per-Token (Model-Level)
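The most granular approach, and the most operationally complex. Use vLLM's /metrics endpoint: vllm:prompt_tokens_total and vllm:generation_tokens_total, which are labeled by model name. These counters carry no per-request or per-tenant label, so splitting them by team relies on request logs from the proxy layer.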
Cost per token:
cost_per_token = gpu_cost_per_second * avg_gpu_seconds_per_token
Useful for teams that want to bill internal product teams or external API customers by the token. The operational overhead is significant: you need to track latency per request, join it with GPU utilization, and handle edge cases like prefix caching (tokens processed once but logged as multiple requests).
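A rough sketch of the per-token formula, assuming the average GPU-seconds per token is approximated as the inverse of the endpoint's aggregate token throughput. The function names are illustrative, and the sample figures ($31.36/hr for the node, 10M tokens/hr) are the ones this guide uses for its shared-endpoint example.

```python
def cost_per_token(gpu_cost_per_hour, tokens_per_second):
    """Cost of one token, given the serving cost per hour and aggregate throughput."""
    gpu_cost_per_second = gpu_cost_per_hour / 3600
    # Assumption: average GPU-seconds per token is 1 / throughput.
    avg_gpu_seconds_per_token = 1 / tokens_per_second
    return gpu_cost_per_second * avg_gpu_seconds_per_token


def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second):
    """Same figure scaled to the usual per-million-token unit."""
    return cost_per_token(gpu_cost_per_hour, tokens_per_second) * 1_000_000


# 10M tokens/hr expressed as tokens per second.
throughput = 10_000_000 / 3600
price = cost_per_million_tokens(31.36, throughput)  # about $3.14 per million tokens
```

This treats prompt and generation tokens as equally expensive, which is a simplification; weighting them separately would need per-phase timing data.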
For KV cache optimization strategies that affect token attribution, see the LLM serving optimization guide.
Per-Experiment (Training Attribution)
Training run cost is straightforward to compute: GPU count × hours to convergence × GPU hourly rate. The challenge is tagging.
Tag Kubernetes batch jobs:
metadata:
  labels:
    team: "nlp"
    project: "llama-70b-finetune"
    experiment: "lr-sweep-42"
    cost-center: "product-ai"
Tag Slurm jobs with the --comment field:
#SBATCH --job-name=llama-finetune
#SBATCH --comment='{"team":"nlp","project":"llama-70b-finetune","experiment":"run-42"}'
Parse with sacct -j <job_id> --format=JobID,Comment,Elapsed,AllocGRES. On recent Slurm releases that no longer accept AllocGRES, use AllocTRES instead.
Tag Ray clusters:
ray.init(runtime_env={"env_vars": {"TEAM": "nlp", "PROJECT": "llama-finetune"}})
Comparison:
| Model | Granularity | Complexity | Best for |
|---|---|---|---|
| Per-Namespace | Team | Low | Small orgs with namespace isolation |
| Per-Tenant | Team + service | Medium | Multi-tenant inference platforms |
| Per-Token | Model + team | High | API billing, LLM platforms |
| Per-Experiment | Run-level | Medium | Training workloads |
Tagging Strategy for GPU Cloud Workloads
The Five Core Tag Dimensions
Five tag dimensions cover 95% of chargeback use cases. Apply them consistently across every layer of your stack:
- team - owning group (e.g., team=nlp, team=cv, team=platform)
- project - specific initiative (e.g., project=llama-finetune, project=prod-inference)
- environment - env=dev, env=staging, env=prod
- model-name - model variant being served (e.g., model=llama-3-70b, model=mistral-8x22b)
- cost-center - finance system mapping (e.g., cc=product-ai, cc=platform-infra)
Missing any of these creates attribution gaps. A job tagged with team but not project can be charged to the right group but not the right initiative. A job tagged with both but not cost-center can't be reconciled with your finance system's budget lines.
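One way to catch these gaps before they reach a report is a small label check in CI or an admission hook. The sketch below is illustrative only: the tag keys follow the five dimensions listed above, while the function name and sample labels are assumptions.

```python
REQUIRED_TAGS = ("team", "project", "environment", "model-name", "cost-center")


def missing_tags(labels):
    """Return the required tag keys that are absent or empty in a label mapping."""
    return [tag for tag in REQUIRED_TAGS if not labels.get(tag)]


# A job tagged with team and project only cannot be reconciled with finance.
gaps = missing_tags({"team": "nlp", "project": "llama-finetune"})
# gaps == ["environment", "model-name", "cost-center"]
```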
Tag Propagation Through the Stack
The propagation chain: Kubernetes namespace labels -> Pod labels -> DCGM Exporter relabeling -> Prometheus -> Grafana.
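DCGM Exporter does not automatically join Kubernetes pod labels onto GPU metrics, and Prometheus relabeling cannot read from kube-state-metrics: a relabel rule only sees labels already present on the scraped series. The rule below therefore derives team from the namespace name, which works when namespaces are named after teams. Attaching arbitrary pod labels requires a query-time join against the kube_pod_labels series from kube-state-metrics.
# prometheus-scrape-config.yaml
- job_name: dcgm-exporter
  scrape_interval: 15s
  static_configs:
    - targets: ['dcgm-exporter:9400']
  metric_relabel_configs:
    # DCGM Exporter emits pod/namespace labels. Prometheus renames them to
    # exported_pod/exported_namespace only when the target already carries
    # labels with those names, so adjust source_labels to match your series.
    - source_labels: [exported_pod, exported_namespace]
      target_label: team
      regex: '.*;(nlp|cv|platform|research|product)'
      replacement: '$1'
For vLLM, set the served model name with the --served-model-name flag. This value appears in the model_name label on vLLM's Prometheus metrics, which you can join with GPU cost data in Grafana.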
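The namespace-based DCGM relabeling rule earlier in this section can be sanity-checked offline. The sketch below mimics two documented Prometheus behaviors: source label values are joined with ";" and the regex must match the whole joined string. It is a Python approximation for testing namespace names, not Prometheus itself, and the function name is hypothetical.

```python
import re

# Same pattern as the relabel rule; Prometheus anchors it at both ends.
TEAM_RULE = re.compile(r".*;(nlp|cv|platform|research|product)")


def derive_team(pod, namespace):
    """Return the team label the rule would set, or None if it would not fire."""
    joined = f"{pod};{namespace}"  # default separator between source labels
    match = TEAM_RULE.fullmatch(joined)
    return match.group(1) if match else None


derive_team("vllm-0", "nlp")          # "nlp"
derive_team("vllm-0", "nlp-staging")  # None: the namespace must match exactly
derive_team("coredns-1", "kube-system")  # None: series stays unattributed
```

The second case is worth noting: a namespace like nlp-staging falls through the rule, so its GPU time lands in the unattributed bucket unless the regex is widened.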
For Slurm, the sacct job accounting database stores the --comment JSON you set at submission. A nightly ETL job that reads sacct output and pushes it to your cost dashboard is enough for most research orgs.
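The parsing half of such an ETL job might look like the sketch below. It assumes pipe-delimited output (sacct's --parsable2 format) with a header row, a Comment column holding the JSON tags verbatim, and an untyped GPU count in the AllocTRES or AllocGRES column; the sample text and function names are made up for illustration.

```python
import json
import re

# Hypothetical sacct --parsable2 output; the second row is a job step with no comment.
SAMPLE = """JobID|Comment|Elapsed|AllocTRES
1001|{"team":"nlp","project":"llama-70b-finetune","experiment":"run-42"}|02:30:00|billing=8,cpu=64,gres/gpu=8,node=1
1001.batch||02:30:00|cpu=64,mem=0,node=1"""


def elapsed_to_seconds(elapsed):
    """Convert sacct's [DD-]HH:MM:SS elapsed format to seconds."""
    days, _, clock = elapsed.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def parse_sacct(text):
    """Return one record per job whose Comment field holds a JSON object of tags."""
    lines = text.strip().splitlines()
    header = lines[0].split("|")
    records = []
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        try:
            tags = json.loads(row.get("Comment", ""))
        except json.JSONDecodeError:
            continue  # job steps and untagged jobs have no JSON comment
        if not isinstance(tags, dict):
            continue
        # Simplification: reads an untyped count such as gres/gpu=8 or gpu:8.
        gres = row.get("AllocTRES") or row.get("AllocGRES", "")
        match = re.search(r"gpu[:=](\d+)", gres)
        gpus = int(match.group(1)) if match else 0
        hours = elapsed_to_seconds(row["Elapsed"]) / 3600
        records.append({**tags, "job_id": row["JobID"], "gpu_hours": gpus * hours})
    return records


records = parse_sacct(SAMPLE)  # one record: team nlp, 20.0 GPU-hours
```

A comment containing a literal "|" would break the column split, so keep tag values free of that character or switch to a different delimiter.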
Building a Per-Project Chargeback Dashboard
The Observability Stack
Four components are needed for a complete GPU chargeback dashboard:
- DCGM Exporter (DaemonSet) - GPU utilization, memory, temperature per pod
- Prometheus - scrapes DCGM + vLLM metrics, stores time-series with team/project labels
- Grafana - visualization, alerting, and cost attribution table panel
- OpenCost or Kubecost (optional) - Kubernetes cost allocation layer that reads Prometheus and maps to billing
OpenCost is worth adding once you're beyond 3-4 teams. It automates the join between Kubernetes resource usage and provider pricing, reducing the amount of custom PromQL you need to write.
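A sample Prometheus scrape config that copies Kubernetes pod labels onto scraped series at discovery time. The __meta_kubernetes_pod_label_* values describe the pod being scraped, so when that pod is the DCGM Exporter DaemonSet they are the exporter pod's own labels; workload attribution then still depends on the pod and namespace labels DCGM Exporter attaches to each series:
scrape_configs:
  - job_name: 'dcgm-exporter'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Labels of the discovered (scraped) pod, copied onto every series it exposes.
      - source_labels: [__meta_kubernetes_pod_label_team]
        target_label: team
      - source_labels: [__meta_kubernetes_pod_label_project]
        target_label: project
      - source_labels: [__meta_kubernetes_pod_label_cost_center]
        target_label: cost_center
GPU Cost Attribution Formula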
Daily team cost from Prometheus metrics:
daily_team_cost = SUM(gpu_seconds_by_pod{team="nlp"}) / 3600 * gpu_hourly_rate
For Spheron, gpu_hourly_rate is the on-demand or spot rate from the Spheron billing API (or the /pricing/ page). For spot instances, use the actual rate at each interval - spot prices can shift during a long training job, so time-weighted averaging is more accurate than a single rate.
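A minimal sketch of the time-weighted approach for spot pricing, assuming you can obtain a list of (duration, rate) intervals for the job. The second rate below ($3.40) is a made-up value for illustration, and the function names are hypothetical.

```python
def time_weighted_cost(intervals, gpu_count):
    """Sum cost over (seconds, hourly_rate_per_gpu) intervals."""
    return sum(gpu_count * seconds / 3600 * rate for seconds, rate in intervals)


def effective_hourly_rate(intervals):
    """Duration-weighted average hourly rate per GPU across the intervals."""
    total_seconds = sum(seconds for seconds, _ in intervals)
    return sum(seconds * rate for seconds, rate in intervals) / total_seconds


# Hypothetical 8-hour job on 8 GPUs: 6 hours at $2.91, then 2 hours at $3.40.
intervals = [(6 * 3600, 2.91), (2 * 3600, 3.40)]
job_cost = time_weighted_cost(intervals, gpu_count=8)  # about $194.08
avg_rate = effective_hourly_rate(intervals)            # about $3.0325 per GPU-hour
```

Pricing the whole job at the starting rate of $2.91 would have understated it by roughly $7.84 in this example.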
Token Counter Integration
Add vLLM's /metrics endpoint to your Prometheus scrape config:
- job_name: 'vllm'
  static_configs:
    - targets: ['vllm-service:8000']
  metrics_path: '/metrics'
Key metrics: vllm:prompt_tokens_total and vllm:generation_tokens_total. Build a Grafana panel with a stacked bar showing prompt and generation tokens by team label over time.
Cost per million tokens:
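cost_per_million_tokens = gpu_cost_per_hour / (tokens_per_second * 3600) * 1_000_000
Here tokens_per_second is the combined prompt and generation token rate, for example the sum of rate() over the two vLLM counters.
Showback vs Chargeback: When to Bill Real Money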
Showback is attribution-only reporting. Teams see what they would be charged, but no budget transfer happens. It's a reporting mechanism, not a billing mechanism. Most orgs start here.
Chargeback transfers the actual cost to the team's budget or cost center via a finance system integration. This requires buy-in from finance and team leads, accurate tagging across all workloads, and agreement on how to handle shared infrastructure overhead.
Recommended rollout: 4-6 weeks of showback to validate tagging coverage, then switch to chargeback.
Warning signs that chargeback is premature:
| Signal | Implication |
|---|---|
| Tagging coverage below 80% | 20%+ of costs will be unattributed - teams will dispute the numbers |
| No team-level budgets in finance | Chargeback has nowhere to land - finance can't book the transfer |
| Unresolved shared infra overhead | Platform team gets stuck with unattributed costs nobody owns |
| Tags inconsistent across dev/staging/prod | Attribution breaks across environments, producing garbage reports |
Start showback immediately when you have tagging in place. Move to chargeback only when the above signals are resolved.
Per-Token Attribution for Shared LLM Endpoints
The most common multi-tenant setup: 5 teams sharing one vLLM instance running Llama 3.1 70B on an 8x H100 SXM5 node.
Architecture: LiteLLM proxy in front of vLLM, configured with virtual keys per team. Each virtual key carries an x-team-id and x-project-id header that LiteLLM injects on every request before forwarding to vLLM. LiteLLM's request logs, combined with vLLM's Prometheus metrics, give you per-team token volumes.
Attribution math:
team_hourly_cost = (team_tokens_per_hour / total_tokens_per_hour) * node_gpu_cost_per_hour
Worked example: 8x H100 SXM5 on Spheron at $3.92/GPU/hr on-demand = $31.36/hr for the node. At spot pricing ($2.91/GPU/hr), the same node costs $23.28/hr. Running Llama 3.1 70B with vLLM, the cluster handles roughly 10M tokens/hr across all teams.
| Team | Token share | Hourly cost attribution |
|---|---|---|
| NLP | 40% (4M tokens/hr) | $12.54/hr |
| CV | 25% (2.5M tokens/hr) | $7.84/hr |
| Platform | 15% (1.5M tokens/hr) | $4.70/hr |
| Research | 12% (1.2M tokens/hr) | $3.76/hr |
| Product | 8% (0.8M tokens/hr) | $2.51/hr |
| Total | 100% | $31.36/hr |
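The table can be reproduced with a few lines of code. This is a sketch of the token-share formula using the worked example's figures; the function name and dictionary layout are illustrative.

```python
def attribute_hourly_cost(tokens_per_hour, node_cost_per_hour):
    """Split the node's hourly cost by each team's share of token volume."""
    total_tokens = sum(tokens_per_hour.values())
    return {
        team: node_cost_per_hour * tokens / total_tokens
        for team, tokens in tokens_per_hour.items()
    }


tokens = {
    "NLP": 4_000_000,
    "CV": 2_500_000,
    "Platform": 1_500_000,
    "Research": 1_200_000,
    "Product": 800_000,
}
node_cost = 8 * 3.92  # 8x H100 at the on-demand rate from the example
costs = attribute_hourly_cost(tokens, node_cost)
rounded = {team: round(cost, 2) for team, cost in costs.items()}
```

Because each row is rounded to the cent, the rounded per-team figures can sum to a cent less than the node total; keep unrounded values for the ledger and round only for display.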
Edge cases:
Prefix caching: vLLM's prefix cache processes shared prompt prefixes once and reuses the KV cache for subsequent requests. The tokens are processed once but logged against the first requester. The simplest workaround is to attribute prefix-cached tokens to a platform overhead pool rather than any single team, then distribute the overhead proportionally.
KV cache sharing across long context requests introduces similar accounting complexity. For workloads where this matters, track vllm:num_preemptions_total to estimate how often requests get preempted and KV cache evicted - high preemption rates usually indicate the attribution is already approximate anyway.
Training Run Cost Attribution
Tagging Kubernetes Jobs and Ray Clusters
Every training job should carry the five-dimension tag schema:
# K8s batch job with FinOps labels
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-70b-finetune-run-42
  labels:
    team: "nlp"
    project: "llama-70b-finetune"
    experiment: "lr-sweep-42"
    cost-center: "product-ai"
    environment: "prod"
For Ray clusters:
ray.init(
    runtime_env={
        "env_vars": {
            "TEAM": "nlp",
            "PROJECT": "llama-70b-finetune",
            "EXPERIMENT": "run-42",
            "COST_CENTER": "product-ai"
        }
    }
)
Final Attribution Formula
training_run_cost = gpu_count * wall_clock_hours * gpu_hourly_rate
For Spheron's per-second billing, use:
training_run_cost = gpu_count * wall_clock_seconds * gpu_per_second_rate
This matters for short jobs. A 40-minute training run on Spheron costs 40/60 of an hour's rate. On a provider that bills in whole-hour increments, the same run costs a full hour. For orgs running many short experimental runs, this per-second billing model is inherently more chargeback-friendly: the math is always exact, with no rounding artifacts.
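The difference is easy to quantify. The sketch below compares per-second billing with a hypothetical scheme that rounds each run up to whole hours, using the 40-minute run and the $3.92/GPU/hr rate from this guide; the function names are illustrative.

```python
import math


def per_second_cost(gpu_count, wall_clock_seconds, gpu_hourly_rate):
    """Cost when every second is billed at the hourly rate divided by 3600."""
    gpu_per_second_rate = gpu_hourly_rate / 3600
    return gpu_count * wall_clock_seconds * gpu_per_second_rate


def hourly_increment_cost(gpu_count, wall_clock_seconds, gpu_hourly_rate):
    """Cost when usage is rounded up to whole hours (hypothetical billing scheme)."""
    billed_hours = math.ceil(wall_clock_seconds / 3600)
    return gpu_count * billed_hours * gpu_hourly_rate


run_seconds = 40 * 60
exact = per_second_cost(8, run_seconds, 3.92)           # about $20.91
rounded_up = hourly_increment_cost(8, run_seconds, 3.92)  # $31.36
```

For a sweep of many short runs the gap compounds, since every run pays the rounding penalty separately.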
Reserved vs Spot vs On-Demand: Budgeting Rules
| Workload type | Recommended pricing | Rationale |
|---|---|---|
| Production inference (always-on) | On-demand or committed | No interruption risk, predictable baseline |
| Dev/staging inference | On-demand or spot | Cost matters, interruption acceptable |
| Training runs (long, stateful) | Spot with checkpointing | 60-70% savings, resume from checkpoint on preemption |
| Training runs (short, under 4 hrs) | On-demand | Checkpoint overhead not worth it |
| Batch inference (async) | Spot | Interruption-safe by design |
Budget rule: Reserve at your P5 utilization floor - the 5th-percentile GPU count, which you meet or exceed 95% of the time - and use spot or on-demand for burst capacity above that.
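One way to estimate that floor from historical data is a nearest-rank percentile over periodic samples of GPUs in use. The sketch below assumes evenly spaced samples (for example, hourly) and uses made-up numbers; the function name is hypothetical.

```python
import math


def utilization_floor(gpu_counts, percentile=5):
    """Nearest-rank percentile of sampled GPU counts.

    With percentile=5, roughly 95% of samples are at or above the result.
    """
    if not gpu_counts:
        raise ValueError("need at least one sample")
    ordered = sorted(gpu_counts)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical 100 hourly samples: brief dips to 4 GPUs, a steady 8, bursts to 16.
samples = [4] * 3 + [8] * 47 + [16] * 50
floor = utilization_floor(samples)  # 8
```

Here the rule would reserve 8 GPUs and cover the bursts to 16 with spot or on-demand capacity.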
Committed-spend programs on hyperscalers (1-year, 3-year) create credit pools that complicate per-team attribution. A platform team that signs a 1-year H100 reserved instance deal distributes credits across all teams sharing the node. For a clean FinOps model, you need to track the committed-spend apportionment separately from the per-hour rate - which adds another layer to your attribution logic.
For a detailed breakdown of how billing models compare across workload types, see the serverless vs on-demand vs reserved GPU guide. For a cross-provider rate breakdown covering AWS, Lambda, RunPod, and others, see the GPU cloud pricing comparison for 2026.
Spheron FinOps Advantage: Per-Second Billing Makes Chargeback Math Straightforward
Hyperscaler GPU FinOps involves unpacking: reserved instance credits, savings plans, bundled networking and EBS charges, opaque support tiers, and cross-AZ data transfer fees. Each of these adds noise to per-team attribution and requires reconciliation that your platform team has to do manually every billing cycle.
Spheron's billing model eliminates most of that complexity. Per-second billing, transparent on-demand and spot rates per GPU, no bundled networking markup. The attribution formula for any team is:
team_monthly_cost = team_gpu_seconds * per_second_rate
No credits to reconcile, no committed-spend pools to apportion, no egress line items to allocate. Spheron aggregates GPU capacity from 5+ providers on the backend, but exposes it as clean, consistent rates on a single billing surface.
Worked Example: 5-Team Org Chargeback on Spheron vs AWS
30-day billing period. 5 teams sharing an 8x H100 SXM5 cluster at full utilization.
Spheron on-demand rate: $3.92/GPU/hr (H100 SXM5, from live Spheron API, June 2026)
Spheron spot rate: $2.91/GPU/hr (H100 SXM5, from live Spheron API, June 2026)
AWS p5.48xlarge rate: $6.88/GPU/hr (H100 SXM5 on-demand, us-east-1)
Total cluster cost per month (720 hours):
- Spheron on-demand: 8 GPUs × $3.92/hr × 720 hrs = $22,579.20
- Spheron spot: 8 GPUs × $2.91/hr × 720 hrs = $16,761.60
- AWS: 8 GPUs × $6.88/hr × 720 hrs = $39,628.80
Per-team attribution (proportional GPU-hour usage, on-demand rates):
| Team | GPU-Hour Share | Spheron Monthly Cost | AWS Monthly Cost |
|---|---|---|---|
| NLP | 40% (2,304 GPU-hrs) | $9,031.68 | $15,851.52 |
| CV | 25% (1,440 GPU-hrs) | $5,644.80 | $9,907.20 |
| Platform | 15% (864 GPU-hrs) | $3,386.88 | $5,944.32 |
| Research | 12% (691.2 GPU-hrs) | $2,709.50 | $4,755.46 |
| Product | 8% (460.8 GPU-hrs) | $1,806.34 | $3,170.30 |
| Total (on-demand) | 100% | $22,579.20 | $39,628.80 |
| Total (spot) | 100% | $16,761.60 | N/A |
The AWS figure above is raw compute only. Add data transfer ($900-$2,000/month for a high-traffic inference endpoint), EBS storage ($0.10-$0.20/GB/month for model checkpoints), and cross-AZ networking charges (all AWS list prices, us-east-1, June 2026) - and the real AWS total for this 5-team org is closer to $43,000-$44,000/month.
On Spheron, the chargeback math per team stays exactly as shown. No ancillary line items to reconcile.
Pricing fluctuates based on GPU availability. The prices above are as of 02 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Conclusion
GPU FinOps requires GPU-native tooling. Cost Explorer extensions, tag policies, and savings plan dashboards were built for EC2 instances, not GPU clusters shared across inference and training workloads. The four models - per-namespace, per-tenant, per-token, and per-experiment - give you the right granularity for each part of your stack. The five-dimension tagging schema, deployed consistently from Kubernetes labels through DCGM Exporter to Prometheus, is what makes those models produce accurate numbers. Start with showback, run it for 4-6 weeks to find the tagging gaps, then move to chargeback once the coverage is above 80%.
GPU FinOps starts with transparent per-second billing you can actually do math on. Spheron exposes clean per-GPU rates across on-demand and spot tiers with no bundled charges to untangle.
Quick Setup Guide
Decide which allocation granularity your org needs: per-namespace (Kubernetes teams), per-tenant (multi-tenant inference), per-token (model billing), or per-experiment (training attribution). Most orgs need at least two.
Apply team, project, environment, model-name, and cost-center labels to every Kubernetes namespace, node pool, and vLLM deployment. Verify tag propagation by querying DCGM Exporter metrics and confirming the labels appear on series such as DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED.
Install DCGM Exporter as a DaemonSet. Configure Prometheus to scrape DCGM metrics with relabeling rules that join pod labels (team, project) onto GPU metrics. Build a Grafana dashboard with three panels: GPU utilization by team, GPU memory by project, and cost attribution table.
Place a reverse proxy (LiteLLM or Kong) in front of each shared vLLM endpoint. Configure it to inject x-team-id and x-project-id headers. Export token counters to Prometheus using the vLLM metrics endpoint (/metrics). Join token counts with GPU cost in Grafana.
For on-demand: attributed_cost = team_gpu_seconds / 3600 * gpu_hourly_rate. For spot: use actual per-second billing from the Spheron billing API. Run this calculation weekly or monthly and push results to your finance system or internal cost dashboard.
Start by sending teams weekly showback reports via email or Slack. Run showback for 4-6 weeks to build trust in the numbers and surface tagging gaps. Once teams trust the data, switch to formal chargeback by integrating with your finance system's cost center codes.
Frequently Asked Questions
AWS Cost Explorer allocates cost at the instance level, not the GPU-hour level. When multiple teams share a GPU node or a multi-tenant vLLM endpoint, Cost Explorer can't split the bill by team or model. Tags don't propagate to GPU-hour counters, and there's no native per-token attribution for shared inference endpoints.
Showback gives teams visibility into what they would be charged - a reporting mechanism without real money moving. Chargeback transfers the actual cost to the team's budget or cost center. Showback is the starting point for most organizations; chargeback typically requires finance integration and internal billing agreements.
Deploy vLLM with an API gateway (LiteLLM, Kong, or Nginx) that injects a team or project header on every request. Export vLLM's built-in token counter metrics to Prometheus. Use a Grafana dashboard to join GPU-hour cost (from DCGM exporter) with token counts per team and compute attributed cost as: (team_tokens / total_tokens) * gpu_hour_cost.
Five dimensions cover 95% of chargeback use cases: team (which group owns the workload), project (the specific initiative or product), environment (dev/staging/prod), model-name (which LLM or model variant is running), and cost-center (the finance mapping for chargeback). Apply all five as Kubernetes namespace labels, node labels, and vLLM server tags so they propagate through the entire observability stack.
Spheron bills per-second with transparent on-demand and spot rates per GPU. There are no bundled networking charges, no reserved instance credits to reconcile, and no opaque commit tiers. For a 5-team org doing per-project chargeback, the attribution formula is: team_gpu_hours * per_second_rate * 3600. On AWS, you also need to unpack EBS, data transfer, support charges, and resolve reserved capacity across multiple accounts.
