Tutorial

LLM Observability on GPU Cloud: Deploy Langfuse, Arize Phoenix, and Helicone for Self-Hosted AI Tracing (2026 Guide)

Written by Mitrasish, Co-founder · Apr 25, 2026
LLM Observability Self Hosted · Langfuse GPU Cloud Deployment · Arize Phoenix Self Host · Helicone Self Host · LLM Tracing Infrastructure · vLLM Observability · SGLang Observability · OpenTelemetry LLM · GPU Cloud · AI Infrastructure

GPU-level failures are loud. When a node runs out of VRAM, hits an XID error, or triggers thermal throttling, your monitoring stack catches it immediately. But the failures that hurt production LLM systems most are silent: a specific prompt template that causes p99 latency to spike, a model version change that subtly degrades output quality, or a token budget that quietly inflates costs by 40% over three weeks. None of these show up in a Grafana GPU dashboard. That gap is what LLM observability fills.

If you haven't set up GPU-level monitoring yet, start with our GPU monitoring for ML guide first - this post assumes DCGM and Prometheus are already running.

What LLM Observability Actually Tracks

LLM observability captures four categories of signal that infrastructure monitoring misses entirely.

Span traces record each request as a structured event: request ID, model name, prompt text, response text, timestamp, and duration. A trace may span multiple hops (router, inference server, post-processor), and the trace tree shows where time was actually spent.

Token metrics track prompt token count, completion token count, and derived cost per request. Over time, you see cost distribution by user tier, prompt template, and model version. A prompt that costs $0.002 in isolation costs $200,000 at 100M calls.

Quality signals include eval scores from LLM-as-judge pipelines, hallucination flags from factual grounding checks, and human annotation labels when you run spot audits. These tell you whether the model is doing the job, not just whether it's responding.

Application context attaches user ID, session ID, environment tag, and version tag to every span. This is what makes debugging fast: you search for a specific session, see every request in order, and find the exact exchange where things went wrong.

GPU monitoring gives you gpu_sm_utilization and gpu_memory_used_bytes. LLM observability gives you p95_ttft broken down by prompt length bucket, model version, and user tier. Tying these signals back to your LLM-as-judge evaluation pipeline is how you catch quality drift before users notice it.
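To make the four categories concrete, here is a minimal sketch of manually instrumenting a single request with the Langfuse Python SDK (v2-style trace/generation API; the host, keys, model name, and score name are placeholders, not values from this guide). Each field maps to one of the signal categories above.

```python
from langfuse import Langfuse

# Placeholders: point at your self-hosted Langfuse instance (Node B) and project keys.
langfuse = Langfuse(
    host="http://<node-b-ip>:3000",
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)

# Application context: user, session, environment, version tags.
trace = langfuse.trace(
    name="chat-completion",
    user_id="user-123",
    session_id="sess-456",
    metadata={"environment": "production", "app_version": "1.4.2"},
)

# Span trace + token metrics for the actual model call.
generation = trace.generation(
    name="llama-3-70b-generate",
    model="meta-llama-3-70b",
    input="Summarize the attached incident report.",
)
generation.end(
    output="The outage was caused by ...",
    usage={"input": 812, "output": 214},  # prompt / completion token counts
)

# Quality signal: attach an eval score, e.g. from an LLM-as-judge pipeline.
langfuse.score(trace_id=trace.id, name="faithfulness", value=0.92)

langfuse.flush()  # make sure events are sent before the process exits
```

In production you would typically rely on the SDK integrations or OpenTelemetry instrumentation described below rather than hand-building every span, but the data model is the same.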

Langfuse vs Arize Phoenix vs Helicone vs LangSmith

| Feature | Langfuse | Arize Phoenix | Helicone | LangSmith |
|---|---|---|---|---|
| Self-host support | Yes (MIT) | Yes (ELv2) | Yes (Apache 2.0) | No (SaaS only) |
| Storage backend | Postgres + ClickHouse | SQLite / PostgreSQL | Postgres + ClickHouse | Hosted |
| SDK support | Python, JS, LangChain, LiteLLM, OpenAI | Python, OpenTelemetry | OpenAI-compatible proxy | LangChain, Python |
| Eval built-ins | Yes (rubric eval, annotation) | Yes (built-in eval templates) | Basic scoring | Yes (datasets, feedback) |
| OpenTelemetry native | Yes | Yes | Proxy-based | No |
| Multi-tenant | Yes | Limited | Yes | Yes |
| UI quality | Production-grade | Analysis-focused | Dashboard-focused | Production-grade |
| EU data residency | Self-hosted (your infra) | Self-hosted (your infra) | Self-hosted (your infra) | No (US-hosted) |
| Active OSS community | High | High | Medium | Proprietary |

For teams running vLLM or SGLang in production with multi-tenant isolation requirements, Langfuse is the strongest choice: mature architecture, broad SDK support, and ClickHouse for high-volume storage. Arize Phoenix suits teams doing heavy offline evaluation and experimentation - the built-in eval templates and local-first analysis UI make it faster to iterate on rubrics. Helicone works well as a drop-in proxy when you cannot modify inference server code to add SDK instrumentation; you route all traffic through a self-hosted Helicone instance and get centralized logging without touching the model server. LangSmith is excluded from self-host consideration since it has no self-hosted deployment option.

Reference Architecture: Tracing Collector and Storage on Spheron

A minimal two-node setup works well for most teams:

Node A (GPU inference): vLLM or SGLang inference server, DCGM Exporter on port 9400, and Prometheus node exporter. This node does the model work; it ships traces and metrics outbound to Node B.

Node B (observability): Langfuse server or Phoenix server, Postgres as the primary store, optional ClickHouse for high-volume trace storage, and Grafana for dashboards. A 4-core CPU node with 16 GB RAM handles this stack comfortably up to 5M spans/day.

Docker Compose for Node B deploying Langfuse with Postgres (minimal setup suitable for small-scale or demo deployments; production Langfuse v3 deployments also require Redis for caching and queuing, and S3-compatible blob storage for event persistence):

```yaml
# docker-compose.yml for observability node (Node B)
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: langfuse
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

  langfuse:
    image: langfuse/langfuse:latest
    depends_on: [postgres]
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:${POSTGRES_PASSWORD}@postgres:5432/langfuse
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      NEXTAUTH_URL: http://<node-b-ip>:3000
      SALT: ${SALT}

volumes:
  pgdata:
```

For deployments handling 1M+ spans/day, add ClickHouse as a secondary store for analytics queries:

```yaml
# Add to docker-compose.yml for 1M+ spans/day
  clickhouse:
    image: clickhouse/clickhouse-server:24.3
    # No external ports: ClickHouse is only reachable within the Docker network via service name
    environment:
      CLICKHOUSE_USER: ${CLICKHOUSE_USER}
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD}
    volumes:
      - chdata:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  langfuse-worker:
    image: langfuse/langfuse-worker:latest
    depends_on: [postgres, clickhouse]
    environment:
      DATABASE_URL: postgresql://langfuse:${POSTGRES_PASSWORD}@postgres:5432/langfuse
      CLICKHOUSE_URL: http://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@clickhouse:8123
      LANGFUSE_INGESTION_MAX_REQUEST_BODY_SIZE_MB: 4

volumes:
  chdata:
```

Instrumenting vLLM, SGLang, and TGI

vLLM

vLLM 0.4+ emits gen_ai.* semantic convention spans natively via OpenTelemetry. Set three environment variables on the inference node:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node-b-ip>:4318
OTEL_SERVICE_NAME=vllm-inference
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,model.name=meta-llama-3-70b
```

For vLLM versions below 0.4, wrap the OpenAI-compatible endpoint with the opentelemetry-instrumentation-openai package instead. See our vLLM production deployment guide for full server configuration.
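A hedged sketch of that client-side approach, assuming the opentelemetry-instrumentation-openai package and a standard OTLP HTTP exporter pointed at Node B (the endpoint, Node A address, and model name are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Standard OTel SDK setup: ship spans to the OTLP endpoint on Node B.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://<node-b-ip>:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

# Patch the OpenAI client so every completion call emits a span.
OpenAIInstrumentor().instrument()

# Point the patched client at the vLLM OpenAI-compatible endpoint on Node A.
client = OpenAI(base_url="http://<node-a-ip>:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama-3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```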

SGLang

SGLang exposes a Prometheus metrics endpoint at /metrics by default when started with --enable-metrics. To unify with trace data, pipe it through an OpenTelemetry Collector configured with a Prometheus receiver:

```yaml
# otel-collector-config.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: sglang
          static_configs:
            - targets: ["localhost:30000"]

exporters:
  otlp:
    endpoint: http://<node-b-ip>:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

For SGLang-specific tuning, refer to the SGLang production deployment guide.

TGI

Text Generation Inference has built-in OpenTelemetry support via the --otlp-endpoint flag:

```bash
docker run --gpus all ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-70b-instruct \
  --otlp-endpoint http://<node-b-ip>:4318
```

TGI spans include generation latency, queue wait time, and token counts out of the box with no additional instrumentation.

Correlating Traces with GPU Metrics

This is where observability closes the loop. The strategy is straightforward:

  1. DCGM Exporter exposes metrics labeled with gpu_uuid and gpu_index
  2. vLLM and SGLang log a request_id in every span
  3. The correlation key is time range plus GPU index: given a request with start_time=T, duration=D, query DCGM for avg(gpu_sm_utilization)[T:T+D] on the same GPU

In practice you build this correlation in Grafana by templating dashboards that accept a time range from a trace link. Click a slow span in Langfuse, copy the time range, and query DCGM metrics for that window.
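If you want the same lookup programmatically, for automated triage or for attaching GPU context back onto a span, the correlation is a single call to the Prometheus HTTP API. A minimal sketch, assuming the Prometheus instance from the GPU monitoring setup is reachable on Node B and is scraping DCGM Exporter (the address and GPU label value are placeholders):

```python
import requests

PROMETHEUS_URL = "http://<node-b-ip>:9090"  # assumption: Prometheus runs on Node B

def gpu_util_for_span(start_ts: float, duration_s: float, gpu_index: str = "0") -> float:
    """Average GPU utilization over [start, start + duration] for one GPU,
    using the DCGM Exporter metric DCGM_FI_DEV_GPU_UTIL."""
    window = max(int(duration_s), 1)
    query = f'avg_over_time(DCGM_FI_DEV_GPU_UTIL{{gpu="{gpu_index}"}}[{window}s])'
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query, "time": start_ts + duration_s},  # evaluate at span end
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

# Example: a span that started at t=1745318400 and ran for 2.4 s on GPU 0.
print(gpu_util_for_span(1745318400, 2.4, gpu_index="0"))
```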

The patterns this reveals:

| Trace signal | GPU signal | Root cause |
|---|---|---|
| High p95 TTFT, normal TBT | Low SM utilization | CPU-bound tokenizer or network I/O |
| High TBT, normal TTFT | SM util > 95% | Compute-bound decode, batch too large |
| High TTFT and TBT | VRAM > 90% used | KV cache eviction under memory pressure |
| Normal latency, high TTFT variance | Power state P1 vs P0 switching | GPU clock throttling under light load |
| Request errors | DCGM_FI_DEV_XID_ERRORS > 0 | Hardware-level fault, isolate node |

For the DCGM Exporter setup and Prometheus alert rules, see the GPU monitoring for ML guide.

Storage and Retention at 10M+ Spans/Day

A vLLM server handling 100 requests/second generates 8.64M requests/day. Each span with prompt and response payloads runs 2-5 KB raw. That is roughly 25 GB/day before compression.

ClickHouse compresses trace data at roughly 10:1, so the actual stored volume is closer to 2.5 GB/day. A tiered retention strategy handles this without blowing up costs:

Hot tier (ClickHouse): 7 days of full spans with prompt/response payloads. Roughly 17.5 GB at 2.5 GB/day compressed. Query latency under 1 second for analytics.

Warm tier (Postgres): Aggregated stats only, no raw payloads, 90 days. Roughly 10 GB. Fast for dashboards and alert queries.

Cold tier (S3-compatible object storage): Compressed raw spans, 1 year. Roughly 1 TB at this volume. Restore on demand for audits.
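These figures fall straight out of multiplying request rate, span size, compression ratio, and retention window, so you can re-run the arithmetic for your own traffic. A minimal sketch using the same assumptions as above (100 req/s, ~3 KB per raw span, 10:1 compression):

```python
# Back-of-the-envelope sizing for the tiered retention plan above.
REQUESTS_PER_SEC = 100
RAW_SPAN_KB = 3            # midpoint of the 2-5 KB range
COMPRESSION_RATIO = 10     # assumed ClickHouse / object-storage compression

spans_per_day = REQUESTS_PER_SEC * 86_400                     # ~8.64M spans/day
raw_gb_per_day = spans_per_day * RAW_SPAN_KB / 1e6            # ~26 GB/day raw
compressed_gb_per_day = raw_gb_per_day / COMPRESSION_RATIO    # ~2.6 GB/day

hot_tier_gb = compressed_gb_per_day * 7                       # ~18 GB for 7 days
cold_tier_gb = compressed_gb_per_day * 365                    # ~950 GB for 1 year

print(f"{spans_per_day / 1e6:.2f}M spans/day, {raw_gb_per_day:.0f} GB/day raw")
print(f"hot tier: {hot_tier_gb:.0f} GB, cold tier: {cold_tier_gb:.0f} GB")
```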

Node sizing for the hot tier: a 4-core CPU node with a 500 GB NVMe SSD is comfortable at a few million spans per day. At 10M+ spans/day, size the ClickHouse node up to 8 cores and 32 GB RAM, still with 500 GB NVMe.

To reduce storage, truncate prompt text at 4K characters and response text at 8K characters. In Langfuse, set LANGFUSE_INGESTION_MAX_REQUEST_BODY_SIZE_MB to limit per-request payload size.
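Truncation is easiest to enforce at the instrumentation layer, before payloads ever leave the inference node. A minimal sketch using the 4K/8K limits suggested above; apply it to whatever fields your SDK attaches as span input and output:

```python
# Truncate payloads client-side so oversized prompts/responses never hit the trace store.
PROMPT_MAX_CHARS = 4_000
RESPONSE_MAX_CHARS = 8_000

def truncate(text: str, limit: int) -> str:
    """Cut text at `limit` characters and mark the cut so it stays visible in traces."""
    if len(text) <= limit:
        return text
    return text[:limit] + f" ...[truncated {len(text) - limit} chars]"

prompt_text = "x" * 10_000                       # stand-in for a real prompt
span_input = truncate(prompt_text, PROMPT_MAX_CHARS)
print(len(span_input))                           # 4,000 chars plus the truncation marker
```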

Privacy and EU AI Act Compliance

Every SaaS observability platform sends your prompt and response data to its servers. Langfuse Cloud, Arize Cloud, and LangSmith all store your inference data in their infrastructure, under their data retention policies, in their jurisdiction. For regulated workloads, that is a hard compliance problem.

Two specific EU AI Act requirements hit directly here:

Article 12 requires high-risk AI systems to keep logs enabling post-hoc monitoring of the system's operation. The logs must be retained and accessible to competent authorities.

Article 50 (Article 52 in earlier drafts) covers transparency obligations for certain AI interactions, requiring that outputs be traceable to the system that generated them.

Self-hosting gives you three things SaaS cannot: verifiable data location, access control audits, and data deletion on demand. These are exactly what compliance auditors ask for. You can produce a list of every person who accessed the trace data, when, and from what IP. You can delete a specific user's data on GDPR request without waiting for a vendor to act.

Helicone's proxy-based architecture deserves a mention here: for teams that cannot modify inference server code, routing all traffic through a self-hosted Helicone proxy gives centralized per-request logging without SDK changes. Every request passes through the proxy, which logs to your Postgres instance.
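From the application side, adopting the proxy is a one-line change to the client's base URL. A hedged sketch, assuming a self-hosted Helicone gateway that you have configured to forward to your vLLM server (the gateway hostname, port, and path are placeholders, not Helicone defaults; Helicone-Auth is Helicone's documented auth header):

```python
from openai import OpenAI

# Route requests through the self-hosted Helicone gateway instead of the inference
# server directly; the gateway logs each request/response, then forwards it upstream.
client = OpenAI(
    base_url="http://helicone.internal:8787/v1",           # placeholder gateway address
    api_key="EMPTY",                                        # vLLM ignores the key by default
    default_headers={"Helicone-Auth": "Bearer <your-helicone-api-key>"},
)

response = client.chat.completions.create(
    model="meta-llama-3-70b",
    messages=[{"role": "user", "content": "Summarize yesterday's incident report."}],
)
print(response.choices[0].message.content)
```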

For a full breakdown of EU AI Act requirements and how to structure your GPU deployment to comply, see our EU AI Act compliance guide.

Spheron Pricing for a Small Observability Stack

The observability layer costs a fraction of what SaaS platforms charge at the same volume.

Observability node (Node B): A 4-core/16 GB CPU node running Langfuse and Postgres handles up to 5M spans/day. Cost is roughly $50-80/month depending on disk size.

High-volume ClickHouse node: An 8-core/32 GB node with 500 GB NVMe SSD handles 10M+ spans/day in the hot tier.

GPU inference node (Node A): GPU pricing depends on the model size and throughput requirements. H100 SXM5 nodes start at $4.41/hr on-demand and L40S nodes start at $0.72/hr on-demand for smaller workloads.

Pricing fluctuates based on GPU availability. The prices above were checked on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

| Setup | Monthly cost | Prompt data location |
|---|---|---|
| Langfuse Cloud (10M events) | ~$800-1500 | Langfuse servers (US) |
| Arize Cloud (10M spans) | ~$600-2000 | Arize servers (US) |
| Self-hosted on Spheron (Langfuse + ClickHouse) | ~$100-200 | Your Spheron nodes |

For GPU inference workloads running alongside the observability stack, H100 GPU rental on Spheron starts at $4.41/hr on-demand.


Self-hosting observability means your prompt and response data stays on infrastructure you control - critical when you're working with regulated data or need audit trails for EU AI Act compliance. Run both the LLM workload and its full observability stack on Spheron without paying SaaS premiums for data you generated.

Rent H100 → | View GPU pricing → | Get started on Spheron →
