Tutorial

LLM Observability on GPU Cloud: Deploy Langfuse, Arize Phoenix, and Helicone for Self-Hosted AI Tracing (2026 Guide)

Written by Mitrasish, Co-founder · Apr 25, 2026
LLM Observability Self Hosted · Langfuse GPU Cloud Deployment · Arize Phoenix Self Host · Helicone Self Host · LLM Tracing Infrastructure · vLLM Observability · SGLang Observability · OpenTelemetry LLM · GPU Cloud · AI Infrastructure

GPU-level failures are loud. When a node runs out of VRAM, hits an XID error, or triggers thermal throttling, your monitoring stack catches it immediately. But the failures that hurt production LLM systems most are silent: a specific prompt template that causes p99 latency to spike, a model version change that subtly degrades output quality, or a token budget that quietly inflates costs by 40% over three weeks. None of these show up in a Grafana GPU dashboard. That gap is what LLM observability fills.

If you haven't set up GPU-level monitoring yet, start with our GPU monitoring for ML guide first - this post assumes DCGM and Prometheus are already running.

What LLM Observability Actually Tracks

LLM observability captures four categories of signal that infrastructure monitoring misses entirely.

Span traces record each request as a structured event: request ID, model name, prompt text, response text, timestamp, and duration. A trace may span multiple hops (router, inference server, post-processor), and the trace tree shows where time was actually spent.

Token metrics track prompt token count, completion token count, and derived cost per request. Over time, you see cost distribution by user tier, prompt template, and model version. A prompt that costs $0.002 in isolation costs $200,000 at 100M calls.

Quality signals include eval scores from LLM-as-judge pipelines, hallucination flags from factual grounding checks, and human annotation labels when you run spot audits. These tell you whether the model is doing the job, not just whether it's responding.

Application context attaches user ID, session ID, environment tag, and version tag to every span. This is what makes debugging fast: you search for a specific session, see every request in order, and find the exact exchange where things went wrong.

GPU monitoring gives you gpu_sm_utilization and gpu_memory_used_bytes. LLM observability gives you p95_ttft broken down by prompt length bucket, model version, and user tier. Tying these signals back to your LLM-as-judge evaluation pipeline is how you catch quality drift before users notice it.
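To make the four categories concrete, here is a minimal sketch of manually instrumenting a single request with the Langfuse Python SDK (v2-style trace/generation API; the host, keys, model name, and score name are placeholders, not values from this guide). Each field maps to one of the signal categories above.

```python
from langfuse import Langfuse

# Placeholders: point at your self-hosted Langfuse instance (Node B) and project keys.
langfuse = Langfuse(
    host="http://<node-b-ip>:3000",
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)

# Application context: user, session, environment, version tags.
trace = langfuse.trace(
    name="chat-completion",
    user_id="user-123",
    session_id="sess-456",
    metadata={"environment": "production", "app_version": "1.4.2"},
)

# Span trace + token metrics for the actual model call.
generation = trace.generation(
    name="llama-3-70b-generate",
    model="meta-llama-3-70b",
    input="Summarize the attached incident report.",
)
generation.end(
    output="The outage was caused by ...",
    usage={"input": 812, "output": 214},  # prompt / completion token counts
)

# Quality signal: attach an eval score, e.g. from an LLM-as-judge pipeline.
langfuse.score(trace_id=trace.id, name="faithfulness", value=0.92)

langfuse.flush()  # make sure events are sent before the process exits
```

In production you would typically rely on the SDK integrations or OpenTelemetry instrumentation described below rather than hand-building every span, but the data model is the same.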

Langfuse vs Arize Phoenix vs Helicone vs LangSmith

| Feature | Langfuse | Arize Phoenix | Helicone | LangSmith |
|---|---|---|---|---|
| Self-host support | Yes (MIT) | Yes (ELv2) | Yes (Apache 2.0) | No (SaaS only) |
| Storage backend | Postgres + ClickHouse | SQLite / PostgreSQL | Postgres + ClickHouse | Hosted |
| SDK support | Python, JS, LangChain, LiteLLM, OpenAI | Python, OpenTelemetry | OpenAI-compatible proxy | LangChain, Python |
| Eval built-ins | Yes (rubric eval, annotation) | Yes (built-in eval templates) | Basic scoring | Yes (datasets, feedback) |
| OpenTelemetry native | Yes | Yes | Proxy-based | No |
| Multi-tenant | Yes | Limited | Yes | Yes |
| UI quality | Production-grade | Analysis-focused | Dashboard-focused | Production-grade |
| EU data residency | Self-hosted (your infra) | Self-hosted (your infra) | Self-hosted (your infra) | No (US-hosted) |
| Active OSS community | High | High | Medium | Proprietary |

For teams running vLLM or SGLang in production with multi-tenant isolation requirements, Langfuse is the strongest choice: mature architecture, broad SDK support, and ClickHouse for high-volume storage. Arize Phoenix suits teams doing heavy offline evaluation and experimentation - the built-in eval templates and local-first analysis UI make it faster to iterate on rubrics. Helicone works well as a drop-in proxy when you cannot modify inference server code to add SDK instrumentation; you route all traffic through a self-hosted Helicone instance and get centralized logging without touching the model server. LangSmith is excluded from self-host consideration since it has no self-hosted deployment option.

Reference Architecture: Tracing Collector and Storage on Spheron

A minimal two-node setup works well for most teams:

Node A (GPU inference): vLLM or SGLang inference server, DCGM Exporter on port 9400, and Prometheus node exporter. This node does the model work; it ships traces and metrics outbound to Node B.

Node B (observability): Langfuse server or Phoenix server, Postgres as the primary store, optional ClickHouse for high-volume trace storage, and Grafana for dashboards. A 4-core CPU node with 16 GB RAM handles this stack comfortably up to 5M spans/day.

Docker Compose for Node B deploying Langfuse with Postgres (minimal setup suitable for small-scale or demo deployments; production Langfuse v3 deployments also require Redis for caching and queuing, and S3-compatible blob storage for event persistence):

```yaml
# docker-compose.yml for observability node (Node B)
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: langfuse
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

  langfuse:
    image: langfuse/langfuse:latest
    depends_on: [postgres]
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:${POSTGRES_PASSWORD}@postgres:5432/langfuse
      NEXTAUTH_SECRET: ${NEXTAUTH_SECRET}
      NEXTAUTH_URL: http://<node-b-ip>:3000
      SALT: ${SALT}

volumes:
  pgdata:
```

For deployments handling 1M+ spans/day, add ClickHouse as a secondary store for analytics queries:

```yaml
# Add to docker-compose.yml for 1M+ spans/day
  clickhouse:
    image: clickhouse/clickhouse-server:24.3
    # No external ports: ClickHouse is only reachable within the Docker network via service name
    environment:
      CLICKHOUSE_USER: ${CLICKHOUSE_USER}
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD}
    volumes:
      - chdata:/var/lib/clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  langfuse-worker:
    image: langfuse/langfuse-worker:latest
    depends_on: [postgres, clickhouse]
    environment:
      DATABASE_URL: postgresql://langfuse:${POSTGRES_PASSWORD}@postgres:5432/langfuse
      CLICKHOUSE_URL: http://${CLICKHOUSE_USER}:${CLICKHOUSE_PASSWORD}@clickhouse:8123
      LANGFUSE_INGESTION_MAX_REQUEST_BODY_SIZE_MB: 4

volumes:
  chdata:
```

Instrumenting vLLM, SGLang, and TGI

vLLM

vLLM 0.4+ emits gen_ai.* semantic convention spans natively via OpenTelemetry. Set three environment variables on the inference node:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://<node-b-ip>:4318
OTEL_SERVICE_NAME=vllm-inference
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,model.name=meta-llama-3-70b
```

For vLLM versions below 0.4, wrap the OpenAI-compatible endpoint with the opentelemetry-instrumentation-openai package instead. See our vLLM production deployment guide for full server configuration.
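A hedged sketch of that client-side approach, assuming the opentelemetry-instrumentation-openai package and a standard OTLP HTTP exporter pointed at Node B (the endpoint, Node A address, and model name are placeholders):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Standard OTel SDK setup: ship spans to the OTLP endpoint on Node B.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://<node-b-ip>:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

# Patch the OpenAI client so every completion call emits a span.
OpenAIInstrumentor().instrument()

# Point the patched client at the vLLM OpenAI-compatible endpoint on Node A.
client = OpenAI(base_url="http://<node-a-ip>:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama-3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
```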

SGLang

SGLang exposes a Prometheus metrics endpoint at /metrics by default when started with --enable-metrics. To unify with trace data, pipe it through an OpenTelemetry Collector configured with a Prometheus receiver:

```yaml
# otel-collector-config.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: sglang
          static_configs:
            - targets: ["localhost:30000"]

exporters:
  otlp:
    endpoint: http://<node-b-ip>:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

For SGLang-specific tuning, refer to the SGLang production deployment guide.

TGI

Text Generation Inference has built-in OpenTelemetry support via the --otlp-endpoint flag:

```bash
docker run --gpus all ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3-70b-instruct \
  --otlp-endpoint http://<node-b-ip>:4318
```

TGI spans include generation latency, queue wait time, and token counts out of the box with no additional instrumentation.

Correlating Traces with GPU Metrics

This is where observability closes the loop. The strategy is straightforward:

  1. DCGM Exporter exposes metrics labeled with gpu_uuid and gpu_index
  2. vLLM and SGLang log a request_id in every span
  3. The correlation key is time range plus GPU index: given a request with start_time=T, duration=D, query DCGM for avg(gpu_sm_utilization)[T:T+D] on the same GPU

In practice you build this correlation in Grafana by templating dashboards that accept a time range from a trace link. Click a slow span in Langfuse, copy the time range, and query DCGM metrics for that window.
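If you want the same lookup programmatically, for automated triage or for attaching GPU context back onto a span, the correlation is a single call to the Prometheus HTTP API. A minimal sketch, assuming the Prometheus instance from the GPU monitoring setup is reachable on Node B and is scraping DCGM Exporter (the address and GPU label value are placeholders):

```python
import requests

PROMETHEUS_URL = "http://<node-b-ip>:9090"  # assumption: Prometheus runs on Node B

def gpu_util_for_span(start_ts: float, duration_s: float, gpu_index: str = "0") -> float:
    """Average GPU utilization over [start, start + duration] for one GPU,
    using the DCGM Exporter metric DCGM_FI_DEV_GPU_UTIL."""
    window = max(int(duration_s), 1)
    query = f'avg_over_time(DCGM_FI_DEV_GPU_UTIL{{gpu="{gpu_index}"}}[{window}s])'
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query, "time": start_ts + duration_s},  # evaluate at span end
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

# Example: a span that started at t=1745318400 and ran for 2.4 s on GPU 0.
print(gpu_util_for_span(1745318400, 2.4, gpu_index="0"))
```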

The patterns this reveals:

| Trace signal | GPU signal | Root cause |
|---|---|---|
| High p95 TTFT, normal TBT | Low SM utilization | CPU-bound tokenizer or network I/O |
| High TBT, normal TTFT | SM util > 95% | Compute-bound decode, batch too large |
| High TTFT and TBT | VRAM > 90% used | KV cache eviction under memory pressure |
| Normal latency, high TTFT variance | Power state P1 vs P0 switching | GPU clock throttling under light load |
| Request errors | DCGM_FI_DEV_XID_ERRORS > 0 | Hardware-level fault, isolate node |

For the DCGM Exporter setup and Prometheus alert rules, see the GPU monitoring for ML guide.

Storage and Retention at 10M+ Spans/Day

A vLLM server handling 100 requests/second generates 8.64M requests/day. Each span with prompt and response payloads runs 2-5 KB raw. That is roughly 25 GB/day before compression.

ClickHouse compresses trace data at roughly 10:1, so the actual stored volume is closer to 2.5 GB/day. A tiered retention strategy handles this without blowing up costs:

Hot tier (ClickHouse): 7 days of full spans with prompt/response payloads. Roughly 17.5 GB at 2.5 GB/day compressed. Query latency under 1 second for analytics.

Warm tier (Postgres): Aggregated stats only, no raw payloads, 90 days. Roughly 10 GB. Fast for dashboards and alert queries.

Cold tier (S3-compatible object storage): Compressed raw spans, 1 year. Roughly 1 TB at this volume. Restore on demand for audits.
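These figures fall straight out of multiplying request rate, span size, compression ratio, and retention window, so you can re-run the arithmetic for your own traffic. A minimal sketch using the same assumptions as above (100 req/s, ~3 KB per raw span, 10:1 compression):

```python
# Back-of-the-envelope sizing for the tiered retention plan above.
REQUESTS_PER_SEC = 100
RAW_SPAN_KB = 3            # midpoint of the 2-5 KB range
COMPRESSION_RATIO = 10     # assumed ClickHouse / object-storage compression

spans_per_day = REQUESTS_PER_SEC * 86_400                     # ~8.64M spans/day
raw_gb_per_day = spans_per_day * RAW_SPAN_KB / 1e6            # ~26 GB/day raw
compressed_gb_per_day = raw_gb_per_day / COMPRESSION_RATIO    # ~2.6 GB/day

hot_tier_gb = compressed_gb_per_day * 7                       # ~18 GB for 7 days
cold_tier_gb = compressed_gb_per_day * 365                    # ~950 GB for 1 year

print(f"{spans_per_day / 1e6:.2f}M spans/day, {raw_gb_per_day:.0f} GB/day raw")
print(f"hot tier: {hot_tier_gb:.0f} GB, cold tier: {cold_tier_gb:.0f} GB")
```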

Node sizing for the hot tier: a 4-core CPU node with a 500 GB NVMe SSD is comfortable at a few million spans per day. At 10M+ spans/day, size the ClickHouse node up to 8 cores and 32 GB RAM, still with 500 GB NVMe.

To reduce storage, truncate prompt text at 4K characters and response text at 8K characters. In Langfuse, set LANGFUSE_INGESTION_MAX_REQUEST_BODY_SIZE_MB to limit per-request payload size.
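Truncation is easiest to enforce at the instrumentation layer, before payloads ever leave the inference node. A minimal sketch using the 4K/8K limits suggested above; apply it to whatever fields your SDK attaches as span input and output:

```python
# Truncate payloads client-side so oversized prompts/responses never hit the trace store.
PROMPT_MAX_CHARS = 4_000
RESPONSE_MAX_CHARS = 8_000

def truncate(text: str, limit: int) -> str:
    """Cut text at `limit` characters and mark the cut so it stays visible in traces."""
    if len(text) <= limit:
        return text
    return text[:limit] + f" ...[truncated {len(text) - limit} chars]"

prompt_text = "x" * 10_000                       # stand-in for a real prompt
span_input = truncate(prompt_text, PROMPT_MAX_CHARS)
print(len(span_input))                           # 4,000 chars plus the truncation marker
```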

Privacy and EU AI Act Compliance

Every SaaS observability platform sends your prompt and response data to its servers. Langfuse Cloud, Arize Cloud, and LangSmith all store your inference data in their infrastructure, under their data retention policies, in their jurisdiction. For regulated workloads, that is a hard compliance problem.

Two specific EU AI Act requirements hit directly here:

Article 12 requires high-risk AI systems to keep logs enabling post-hoc monitoring of the system's operation. The logs must be retained and accessible to competent authorities.

Article 50 (Article 52 in earlier drafts) covers transparency obligations for certain AI interactions, requiring that outputs be traceable to the system that generated them.

Self-hosting gives you three things SaaS cannot: verifiable data location, access control audits, and data deletion on demand. These are exactly what compliance auditors ask for. You can produce a list of every person who accessed the trace data, when, and from what IP. You can delete a specific user's data on GDPR request without waiting for a vendor to act.

Helicone's proxy-based architecture deserves a mention here: for teams that cannot modify inference server code, routing all traffic through a self-hosted Helicone proxy gives centralized per-request logging without SDK changes. Every request passes through the proxy, which logs to your Postgres instance.
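From the application side, adopting the proxy is a one-line change to the client's base URL. A hedged sketch, assuming a self-hosted Helicone gateway that you have configured to forward to your vLLM server (the gateway hostname, port, and path are placeholders, not Helicone defaults; Helicone-Auth is Helicone's documented auth header):

```python
from openai import OpenAI

# Route requests through the self-hosted Helicone gateway instead of the inference
# server directly; the gateway logs each request/response, then forwards it upstream.
client = OpenAI(
    base_url="http://helicone.internal:8787/v1",           # placeholder gateway address
    api_key="EMPTY",                                        # vLLM ignores the key by default
    default_headers={"Helicone-Auth": "Bearer <your-helicone-api-key>"},
)

response = client.chat.completions.create(
    model="meta-llama-3-70b",
    messages=[{"role": "user", "content": "Summarize yesterday's incident report."}],
)
print(response.choices[0].message.content)
```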

For a full breakdown of EU AI Act requirements and how to structure your GPU deployment to comply, see our EU AI Act compliance guide.

Spheron Pricing for a Small Observability Stack

The observability layer costs a fraction of what SaaS platforms charge at the same volume.

Observability node (Node B): A 4-core/16 GB CPU node running Langfuse and Postgres handles up to 5M spans/day. Cost is roughly $50-80/month depending on disk size.

High-volume ClickHouse node: An 8-core/32 GB node with 500 GB NVMe SSD handles 10M+ spans/day in the hot tier.

GPU inference node (Node A): GPU pricing depends on the model size and throughput requirements. H100 SXM5 nodes start at $4.41/hr on-demand and L40S nodes start at $0.72/hr on-demand for smaller workloads.

Pricing fluctuates based on GPU availability. The prices above were checked on 22 Apr 2026 and may have changed. Check current GPU pricing → for live rates.

| Setup | Monthly cost | Prompt data location |
|---|---|---|
| Langfuse Cloud (10M events) | ~$800-1500 | Langfuse servers (US) |
| Arize Cloud (10M spans) | ~$600-2000 | Arize servers (US) |
| Self-hosted on Spheron (Langfuse + ClickHouse) | ~$100-200 | Your Spheron nodes |

For GPU inference workloads running alongside the observability stack, H100 GPU rental on Spheron starts at $4.41/hr on-demand.


Self-hosting observability means your prompt and response data stays on infrastructure you control - critical when you're working with regulated data or need audit trails for EU AI Act compliance. Run both the LLM workload and its full observability stack on Spheron without paying SaaS premiums for data you generated.

Rent H100 → | View GPU pricing → | Get started on Spheron →
