Langfuse: Self-Host LLM Observability for Free — 2026 Guide

Deploy Langfuse free with Docker Compose. Open-source LLM observability covering traces, evals, prompt management, and Kubernetes scaling.

Effloow Content Factory
#llm-observability #langfuse #self-hosting #docker #tracing #open-source #clickhouse #kubernetes

When you deploy a traditional web application, you instrument it with metrics, traces, and logs. You watch latency, error rates, and throughput. This is basic engineering hygiene. Yet the majority of teams shipping LLM-powered features in 2026 do it completely blind — no visibility into prompt quality, model latency, token costs, or whether their RAG pipeline is actually retrieving the right context.

The consequences are predictable. A hallucination goes undetected in production for weeks. A prompt regression ships after a "minor refactor." Costs spike when a downstream API change causes token counts to balloon. LLM observability bridges this gap: it gives you the same operational confidence for AI systems that APM tools have given us for traditional software.

This is why Langfuse — acquired by ClickHouse in January 2026 alongside a $400M Series D at a $15B valuation — has become the de facto open-source standard for LLM engineering platforms. With over 23,000 GitHub stars, 26 million SDK installs per month, and adoption across 63 Fortune 500 companies, it is the tool that teams reach for first when they need production-grade LLM observability without vendor lock-in.

This guide covers everything you need to deploy and use Langfuse: architecture, Docker Compose setup, Kubernetes production deployment, evaluations, and how it stacks up against the leading alternatives.

What Is Langfuse?

Langfuse is an open-source LLM engineering platform that provides four core capabilities: observability (traces), evaluations, prompt management, and datasets. Its MIT-licensed core means you can run it entirely on your own infrastructure with full data ownership.

The project was founded in 2023 by Clemens Rawert, Max Deichmann, and Marc Klingen — backed by Lightspeed Ventures, General Catalyst, and Y Combinator (W23). Its acquisition by ClickHouse in January 2026 was a strategic fit: Langfuse's entire production analytics stack was already built on ClickHouse for time-series trace storage at scale.

Marc Klingen, CEO of Langfuse, described the rationale: "We built Langfuse on ClickHouse because LLM observability and evaluation is fundamentally a data problem. Now, as one team, we can deliver a tighter end-to-end product: faster ingestion, deeper evaluation, and a shorter path from a production issue to a measurable improvement."

The Four Pillars of Langfuse

Traces and Observations

Traces are the foundation of everything in Langfuse. Every LLM call, retrieval step, tool invocation, or custom logic block becomes a structured, searchable log entry.

Langfuse organizes data into three layers:

  • Traces: The top-level unit. One user request or agent run = one trace.
  • Observations: Individual steps within a trace. Specific subtypes include generation (LLM call), span (arbitrary code block), and retrieval (RAG fetch).
  • Sessions: Groups of related traces from one user session — essential for multi-turn chat applications.

Every observation captures: the exact prompt sent, model response, token counts (prompt / completion / total), latency, model name, and calculated cost. The Langfuse UI renders traces as a tree with per-node timestamps and shareable trace URLs for team debugging.
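To make the three-layer hierarchy concrete, here is a minimal illustrative data model. This is a sketch for intuition only, not the actual Langfuse schema; field names are simplified:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the trace hierarchy -- not the real Langfuse schema.
@dataclass
class Observation:
    type: str               # "generation" (LLM call), "span", etc.
    name: str
    prompt_tokens: int = 0
    completion_tokens: int = 0

@dataclass
class Trace:
    name: str
    session_id: str = ""    # groups related traces into a Session
    observations: list = field(default_factory=list)

trace = Trace(name="answer-question", session_id="chat-123")
trace.observations.append(Observation(type="span", name="retrieve-context"))
trace.observations.append(Observation(
    type="generation", name="llm-call",
    prompt_tokens=512, completion_tokens=128,
))
total_tokens = sum(o.prompt_tokens + o.completion_tokens for o in trace.observations)
print(total_tokens)  # 640
```

One request produces one trace; each step inside it is an observation, and the session ID ties multi-turn conversations together.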

As of January 2026, Langfuse's ingestion pipeline is OpenTelemetry-native. Roughly 60% of all observations on Langfuse Cloud now arrive via the OTel endpoint — meaning LangChain and LlamaIndex traces require no proprietary instrumentation at all.

Evaluations (LLM-as-a-Judge)

Evaluations answer the question: is my LLM application actually producing good outputs?

Langfuse supports three evaluation modes:

  1. LLM-as-a-judge: A capable judge model (GPT-4o, Claude Opus, etc.) scores outputs against criteria you define — faithfulness, relevance, helpfulness, toxicity, and more.
  2. Human annotation: Annotation Queues let you route specific traces to human reviewers. Their scores are stored alongside automated evaluations.
  3. Custom evaluation pipelines: Push numeric, boolean, or categorical evaluation scores via the REST API or SDK from any external evaluator.

Evaluations can target three data scopes: individual observations, full traces, or controlled experiment datasets. The asynchronous architecture processes thousands of evaluations per minute without adding latency to your live application.
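As a sketch of the third mode, an external evaluator can push a score to the public scores endpoint. The trace ID and score values below are hypothetical, and the exact field names should be verified against your Langfuse version's API reference:

```python
import json
import urllib.request

# Hedged sketch: push a custom evaluation score via the public REST API.
# Verify endpoint path and field names against your API reference.
payload = {
    "traceId": "trace-abc123",   # hypothetical trace ID
    "name": "faithfulness",
    "value": 0.87,               # numeric score from your own evaluator
    "comment": "judged by external pipeline",
}
body = json.dumps(payload).encode()
req = urllib.request.Request(
    "http://localhost:3000/api/public/scores",
    data=body,
    headers={"Content-Type": "application/json"},  # add Basic auth in practice
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment to send against a running instance
print(payload["name"])
```

Because scores are just named numeric, boolean, or categorical values attached to a trace, any evaluator you already run can feed them in.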

Prompt Management

Prompt management in Langfuse is version-controlled, API-driven, and environment-aware. You create, edit, and deploy prompts via the UI, SDK, or API. Labels like production, staging, and dev let you run multiple prompt versions simultaneously and promote them without code changes. Client-side caching ensures prompt pulls add near-zero latency overhead to your application.

The critical integration is linking prompts to traces. Every trace generated from a managed prompt carries the version ID, so you can directly correlate a prompt change with downstream quality metrics. This is how you answer the question: "did this prompt update actually improve answer quality?"
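The label mechanics are easiest to see in a toy model. The sketch below is an illustration of the concept, not the Langfuse SDK: versions are immutable, and labels are movable pointers you promote without touching code:

```python
# Conceptual sketch of label-based prompt deployment -- illustration only.
versions = {
    1: "Answer the question concisely.",
    2: "Answer the question concisely, citing your sources.",
}
labels = {"production": 1, "staging": 2}

def get_prompt(label: str) -> str:
    """Resolve a label to whichever prompt version it currently points at."""
    return versions[labels[label]]

assert get_prompt("production") == versions[1]
labels["production"] = 2   # promote v2 to production, no code change
assert get_prompt("production") == versions[2]
```

In Langfuse itself, each fetched prompt also carries its version ID, which is what makes the prompt-to-trace correlation described above possible.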

Datasets and Experiments

Datasets are curated collections of inputs — and optionally expected outputs — used for systematic regression testing. You can build a dataset from production traces, run your application against it across multiple model or prompt configurations, and compare quality scores side-by-side in the Langfuse UI. This closes the feedback loop from "production quality regression detected" to "root cause found and fixed."
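The side-by-side comparison boils down to aggregating evaluation scores per configuration. A minimal sketch with made-up scores:

```python
from statistics import mean

# Illustrative sketch: compare evaluation scores for two prompt
# configurations run over the same dataset (scores are invented).
runs = {
    "prompt-v1": [0.72, 0.68, 0.80, 0.75],
    "prompt-v2": [0.81, 0.79, 0.85, 0.83],
}
averages = {config: mean(scores) for config, scores in runs.items()}
best = max(averages, key=averages.get)
print(best)  # prompt-v2
```

The Langfuse UI does this aggregation for you across every score type in the experiment, but the underlying question is exactly this: which configuration scores higher on the same inputs.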

Langfuse v3 Architecture

Langfuse v3, released stable in December 2024 and iterated through 2026, runs as six containers:

| Container | Role |
| --- | --- |
| web | Next.js UI + REST API |
| worker | Async event processor |
| postgresql | Metadata, users, projects, prompts |
| clickhouse | Traces, observations, scores (analytics engine) |
| redis | Event ingestion queue |
| minio | Blob/object storage for large payloads |

The key architectural decision in v3 was migrating traces, observations, and scores from PostgreSQL to ClickHouse. The Langfuse team observed severe PostgreSQL bottlenecks when dealing with millions of rows of tracing data — both on ingestion and retrieval. ClickHouse's columnar storage and vectorized query execution solve this. A 2026 optimization in trace-level attribute handling cut S3/blob storage costs by approximately 85% for some self-hosters.

Events flow as follows: SDK → REST API or OTel endpoint → validation + optional masking → Redis queue → async worker → ClickHouse + MinIO storage → UI query layer.
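The essential property of this flow is that the API does no heavy work inline: it validates, enqueues, and returns, while a worker drains the queue into analytics storage. A conceptual simulation (a plain in-process queue stands in for Redis, a list for ClickHouse; this is not Langfuse code):

```python
import queue

# Conceptual simulation of the ingestion flow: the API enqueues raw
# events (Redis in the real stack), a worker drains them into analytics
# storage (ClickHouse in the real stack). Pure illustration.
event_queue = queue.Queue()
clickhouse_rows = []

def ingest(event: dict) -> None:
    """API side: validate, optionally mask, enqueue -- no heavy work inline."""
    if "trace_id" not in event:
        raise ValueError("invalid event")
    event_queue.put(event)

def drain_worker() -> None:
    """Worker side: batch events out of the queue into columnar storage."""
    while not event_queue.empty():
        clickhouse_rows.append(event_queue.get())

ingest({"trace_id": "t1", "type": "generation"})
ingest({"trace_id": "t1", "type": "span"})
drain_worker()
print(len(clickhouse_rows))  # 2
```

Decoupling ingestion from persistence is also why the worker, not the web container, is the component you scale when throughput grows.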

Docker Compose Quick Start

System requirements: 4+ CPU cores, 16 GiB RAM minimum. An AWS t3.xlarge or equivalent is the recommended starting point. Provision at least 100 GiB of storage for sustained trace volume.

# Clone the Langfuse repository
git clone https://github.com/langfuse/langfuse.git
cd langfuse

# Start all six containers
docker compose up -d

# Verify all containers are running
docker compose ps

Once running, open http://localhost:3000, create a project, and copy your LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY.

Instrument your first LLM call — Python drop-in for OpenAI SDK:

pip install langfuse openai

from langfuse.openai import openai  # Drop-in replacement — no other changes needed

# Set these environment variables:
# LANGFUSE_SECRET_KEY=sk-lf-...
# LANGFUSE_PUBLIC_KEY=pk-lf-...
# LANGFUSE_HOST=http://localhost:3000

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is LLM observability?"}],
    name="my-first-trace",
)
print(response.choices[0].message.content)

Navigate to the Traces view at http://localhost:3000. Your first trace appears within seconds with the full prompt, response, token count, latency, and cost estimate.

LangChain users — add the callback handler:

from langfuse.callback import CallbackHandler  # import path may differ by SDK version

handler = CallbackHandler()
chain.invoke(  # `chain` is any LangChain runnable
    {"input": "Explain Langfuse"},
    config={"callbacks": [handler]}
)

OpenTelemetry (no SDK dependency):

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://localhost:3000/api/public/otel/v1/traces",
    headers={"Authorization": "Basic <base64(pk:sk)>"},
)
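The Basic auth token in the header above is the base64 encoding of "publicKey:secretKey". Computing it (the keys below are placeholders, use your own pk-lf-/sk-lf- pair):

```python
import base64

# Build the Basic auth header for the OTel endpoint from the project keys.
# Keys below are placeholders.
public_key = "pk-lf-example"
secret_key = "sk-lf-example"
token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}
print(headers["Authorization"])
```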

When to use Docker Compose vs Kubernetes: Docker Compose is ideal for development, staging, and low-traffic production (dozens of events/second). It lacks high-availability and horizontal scaling. For anything mission-critical, deploy on Kubernetes.

Production Deployment with Kubernetes

Langfuse maintains an official Helm chart via the langfuse/langfuse-k8s repository.

# Add the Helm repo
helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm repo update

# Deploy with a custom values.yaml
helm install langfuse langfuse/langfuse \
  -f values.yaml \
  -n langfuse \
  --create-namespace

Minimal values.yaml for production:

langfuse:
  nextauth:
    secret: "your-nextauth-secret-32-chars-minimum"
  salt: "your-salt-value"

postgresql:
  enabled: true
  auth:
    password: "strong-postgres-password"

clickhouse:
  enabled: true

redis:
  enabled: true

blob:
  provider: s3          # Prefer managed S3 over MinIO in production
  s3:
    bucket: "langfuse-traces"
    region: "us-east-1"
    accessKeyId: "AKIAXXXXXXXX"
    secretAccessKey: "your-secret"

Monitor deployment progress:

kubectl get pods -n langfuse -w

For high availability, configure replica counts for the web and worker deployments, and point the chart to external managed databases — Amazon RDS for PostgreSQL and ClickHouse Cloud for the analytics tier. This eliminates single points of failure in the storage layer.

Common Mistakes to Avoid

Running Docker Compose in production at scale. It works for a team of two at 50k events per month. When you hit hundreds of events per second, ClickHouse and the worker need independent resource allocation. Switch to Kubernetes before you need to.

Not sampling in high-volume environments. Tracing 100% of requests is fine during development, but at millions of daily requests the storage costs add up. Configure sampling rates before hitting production scale.
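A common sampling approach is deterministic hashing of the trace ID, which keeps every observation of a sampled trace together. The sketch below illustrates the idea; it is not the Langfuse SDK's implementation, and you should check your SDK version for its built-in sample-rate option:

```python
import hashlib

# Illustrative deterministic sampler: hashing the trace ID means a trace
# is either fully kept or fully dropped. Not Langfuse SDK code.
def keep_trace(trace_id: str, sample_rate: float) -> bool:
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(kept)  # roughly 1,000 of 10,000 traces at a 10% rate
```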

Skipping prompt-to-trace linkage. Prompt management is only valuable when prompts are linked to traces. Pass langfuse_prompt=my_prompt on every instrumented call from day one. The LangChain callback handler does this automatically.

Setting up evaluations after a quality incident. LLM-as-a-judge is most valuable when it runs continuously on production traces. Define your evaluation criteria and configure automated evaluators before you ship, not after something breaks.

Ignoring cost dashboards. Langfuse automatically calculates costs for all major providers. Cost-per-trace spikes are often the earliest signal that a prompt change or retrieval modification has introduced unexpected token inflation.

Langfuse vs. The Alternatives

The LLM observability space has consolidated in 2026 around several clear options. Here is a direct comparison of the most commonly evaluated tools:

| Feature | Langfuse | LangSmith | Arize Phoenix | Datadog LLM |
| --- | --- | --- | --- | --- |
| License | MIT (core) | Proprietary | Apache 2.0 | Proprietary |
| Self-hosting | Docker / Helm | Limited | Local-first | Cloud only |
| Free tier | 50k events/mo | 5k traces/mo | Local unlimited | None |
| LLM-as-judge evals | Yes | Yes | Yes | Basic |
| Prompt management | Yes (versioned) | Yes | No | No |
| OpenTelemetry | Native (60%+ traffic) | Partial | Native | Via APM only |
| Data ownership | Full (self-host) | Vendor cloud | Full (local) | Vendor cloud |
| Pro tier pricing | $199/mo | ~$39/user/mo | $50k–100k/yr | $8/10k req |
| Best for | Full-stack self-hosters | LangChain-native teams | ML research / local debug | Existing Datadog users |

Langfuse and Arize Phoenix are the strongest choices when data ownership and self-hosting are hard requirements. LangSmith is a natural fit for teams deeply invested in the LangChain ecosystem who prefer managed infrastructure. Datadog LLM Monitoring is worth considering only if your organization already runs Datadog APM at scale — it is not a purpose-built LLM observability solution.

Pricing Breakdown

Langfuse Cloud pricing combines a subscription fee with usage-based events. Each trace, observation, or score logged counts as one billable unit.

| Plan | Price | Included Events | Users | Data Retention |
| --- | --- | --- | --- | --- |
| Hobby | Free | 50,000/mo | 2 | 30 days |
| Core | $29/mo | 100k+ | Unlimited | 1 year |
| Pro | $199/mo | 1M+ | Unlimited | 3 years |
| Enterprise | $2,499/mo | Custom | Custom | Custom |

Overage across all paid plans: $8 per 100,000 additional units (volume discounts apply at scale). The Pro plan adds SOC 2, ISO 27001, and HIPAA compliance certifications — relevant for healthcare and fintech use cases.
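A back-of-envelope estimate makes the overage math concrete. The sketch assumes the Pro plan includes exactly 1M units and ignores volume discounts:

```python
# Rough Pro-plan monthly bill: $199 base, 1M units included (assumed),
# $8 per 100,000 additional units, before any volume discounts.
def pro_monthly_cost(units: int, base: float = 199.0,
                     included: int = 1_000_000) -> float:
    overage_units = max(0, units - included)
    return base + (overage_units / 100_000) * 8.0

print(pro_monthly_cost(1_500_000))  # 199 + 5 * 8 = 239.0
```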

For self-hosted deployments, the MIT license covers the full feature set at zero licensing cost. Infrastructure costs depend on your trace volume, but a basic t3.xlarge (approximately $150/month on AWS) can handle tens of millions of events per month.

FAQ

Q: Is Langfuse really free to self-host with no feature limits?

Yes. The MIT license applies to the complete Langfuse codebase. There are no feature gates, paywalls, or telemetry requirements on self-hosted deployments. Traces, evaluations, prompt management, datasets, and the playground are all included. You pay only for the compute and storage costs of running your infrastructure.

Q: Does the ClickHouse acquisition change the open-source commitment?

No. ClickHouse has publicly committed to maintaining Langfuse as an open-source project, with the MIT license intact. The founders stated in their announcement: "Open-source is Langfuse's distribution strategy, not a product tier." The acquisition's goal is deeper integration between Langfuse's analytics layer and ClickHouse Cloud — not platform lock-in.

Q: How does Langfuse handle PII in LLM traces?

Langfuse supports configurable masking rules at the SDK level. You can redact sensitive fields from prompts and responses before data reaches storage — before it leaves your application process. For self-hosted deployments, trace data never transits third-party infrastructure. Both approaches can satisfy GDPR and HIPAA requirements when properly configured.
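A masking rule is ultimately just a function applied to event payloads before they are sent. The sketch below shows one such redaction function; attaching it to the client via an SDK-level masking hook is something to verify against your SDK version's documentation:

```python
import re

# Sketch of a masking rule: redact email addresses before any event
# leaves the process. How this hooks into the SDK client is an
# assumption to verify against the masking documentation.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask(data):
    if isinstance(data, str):
        return EMAIL.sub("[REDACTED_EMAIL]", data)
    return data

masked = mask("Contact alice@example.com for access")
print(masked)  # Contact [REDACTED_EMAIL] for access
```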

Q: Can Langfuse trace LLMs other than OpenAI?

Yes — Langfuse is model-agnostic. The OpenTelemetry endpoint accepts traces from any SDK. The official LiteLLM integration gives you unified trace ingestion across 140+ providers through a single proxy, automatically capturing costs for each provider in Langfuse's cost dashboards.

Q: How many events per second can Docker Compose handle?

The recommended Docker Compose configuration handles dozens of events per second reliably under normal load. For workloads at hundreds of events per second, the Kubernetes deployment with dedicated ClickHouse and PostgreSQL instances is required. The async worker is horizontally scalable — increase its replica count in values.yaml to match your ingestion throughput.
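A hedged sketch of what that replica configuration might look like in values.yaml. The key names below are assumptions, so verify them against the langfuse/langfuse-k8s chart's values reference for your chart version:

```yaml
# Key names are assumptions -- check the langfuse-k8s values reference.
langfuse:
  web:
    replicas: 2     # UI/API availability
  worker:
    replicas: 4     # scale with ingestion throughput
```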

Key Takeaways

  • LLM observability is operational hygiene, not a nice-to-have. Production LLM applications without traces and evaluations are flying blind.
  • Langfuse v3 architecture — six containers using ClickHouse for analytics — is the same stack used in its cloud offering. Self-hosted performance is on par with the managed service.
  • Self-hosting is genuinely free under MIT. Docker Compose runs in five minutes locally; Kubernetes with the official Helm chart is the production path.
  • The ClickHouse acquisition in January 2026 deepens the analytics infrastructure investment. The platform's open-source commitment is unchanged.
  • Start with traces, add evaluations. Instrument first. Configure LLM-as-a-judge evaluators once you have a thousand or more traces and understand where quality problems actually occur.
  • Link prompts to traces from day one. This single habit is what separates teams that systematically improve LLM quality from teams that iterate blindly.

Bottom Line

Langfuse is the right choice for any team that needs production-grade LLM observability without vendor lock-in. Self-host it free under MIT with full data ownership, or use the cloud tier for a zero-ops managed experience. With ClickHouse's backing, a rapidly growing ecosystem, and the most complete self-hosted feature set in the market, this is the platform that serious AI engineering teams are standardizing on in 2026.
