OpenTelemetry in Production: Observability Cost Reality

In 2019, when OpenCensus and OpenTracing merged to form OpenTelemetry under the CNCF umbrella, the pitch was simple: one vendor-neutral SDK for distributed tracing, and later for metrics and logs. By eliminating proprietary instrumentation lock-in, teams could swap backends without re-instrumenting their entire application. That promise has largely been kept. The problem is what happens once it is.

OpenTelemetry's tracing specification reached stable in 2021. The metrics spec followed in 2023. As of early 2025, OTel is the second-most active CNCF project by contributor count, behind Kubernetes. Every major observability backend, including Datadog, New Relic, Honeycomb, Grafana, Dynatrace, and Amazon CloudWatch, now accepts OTel data natively. The CNCF's 2024 survey found that 44% of organizations have OTel in production, up from 27% the year before. This is genuine adoption velocity, not survey noise.

What those numbers obscure is the operational reality teams hit when they turn OTel on in a serious production environment: the data volume is enormous, and the cost of storing, querying, and alerting on it grows faster than the system does.

The Cardinality Problem in Concrete Terms

Cardinality, in the context of observability, is the number of unique combinations of metric name and attribute values, each of which your metrics system must track as a separate time series. A single HTTP endpoint instrumented with OTel might emit a span for each request with attributes like http.method, http.status_code, http.route, http.target, and a custom user_id tag added by the developer. The moment you include user_id in a metric (not just in a trace), your cardinality explodes: instead of a handful of time series for that endpoint, you now have at least one time series per user who ever hit it.
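To make the multiplication concrete, here is a minimal sketch of the arithmetic. The label counts (4 methods, 6 status codes, 20 routes, 50,000 users) are hypothetical numbers chosen for illustration, and the result is an upper bound, since not every combination occurs in practice.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label cardinalities for a single request-duration metric.
bounded = {"http.method": 4, "http.status_code": 6, "http.route": 20}
with_user = {**bounded, "user_id": 50_000}

print(series_count(bounded))    # 480
print(series_count(with_user))  # 24000000
```

One unbounded label turns a few hundred series into tens of millions, which is why the same attribute is harmless on a span and expensive on a metric.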

This is not a theoretical concern. Prometheus, the most widely used open-source metrics backend, begins to experience serious performance degradation at around 10 million active time series. Teams that casually instrument request-level user IDs, session tokens, or IP addresses into their metrics can hit that limit with a single moderately trafficked service. Thanos and VictoriaMetrics, both designed to extend Prometheus at scale, can push that ceiling significantly higher, but neither changes the underlying economics: storage and query compute still grow with the number of active series, so every unbounded attribute multiplies the cost.

With Datadog, which charges primarily by the number of custom metrics and ingested log volume, cardinality management is a recurring theme in customer support escalations. Teams frequently discover that the first month after enabling full OTel instrumentation brings a 3x to 10x spike in their observability bill, driven almost entirely by high-cardinality attributes that nobody thought to restrict.

Sampling: The Tool That Changes What You Can Know

The canonical solution to data volume is sampling — capturing only a fraction of traces rather than all of them. OpenTelemetry supports two broad approaches: head-based sampling (decided at the start of a request) and tail-based sampling (decided after the request completes, based on outcome).

Head-based sampling is cheap and predictable. If you sample 10% of requests, you store 10% of the data. The downside is that you don't know at sampling time whether a request will be slow, fail, or trigger a rare code path. The 90% you drop will contain some of the most interesting traces.
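The trade-off can be seen in a small simulation. This is a sketch of a deterministic ratio sampler that decides from the trace ID alone, similar in spirit to ratio-based samplers but not the SDK's implementation. The function name `head_sample`, the 1% failure rate, and the synthetic traffic are all assumptions made for illustration.

```python
import random


def head_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.

    The decision depends only on the trace ID, so every service that sees
    the same ID makes the same choice without coordination.
    """
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


rng = random.Random(7)  # fixed seed so the run is repeatable
# Each tuple is (trace_id, request_failed); about 1% of requests fail.
traces = [(rng.getrandbits(128), rng.random() < 0.01) for _ in range(100_000)]

kept = [(tid, failed) for tid, failed in traces if head_sample(tid, 0.10)]
errors_total = sum(failed for _, failed in traces)
errors_kept = sum(failed for _, failed in kept)

print(f"kept {len(kept) / len(traces):.1%} of all traces")
print(f"kept {errors_kept} of {errors_total} failed traces")
```

Because the sampler never sees the outcome, failed requests are dropped at roughly the same rate as healthy ones.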

Tail-based sampling solves this by collecting all spans for a trace and then deciding whether to keep the trace based on its characteristics: error status, latency threshold, specific attribute values. The OpenTelemetry Collector supports tail-based sampling via the tail_sampling processor (the tailsamplingprocessor component in the Collector's contrib distribution). The practical catch: tail-based sampling requires buffering all spans for a trace before making the keep/drop decision, which means every span of a trace must reach the same collector instance and that state must be maintained across a potentially distributed collector fleet. Companies like Honeycomb have built dedicated tools (Refinery) around this problem. It's solvable, but it adds operational complexity that teams often underestimate.
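The two moving parts described above, buffering by trace ID and deciding once the trace is complete, can be sketched as follows. This is an illustration under stated assumptions, not the Collector's implementation: the `Span` class, `keep_trace`, `collector_for`, the 500 ms threshold, and the sample spans are all hypothetical.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans: list[Span], latency_ms: float = 500.0) -> bool:
    """Keep the whole trace if any span errored or any span was slow."""
    return (any(s.is_error for s in spans)
            or max(s.duration_ms for s in spans) > latency_ms)


def collector_for(trace_id: str, n_collectors: int) -> int:
    """Route by trace ID so all spans of one trace reach the same collector."""
    return zlib.crc32(trace_id.encode()) % n_collectors


spans = [
    Span("a1", "GET /cart", 40.0),
    Span("a1", "SELECT cart", 12.0),
    Span("b2", "POST /checkout", 80.0),
    Span("b2", "charge card", 65.0, is_error=True),
    Span("c3", "GET /search", 900.0),
]

# Buffer every span until its trace can be judged as a whole.
buffer: dict[str, list[Span]] = defaultdict(list)
for span in spans:
    buffer[span.trace_id].append(span)

kept = sorted(tid for tid, trace in buffer.items() if keep_trace(trace))
print(kept)  # ['b2', 'c3']
```

The buffer is the expensive part: it has to hold every span until the trace is judged complete, and `collector_for` hints at why a fleet needs trace-aware routing in front of it.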

The Logs Specification and Its Current Gaps

OpenTelemetry's logs specification reached stable in late 2023, but the log-to-trace correlation story — the ability to automatically link a log line to the trace that produced it — is still inconsistent across languages. The Go SDK, Java SDK, and Python SDK all handle log correlation differently, and the behavior depends heavily on which logging library you're bridging (Logback, Zap, structlog, etc.).
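What correlation amounts to, independent of any particular SDK bridge, is stamping each log record with the active trace and span IDs. The sketch below uses only Python's standard logging module. The `current_trace` context variable is a hypothetical stand-in for the span context an SDK would maintain, and the IDs are example values.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context an SDK would maintain.
current_trace = contextvars.ContextVar(
    "current_trace", default=("0" * 32, "0" * 16)
)


class TraceContextFilter(logging.Filter):
    """Copy the current trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = current_trace.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate entering a span, then log from inside it.
current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
logger.info("payment authorized")
print(stream.getvalue(), end="")
```

Each logging library needs its own version of this hook, which is one reason behavior differs between languages and bridges.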

In practice, most teams are still running separate log pipelines (Fluent Bit, Fluentd, Logstash) alongside their OTel pipelines for traces and metrics, even when those teams nominally describe themselves as "fully on OTel." True unified pipeline adoption — logs, traces, and metrics all flowing through the OTel Collector to a single backend — remains relatively rare. The Grafana stack (Loki + Tempo + Mimir) comes closest to a fully OTel-native open-source story, but even there, the operational overhead of running the full stack is non-trivial.

Vendor Differentiation in an OTel-Native World

The commoditization of instrumentation creates an interesting competitive dynamic for observability vendors. If any OTel-instrumented application can send data to any OTel-compatible backend, vendor lock-in moves from the agent/SDK layer to the query and analytics layer. Honeycomb has leaned heavily into this, building its product around columnar storage and high-cardinality query performance (BubbleUp, Query Builder) that Prometheus-derivative stacks cannot match for trace analysis. Grafana is competing on breadth — offering the most complete open-source observability stack while also providing a managed service. Datadog competes on integrated correlation: its APM, logs, and infrastructure monitoring are more deeply joined in the query experience than anything in the open-source ecosystem.

The implication for teams choosing a backend is that the instrumentation choice (OTel) is now largely decoupled from the storage and query choice. This is genuinely liberating, but it also means the backend selection decision now rests almost entirely on query UX, cardinality handling, and pricing model — not on which agent is easiest to install.

What Teams Are Actually Deploying in 2025

Based on public engineering blog posts and CNCF survey data, the most common 2025 OTel production configuration looks like this: OTel SDK auto-instrumentation for traces in Java, Python, or Node.js; a lightweight OTel Collector sidecar or DaemonSet for batching and routing; head-based sampling at 5-20% for high-traffic paths, with tail-based sampling for error traces; metrics from the OTel SDK sent to a Prometheus-compatible endpoint, with aggressive metric allow-listing to control cardinality; and logs still in a separate Fluent Bit pipeline.

The auto-instrumentation path (no code changes required) works well for standard HTTP frameworks, database drivers, and message queue clients. It breaks down for custom business logic: if you want to trace your order fulfillment pipeline with meaningful span names and business-relevant attributes, manual instrumentation of your domain code is unavoidable. This is where the 80/20 trap lives: auto-instrumentation gives you 80% of the observable surface for minimal effort, but the 20% that actually explains why an incident happened often requires significant manual work.

Actionable Takeaways

Before enabling full OTel instrumentation in production, audit your high-cardinality attributes. Any attribute that grows unboundedly (user IDs, session tokens, request IDs) should stay in trace spans, not in metrics. Set explicit metric allow-lists with the OTel Collector's filter processor before your first bill arrives.
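Tail-based sampling is superior to head-based for capturing interesting traces, but it requires a stateful collector tier. If your team lacks the operational capacity to run that, head-based sampling at 5-10% is a practical starting point. Keeping 100% of error traces on top of that requires a decision made after the request completes, so treat it as the first tail-based policy to add, even if it initially runs on a single Collector instance.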
The OTel Collector's transform processor can redact or drop high-cardinality attributes before they reach your backend. Use it as a backstop even if you're careful about instrumentation — third-party libraries often emit attributes you don't control.
If you're evaluating backends, prioritize the query experience for trace analysis over cost per GB of ingested data. Slow trace queries during an incident cost more in engineer time than the marginal difference in storage pricing between vendors.
The CNCF OpenTelemetry Demo application is the most complete reference implementation of OTel across multiple language SDKs. Run it locally before rolling out to production — it surfaces the instrumentation decisions you'll face before you face them under pressure.
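The allow-list idea from the takeaways above can be sketched in a few lines. This illustrates the filtering logic only and is not Collector configuration. The attribute names in `ALLOWED_METRIC_ATTRIBUTES`, the helper functions, and the sample data points are assumptions.

```python
# Hypothetical allow-list: only bounded attributes may label a metric.
ALLOWED_METRIC_ATTRIBUTES = {"http.method", "http.route", "http.status_code"}


def restrict_attributes(attributes: dict, allowed: set = ALLOWED_METRIC_ATTRIBUTES) -> dict:
    """Return only allow-listed attributes; everything else is dropped."""
    return {k: v for k, v in attributes.items() if k in allowed}


def distinct_series(points) -> set:
    """Each unique attribute combination is one time series."""
    return {tuple(sorted(p.items())) for p in points}


raw_points = [
    {"http.method": "GET", "http.route": "/cart", "http.status_code": 200,
     "user_id": "u-1"},
    {"http.method": "GET", "http.route": "/cart", "http.status_code": 200,
     "user_id": "u-2"},
    {"http.method": "GET", "http.route": "/cart", "http.status_code": 500,
     "user_id": "u-3", "session_token": "abc"},
]

before = len(distinct_series(raw_points))
after = len(distinct_series(restrict_attributes(p) for p in raw_points))
print(before, after)  # 3 2
```

An allow-list fails closed: an attribute that a third-party library starts emitting tomorrow is dropped by default rather than silently creating new series.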
