IRCNF

OpenTelemetry has become the observability standard — now comes the hard part of actually using it

Three years ago, engineering teams choosing an observability stack faced a painful choice: adopt a vendor's proprietary instrumentation agents and accept lock-in, or stitch together disparate open-source tools and accept the maintenance burden. OpenTelemetry, the CNCF project formed by merging OpenCensus and OpenTracing in 2019, has largely settled that dilemma. By 2026, it has become the de facto standard for telemetry data collection across the industry.

Datadog, Honeycomb, Grafana, New Relic, Dynatrace, and Splunk all accept OpenTelemetry data natively. Cloud providers including AWS, Google Cloud, and Azure have native OpenTelemetry integration. The SDKs for Python, Java, Go, JavaScript, and .NET have reached stable releases. The OpenTelemetry Collector — the pipeline component that receives, processes, and exports telemetry — is deployed in production at companies ranging from small startups to hyperscalers.

OpenTelemetry won the instrumentation war. The next problem is using it well.

What OpenTelemetry actually standardizes

OpenTelemetry defines three telemetry signals: traces (the path of a request through distributed services), metrics (numerical measurements over time), and logs (structured event records). The project provides SDKs for generating these signals, a wire format (OTLP — OpenTelemetry Protocol) for transmitting them, and the Collector for routing them to backends.

What it doesn't standardize is what to do with the data once you have it. The query languages, alerting models, dashboard tools, and analysis workflows are all backend-specific. Switching from Datadog to Grafana still requires rewriting dashboards and alerts — the data collection layer is now portable, but the analysis layer isn't. This is the "observability portability" problem that the industry is still working through.

Where teams get stuck

The most common failure mode is treating OpenTelemetry as a checkbox rather than a practice. Teams instrument their services, see traces appearing in their observability backend, and declare success. Six months later, when a production incident occurs, they discover that the traces are incomplete (some services weren't instrumented), the spans lack useful attributes (no user IDs, no feature flags, no business context), and the cardinality of their metrics has blown past their backend's limits.

Instrumentation quality is the first gap. Auto-instrumentation — where the SDK automatically generates spans for HTTP calls, database queries, and message queue operations — captures the structural skeleton of a request but nothing about the business logic. Whether an API call was for a premium customer or a free-tier customer, which feature flag was active, what the cart total was: these are attributes that require manual instrumentation to capture. The teams that get the most value from OpenTelemetry are the ones that treat span attributes as a product of deliberate design rather than automatic generation.
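As a minimal sketch of what deliberately designed span attributes might look like: the helper below accepts any object with a set_attribute(key, value) method, which is the shape of a span in the OpenTelemetry Python API. The RecordingSpan stand-in, the annotate_checkout_span helper, and the app.* attribute names are assumptions made for illustration, not names defined by the project.

```python
class RecordingSpan:
    """Stand-in for a real span so the sketch runs without the SDK installed."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def annotate_checkout_span(span, customer_tier, active_flags, cart_total):
    """Attach business context that auto-instrumentation cannot know about."""
    span.set_attribute("app.customer.tier", customer_tier)
    # Sorted so the same set of flags always produces the same value.
    span.set_attribute("app.feature_flags", ",".join(sorted(active_flags)))
    span.set_attribute("app.cart.total", round(cart_total, 2))


span = RecordingSpan()
annotate_checkout_span(span, "premium", {"new_checkout", "fast_search"}, 129.5)
print(span.attributes)
```

In a real service the span would come from the tracer rather than a stand-in, for example the current span created by auto-instrumentation, so the business attributes land on the same span as the HTTP details.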

Cardinality is the second gap. Every unique combination of attribute values creates a new time series in a metrics backend. A metric tagged with user ID, region, feature flag, and HTTP status code can create millions of distinct series — and many observability backends charge by time series or impose hard limits. Teams that instrument metrics without thinking carefully about cardinality end up either paying unexpectedly large observability bills or hitting data loss limits exactly when they need the data most.
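The arithmetic behind that explosion is a simple product. The sketch below computes the worst-case series count for one metric; the distinct-value counts are hypothetical numbers chosen for illustration, not measurements from any real system.

```python
from math import prod

# Hypothetical distinct-value counts per attribute on a single metric.
dimensions = {
    "user_id": 50_000,
    "region": 6,
    "feature_flag": 4,
    "http_status_code": 12,
}


def worst_case_series(dims):
    """Upper bound on time series: the product of distinct values per attribute."""
    return prod(dims.values())


with_user = worst_case_series(dimensions)
without_user = worst_case_series(
    {name: count for name, count in dimensions.items() if name != "user_id"}
)
print(with_user)     # 14400000
print(without_user)  # 288
```

Dropping the single unbounded dimension takes the upper bound from millions of series to a few hundred, which is why user-level identifiers usually belong on spans rather than on metrics.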

The Collector as a strategic asset

The OpenTelemetry Collector is arguably the most underutilized component in the ecosystem. Most teams deploy it as a dumb forwarder — receive OTLP from services, export to backend. The Collector's processor pipeline can do substantially more: filtering out high-cardinality data before it reaches the backend, sampling traces based on business rules (sample 100% of error traces, 1% of successful ones), enriching spans with metadata from Kubernetes or service discovery, and routing different signals to different backends.
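The sampling rule in parentheses can be modeled in a few lines. This is only a sketch of the decision logic, written in Python for readability; in the Collector itself the rule would be expressed as processor configuration, not application code. The keep_trace function and the synthetic trace IDs are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.01) -> bool:
    """Keep every error trace and a fixed fraction of successful ones."""
    if has_error:
        return True
    # Hash the trace ID into [0, 1) so the decision is deterministic per trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < success_rate


# Synthetic trace IDs, for illustration only.
successes = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, has_error=False) for t in successes)
print(f"kept {kept} of {len(successes)} successful traces")
print(keep_trace("trace-with-error", has_error=True))  # True
```

Hashing the trace ID instead of calling a random number generator means every Collector instance that sees the same trace reaches the same decision. Note that the error check needs the whole trace, so a rule like this only works where the complete trace is available.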

A well-configured Collector pipeline can reduce observability costs by 60-80% without losing meaningful signal, by aggressively dropping low-value data (health check traces, routine background jobs) while preserving everything relevant to debugging and business analysis. Teams that treat the Collector as infrastructure to configure thoughtfully rather than deploy and forget extract dramatically more value from the same budget.
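To make the dropping step concrete, here is a small sketch of a low-value filter and the volume reduction it produces on a toy batch. The route list, the app.job.kind attribute, and the sample spans are all hypothetical, and the resulting percentage describes this synthetic batch only; it is not evidence for the savings range quoted above.

```python
# Hypothetical definition of "low value" for one team.
LOW_VALUE_ROUTES = {"/healthz", "/readyz"}


def is_low_value(span: dict) -> bool:
    """True for health checks and routine background jobs."""
    attrs = span.get("attributes", {})
    return (
        attrs.get("url.path") in LOW_VALUE_ROUTES
        or attrs.get("app.job.kind") == "routine"
    )


def filter_spans(spans):
    """Return only the spans worth forwarding to the backend."""
    return [s for s in spans if not is_low_value(s)]


# Synthetic batch: 6 health checks, 1 routine job, 3 checkout requests.
spans = (
    [{"name": "GET /healthz", "attributes": {"url.path": "/healthz"}}] * 6
    + [{"name": "cleanup", "attributes": {"app.job.kind": "routine"}}]
    + [{"name": "POST /checkout", "attributes": {"url.path": "/checkout"}}] * 3
)
kept = filter_spans(spans)
reduction = 1 - len(kept) / len(spans)
print(f"kept {len(kept)} of {len(spans)} spans ({reduction:.0%} dropped)")
```

Running the same calculation against a sample of real traffic is a cheap way to estimate what a filter would save before changing the pipeline.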

Semantic conventions: the underappreciated standard

OpenTelemetry's semantic conventions, a specification for how to name and structure attributes on spans and metrics, are the project's most underappreciated contribution. When every team names its database span attributes differently, you can't write generic dashboards or alerts that work across services. When everyone follows the same conventions (for example http.request.method and http.response.status_code for HTTP, or db.system.name and db.query.text for databases, which replaced the earlier http.method, http.status_code, db.system, and db.statement), tooling can reason about telemetry without service-specific configuration.
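A sketch of why shared names matter: one function can compute an error rate across every service as long as they agree on the attribute key. Older instrumentation still emits the earlier HTTP names (http.method, http.status_code), while the current stable HTTP conventions use http.request.method and http.response.status_code, so the example normalizes the former to the latter first. The flattened span dictionaries and the service names are assumptions for illustration.

```python
# Earlier HTTP attribute names mapped to their current stable equivalents.
LEGACY_TO_CURRENT = {
    "http.method": "http.request.method",
    "http.status_code": "http.response.status_code",
}


def normalize(attributes: dict) -> dict:
    """Rename legacy keys so downstream code only deals with one spelling."""
    return {LEGACY_TO_CURRENT.get(key, key): value for key, value in attributes.items()}


def server_error_rate(spans) -> float:
    """Share of HTTP spans with a 5xx status, regardless of which service sent them."""
    codes = [
        attrs["http.response.status_code"]
        for attrs in map(normalize, spans)
        if "http.response.status_code" in attrs
    ]
    if not codes:
        return 0.0
    return sum(code >= 500 for code in codes) / len(codes)


# Hypothetical spans, flattened to plain dictionaries for the example.
spans = [
    {"service.name": "cart", "http.method": "GET", "http.status_code": 200},
    {"service.name": "payments", "http.request.method": "POST", "http.response.status_code": 503},
    {"service.name": "search", "http.request.method": "GET", "http.response.status_code": 200},
    {"service.name": "search", "http.request.method": "GET", "http.response.status_code": 500},
]
print(server_error_rate(spans))  # 0.5
```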

Semantic convention adoption is still incomplete in practice. The stable conventions cover HTTP, databases, messaging systems, and RPC — the most common infrastructure patterns. Many application-specific patterns are either not covered or still in experimental status. Teams building on OpenTelemetry should invest time in defining their own attribute conventions for application-level context, following the naming patterns the project establishes, to get the same composability benefits at the application layer.
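One lightweight way to hold application-level attributes to a convention is a name check. The sketch below assumes a house rule of lowercase, dot-separated namespaces with snake_case segments under an app. prefix; that rule, the pattern, and the candidate names are illustrative choices, not part of the OpenTelemetry specification.

```python
import re

# Assumed house rule: "app." prefix, dot-separated namespaces, snake_case segments.
NAME_PATTERN = re.compile(r"app(\.[a-z][a-z0-9]*(_[a-z0-9]+)*)+")


def invalid_names(names):
    """Return the attribute names that break the assumed naming rule."""
    return [name for name in names if not NAME_PATTERN.fullmatch(name)]


candidates = [
    "app.cart.total",
    "app.customer.tier",
    "CartTotal",
    "app.Feature-Flag",
    "customer_tier",
]
print(invalid_names(candidates))  # ['CartTotal', 'app.Feature-Flag', 'customer_tier']
```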

What good looks like in 2026

The teams getting the most value from OpenTelemetry in 2026 share a few characteristics. They have a platform team or observability champion who owns the Collector configuration and instrumentation standards. They've defined a set of required span attributes for new services and enforce them in code review. They've done explicit cardinality analysis on their metrics and made deliberate choices about which dimensions to keep. They use tail-based sampling to capture complete traces for errors and slow requests while aggressively sampling routine traffic.
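The required-attributes practice can be backed by a check that runs in tests or CI instead of relying on reviewers alone. This is a minimal sketch: the required set mixes two standard resource attributes (service.name, service.version) with a hypothetical application attribute, and the missing_attributes helper is an assumption for illustration.

```python
# Assumed team standard; "app.customer.tier" is a hypothetical application attribute.
REQUIRED_ATTRIBUTES = {"service.name", "service.version", "app.customer.tier"}


def missing_attributes(span_attributes: dict, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names absent from a span, sorted for stable output."""
    return sorted(set(required) - set(span_attributes))


span_attributes = {"service.name": "checkout", "app.customer.tier": "free"}
print(missing_attributes(span_attributes))  # ['service.version']
```

A test that exercises each new endpoint and asserts that missing_attributes returns an empty list turns the standard into something a build can fail on.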

They also treat observability data as a product, not infrastructure exhaust. The question isn't "are we collecting data?" but "can the on-call engineer find the root cause of a production incident in under 10 minutes?" That standard drives different instrumentation decisions than "is the Collector running?"

OpenTelemetry solved the hardest coordination problem in observability: getting the industry to agree on a common data format. The next layer of problems (what to instrument, how to configure the pipeline, what to build on top) consists of engineering and organizational challenges that each team has to work through for itself. The good news is that the foundation is now stable enough to build on.
