IRCNF

OpenTelemetry Has Won the Observability Wars. Now Comes the Hard Part.

The Problem Nobody Talks About Enough

You're running a distributed system in production. Something is wrong — latency spiked, error rates climbed, a user is filing a support ticket. Your first instinct is to look at logs. But which service? You have twelve of them. The logs are there, somewhere, scattered across three different tooling stacks left behind by engineers who've since moved on. You grep. You pivot. Twenty minutes later you find the root cause: a database query that started timing out because a schema migration ran on the wrong host.

This is the observability problem. Not the alerting. Not the dashboards. The fundamental problem: understanding what is actually happening inside your systems when things go wrong — and increasingly, before they go wrong. The three canonical signals are logs (discrete events), metrics (numeric measurements over time), and traces (the journey of a request through a distributed system). Before OpenTelemetry, collecting all three in a coherent, portable way was genuinely painful.

The Lock-In Era

Every observability vendor (Datadog, New Relic, Dynatrace, Honeycomb) shipped its own SDK. Its own agent. Its own wire protocol. Even open-source backends like Jaeger came with their own client libraries. If you instrumented your Python service with the Datadog tracer and later decided to evaluate Honeycomb, you were looking at rewriting your instrumentation layer. Not because the vendors were being malicious, but because there was no standard. The telemetry collection layer was the moat.

This wasn't a minor annoyance. It meant teams were making vendor decisions early, before they had enough production data to know what they actually needed. It meant platform teams spent engineering cycles maintaining multiple telemetry pipelines. It meant switching costs were high enough that most teams just didn't switch — even when a better tool came along.

Where OpenTelemetry Came From

In 2019, two competing open-source observability projects — OpenCensus (backed by Google) and OpenTracing (a CNCF project) — merged into OpenTelemetry. The merger avoided what would have been a damaging fragmentation of the ecosystem. OpenTelemetry became a CNCF project, and it grew fast. By contributor count, it's now the second most active CNCF project after Kubernetes.

The value proposition was simple: instrument once, export anywhere. One SDK per language. One wire protocol — OTLP (OpenTelemetry Protocol). One collector — otel-collector — to receive, process, and export telemetry to any backend. Vendors compete on analysis and visualization, not on instrumentation lock-in.

The State of OTel in 2026

The bet paid off. OpenTelemetry is deployed in production at Google, Microsoft, Shopify, Uber, and thousands of other organizations. Every major observability backend — Datadog, Grafana, Elastic, Honeycomb, Dynatrace, New Relic, Splunk — accepts OTLP natively. The CNCF reports that OpenTelemetry is in use at 83% of Fortune 500 companies in some capacity. That's not a niche project anymore. That's infrastructure.

The Go, Java, Python, and JavaScript SDKs are stable and well-maintained. Auto-instrumentation for popular frameworks such as Django, Express, Spring Boot, and Rails works reliably. In languages with a zero-code agent, such as Java, Python, and Node.js, you can add a single dependency, set a few environment variables, and have traces flowing to your backend without touching application code. For greenfield projects, this is the obvious starting point.

What's Actually Working

Traces are the strongest part of the OTel story. The tracing spec is mature. The SDKs are stable. Auto-instrumentation for HTTP, database calls, and messaging systems covers the most common instrumentation needs. If you're starting with OTel today, start here.

The otel-collector is powerful and production-proven. It can receive telemetry from dozens of sources, apply transformations, filter noise, and fan out to multiple backends simultaneously. Teams running complex multi-cloud environments use the collector as a telemetry routing layer — export everything to the collector, let the collector handle backend routing.

The ecosystem has matured around the collector, too. Contrib receivers and exporters cover most enterprise data sources. The operator for Kubernetes is solid. Vendor distributions of the collector provide tested, supported builds with vendor-specific exporters pre-configured.

What's Still Hard

Logs are only recently stable in the OTel spec. The logging signal lagged behind traces and metrics significantly, and while it's now stable, the ecosystem hasn't fully caught up. Correlating logs with traces using TraceId and SpanId — which is the whole point — still requires deliberate wiring in most frameworks. It works, but it's not zero-effort yet.
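To make "deliberate wiring" concrete, here is a minimal sketch using only Python's standard logging module. It assumes the active trace context lives in a context variable; the `_current_span` variable, the `checkout` logger name, and the IDs are all hypothetical stand-ins for what a real OTel SDK and logging integration would provide.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context. A real OTel SDK tracks
# this for you. Holds a (trace_id, span_id) tuple of hex strings, or None.
_current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        ctx = _current_span.get()
        record.trace_id, record.span_id = ctx if ctx else ("-", "-")
        return True  # never drop the record, only annotate it


def build_logger(stream):
    """Create a logger whose output lines carry trace_id and span_id."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log once outside a span and once inside one, then return the lines."""
    stream = io.StringIO()
    logger = build_logger(stream)
    logger.info("outside any span")
    token = _current_span.set(
        ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
    try:
        logger.info("payment authorized")
    finally:
        _current_span.reset(token)
    return stream.getvalue().splitlines()
```

Once every log line carries the same trace ID as the spans emitted during that request, a backend can jump from a slow span to the exact log lines written while it was running. The wiring is small, but someone has to add it to each service's logging setup.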

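Metric semantic conventions are inconsistent across languages. The same measurement can surface under slightly different names, units, or attribute keys depending on which language's instrumentation emitted it. HTTP server latency, for example, appears as http.server.duration in older instrumentation and as http.server.request.duration under the stable HTTP conventions. This is being addressed, but slowly. Teams building cross-language dashboards hit this friction.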

Collector configuration is complex. The otel-collector YAML configuration — receivers, processors, exporters, pipelines — is flexible but verbose. As topologies grow, you end up with YAML files that are genuinely difficult to reason about. There's no great solution here yet. Some teams use jsonnet or cue to generate collector configs. It's a solvable problem, but it's real operational work.
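The same generate-don't-hand-write idea can be sketched in plain Python, without jsonnet or cue. This is one possible shape, not a recommended tool: the backend names and endpoints are placeholders, and the `validate` helper is a hypothetical check that only catches pipelines referencing undefined components.

```python
import json


def otlp_exporter(endpoint):
    """Config for one OTLP exporter. The endpoint is a placeholder."""
    return {"endpoint": endpoint}


def build_collector_config(backends):
    """Build a collector config that fans traces out to every backend.

    `backends` maps a hypothetical backend name to its OTLP endpoint.
    """
    exporters = {f"otlp/{name}": otlp_exporter(url)
                 for name, url in sorted(backends.items())}
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": exporters,
        "service": {
            "pipelines": {
                "traces": {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": list(exporters),
                }
            }
        },
    }


def validate(config):
    """Fail fast if a pipeline references a component that is not defined."""
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            missing = [c for c in pipeline.get(kind, [])
                       if c not in config.get(kind, {})]
            if missing:
                raise ValueError(
                    f"pipeline {name!r} uses undefined {kind}: {missing}")
    return config


config = validate(build_collector_config({
    "primary": "backend-a.example.com:4317",
    "archive": "backend-b.example.com:4317",
}))
# JSON is also valid YAML, so this text can be written out as a config file.
rendered = json.dumps(config, indent=2)
```

The payoff is that adding a backend becomes a one-line change to the input mapping, and a typo in a pipeline fails in CI instead of at collector startup.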

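Tail-based sampling is powerful but operationally heavy. Head-based sampling is simple and stateless: the keep-or-drop decision is made when the trace starts, before anyone knows how it will end. Tail-based sampling waits until the trace is complete, which lets you keep 100% of error traces and slow traces while dropping routine ones, but it requires stateful infrastructure. The tail_sampling processor in the OTel Collector works, but it requires careful deployment architecture to ensure all spans from a given trace land on the same collector instance, typically by putting a trace-ID-aware load-balancing tier in front of the sampling collectors.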
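The difference between the two approaches, and the reason routing matters, fits in a few lines of Python. This is an illustrative sketch, not how the OTel SDK or Collector implement sampling: the `Span` type, the SHA-256 bucketing, the modulo routing, and the 500 ms threshold are all assumptions chosen for clarity.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def _digest(trace_id):
    """Stable integer derived from the trace ID."""
    return int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)


def head_sample(trace_id, ratio):
    """Stateless decision from the trace ID alone, before any span finishes."""
    return _digest(trace_id) % 10_000 < ratio * 10_000


def route(trace_id, collectors):
    """Pick one collector per trace so all of its spans meet in one place.

    Plain modulo routing reshuffles traces when the collector list changes.
    A production setup would more likely use consistent hashing.
    """
    return collectors[_digest(trace_id) % len(collectors)]


def tail_sample(spans, slow_ms=500.0):
    """Stateful decision: buffer whole traces, then keep errors and slow ones."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    kept = {}
    for trace_id, group in traces.items():
        has_error = any(s.is_error for s in group)
        is_slow = max(s.duration_ms for s in group) >= slow_ms
        if has_error or is_slow:
            kept[trace_id] = group
    return kept
```

`head_sample` needs nothing but the trace ID, so any process can evaluate it independently. `tail_sample` only gives the right answer if it sees every span of a trace, which is why `route` has to send all spans with the same trace ID to the same instance.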

The Cardinality Problem OTel Didn't Solve

OpenTelemetry makes it easy to add attributes to your telemetry. That's a feature. It's also a footgun. Adding high-cardinality attributes — user IDs, request IDs, session tokens — to metrics is a classic mistake that causes cost explosions at backends that charge per metric series (which is most of them).

OTel doesn't protect you from this. It faithfully forwards whatever attributes you attach. The cardinality explosion is a backend problem, but OTel is the mechanism by which you create it. Teams adopting OTel need to understand this early: metrics cardinality governance is not optional, it's operational hygiene. Use high-cardinality attributes on spans and logs, where they belong. Keep metric dimensions low and deliberately bounded.
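A rough way to see the blast radius, and one way to enforce a bounded label set, is sketched below. The allowlist contents and the per-attribute counts are made-up illustrations; in practice the filtering might live in SDK view configuration or a collector processor rather than in application code.

```python
from math import prod

# Hypothetical allowlist: metric name -> attribute keys permitted on it.
ALLOWED_METRIC_ATTRIBUTES = {
    "http.server.request.duration": {
        "http.request.method",
        "http.route",
        "http.response.status_code",
    },
}


def bounded_attributes(metric_name, attributes):
    """Drop any attribute that is not explicitly allowed for this metric."""
    allowed = ALLOWED_METRIC_ATTRIBUTES.get(metric_name, set())
    return {k: v for k, v in attributes.items() if k in allowed}


def worst_case_series(distinct_values):
    """Upper bound on time series: the product of per-attribute cardinalities."""
    return prod(distinct_values.values())


# Illustrative counts of distinct values per attribute.
bounded = {
    "http.request.method": 5,
    "http.route": 40,
    "http.response.status_code": 12,
}
with_user_id = {**bounded, "user.id": 100_000}

safe_series = worst_case_series(bounded)            # 5 * 40 * 12
exploded_series = worst_case_series(with_user_id)   # the above * 100_000

recorded = bounded_attributes(
    "http.server.request.duration",
    {"http.request.method": "GET", "http.route": "/cart", "user.id": "u-8812"},
)
```

With these numbers, three bounded attributes cap out at 2,400 series, while adding a single user ID attribute raises the ceiling to 240 million. The allowlist silently drops `user.id` from the metric; the same attribute is still welcome on the span for that request.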

The Fourth Signal: Profiling

OpenTelemetry is working on a fourth observability signal: continuous profiling. Profiling data — CPU flame graphs, memory allocation traces — has historically lived in separate tools with no standard format or correlation with the other three signals.

The OTel profiling signal is still experimental, but Grafana, Elastic, and several other vendors are already betting on it. When it stabilizes, it will close the last major gap in the OTel observability story.

Practical Advice for Teams Adopting OTel Today

  • Start with traces. The tracing signal is the most mature. Use auto-instrumentation first — don't manually instrument before you've seen what auto-instrumentation gives you.
  • Pick your backend before you configure the collector. Don't build a complex collector topology before you know your backend. A single exporter is fine to start.
  • Don't roll a custom collector topology prematurely. Add it when you need fan-out, filtering, or enrichment — not before.
  • Understand cardinality before you add metric attributes. Define your allowed label set explicitly.
  • Know which package to depend on. Applications install and configure the SDK; instrumentation code calls the API. Depend on the API alone only if you're writing a library that shouldn't force an SDK choice on consumers.

The War Is Won. The Work Isn't Done.

OpenTelemetry has achieved something genuinely rare in infrastructure software: it killed the instrumentation lock-in problem. Vendors compete on the quality of their analysis, their query languages, their alerting, their UX — not on trapping you in their SDK. That's a better world.

But "the standard won" is the beginning of the story, not the end. The hard part (consistent cross-language conventions, approachable collector configuration, low-effort log correlation, cardinality governance, profiling) is still being worked through. Teams adopting OTel today are doing so in a mature but still-evolving ecosystem. That's fine. The foundation is solid. Build on it deliberately.
