Adopting OpenTelemetry Without Rewriting Your Entire Stack

Logpathio Team 10 min read

The OpenTelemetry migration guides make it look like a configuration exercise. Point your services at a Collector, swap out the SDK, verify traces are flowing. What the guides skip: the sampling decisions you have to make before your first production deploy that you cannot easily reverse, the 30% of instrumentation coverage that auto-instrumentation will never reach, the Collector memory sizing that will cause your pipeline to start dropping spans under traffic spikes if you get it wrong, and the quiet incompatibility between your current Datadog or New Relic agent and the OTEL semantic conventions that will make your existing dashboards useless until you rebuild them.

None of this means OTEL isn't worth adopting — it is, and the portability value compounds over time. It means going in with an accurate picture of the work rather than the optimistic one.

The Collector Pipeline: Receivers, Processors, Exporters — and Where It Goes Wrong

The OTEL Collector is a data pipeline with three stages: receivers (accept telemetry from sources), processors (transform or filter data in transit), and exporters (send data to backends). Understanding this pipeline is the prerequisite for understanding why Collector misconfiguration causes silent data loss rather than obvious errors.

A minimal Collector configuration for receiving OTEL traces and exporting to a backend:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"   # OTLP gRPC
      http:
        endpoint: "0.0.0.0:4318"   # OTLP HTTP

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512           # hard limit on Collector memory usage
    spike_limit_mib: 128     # soft limit = limit_mib - spike_limit_mib (384 MiB); new data is refused above it
  batch:
    send_batch_size: 1024    # flush after this many spans
    timeout: 5s              # or after this long, whichever comes first
  resourcedetection:
    detectors: [env, system, eks, gcp, azure]  # auto-detect cloud resource attributes

exporters:
  otlp/backend:
    endpoint: "api.yourbackend.com:4317"
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]  # memory_limiter MUST be first
      exporters: [otlp/backend]

The processor ordering matters: memory_limiter must come before batch. Placed first, it can refuse incoming data as soon as memory runs short, before that data is copied into the batch processor's buffers. Placed after batch, it only sees data that has already been buffered, so memory pressure builds in the batch queue before anything can shed load. The official Collector documentation recommends this ordering; teams that copy configs from other sources sometimes get it wrong and don't notice until a traffic spike ends in an OOM kill.
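The ordering rule is easy to check mechanically before a deploy. The following is a minimal sketch, assuming the Collector config has already been parsed into a plain dict (for example by a YAML parser). The function name check_processor_order is hypothetical and not part of any OTEL tooling.

```python
def check_processor_order(config: dict) -> list[str]:
    """Return human-readable problems found in service.pipelines.

    Assumes `config` is the Collector YAML already parsed into a dict.
    """
    problems = []
    defined = config.get("processors", {})
    pipelines = config.get("service", {}).get("pipelines", {})
    for name, pipeline in pipelines.items():
        processors = pipeline.get("processors", [])
        if "memory_limiter" not in processors:
            problems.append(f"{name}: memory_limiter is missing")
        elif processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter must be the first processor")
        # Every processor referenced by a pipeline must also be defined.
        for proc in processors:
            if proc not in defined:
                problems.append(f"{name}: processor '{proc}' is not defined")
    return problems


good = {
    "processors": {"memory_limiter": {}, "batch": {}},
    "service": {"pipelines": {"traces": {"processors": ["memory_limiter", "batch"]}}},
}
bad = {
    "processors": {"memory_limiter": {}, "batch": {}},
    "service": {"pipelines": {"traces": {"processors": ["batch", "memory_limiter"]}}},
}

print(check_processor_order(good))
print(check_processor_order(bad))
```

Running a check like this in CI catches the swapped ordering at review time rather than during a traffic spike.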

Auto-Instrumentation Gives You 70%. Here Is What the Other 30% Requires.

The OTEL auto-instrumentation agent (Java agent, .NET auto-instrumentation, Node.js auto-instrumentation) instruments library calls automatically. For a Java service using Spring Boot, RestTemplate, JDBC, and a standard Kafka client, you get HTTP server spans, HTTP client spans, database query spans, and Kafka producer/consumer spans with zero code changes. This is genuinely valuable and covers the structural scaffolding of most request flows.

What auto-instrumentation cannot instrument: your business logic. The span created by Spring's HTTP server instrumentation tells you "POST /api/checkout took 2.3 seconds." It does not tell you what happened inside those 2.3 seconds at the application level — which step in the checkout workflow was slow, whether it was the inventory check or the payment authorization or the order serialization, what the order ID was, or whether the user was in a particular cohort.

The 30% that requires manual span creation covers:

  • Business-critical workflow steps that aren't library calls (a multi-step payment processing function, a complex order validation workflow)
  • Span attributes that carry business context (order.id, user.tier, product.category) — these enable incident correlation at the business-logic level, not just the infrastructure level
  • Any code path that uses a library the auto-instrumentation agent doesn't know about (internal RPC frameworks, custom database drivers, proprietary message queue clients)
  • Background jobs and scheduled tasks that don't enter through an instrumented HTTP handler
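// Go: manual span for a business-critical step.
// Order and validateInventory are assumed to be defined elsewhere in this package;
// the package name and tracer name below are illustrative.
package checkout

import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/codes"
  "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("checkout")

func processCheckout(ctx context.Context, order Order) error {
  // Start a child span of whatever span is already in ctx (e.g. the HTTP server span).
  ctx, span := tracer.Start(ctx, "checkout.validate_inventory",
    trace.WithAttributes(
      attribute.String("order.id", order.ID),
      attribute.Int("order.item_count", len(order.Items)),
      attribute.String("customer.tier", order.CustomerTier),
    ),
  )
  defer span.End()

  if err := validateInventory(ctx, order); err != nil {
    // Attach the error as a span event and mark the span as failed.
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
    return err
  }
  return nil
}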

The coverage question to answer early: for each service in your P1 critical path, what are the business-logic steps that would matter during an incident? That list is your manual instrumentation backlog. Start it before the OTEL migration, not after. Auto-instrumentation gets the pipeline running quickly; the manual spans are what make traces actionable for your specific domain.

Sampling Decisions You Cannot Defer

Most teams deploy OTEL with a head-based sampling rate (5%, 10%, or 100%) and treat sampling strategy as something to revisit later. This is a mistake. A head-based sampler decides whether to keep a trace when the root span starts, before it knows whether the request will fail or run slowly, so at any rate below 100% it discards the traces you most need during incidents at the same rate as routine ones. A 10% sample rate drops roughly 90% of your error traces, your high-latency traces, and the unusual traces that represent novel failure modes.

The sampling decision to make before production deployment is whether you want tail-based sampling. Tail-based sampling buffers complete traces and applies policies against the full trace data, so you can keep all error traces and all traces above a latency threshold while sampling down the routine successful requests. This requires a stateful Collector tier (the tail_sampling processor) with trace-ID-aware load balancing in front of it (for example, the loadbalancing exporter routing by trace ID), so that every span of a trace reaches the same Collector instance. That is more complex to operate than a simple DaemonSet Collector.
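The difference between the two approaches can be illustrated on synthetic data. This is a toy sketch, not the Collector's actual algorithm: the Trace type, the hash-based bucketing, and the 1,500 ms threshold are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def _bucket(trace_id: str) -> int:
    """Map a trace ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def head_keep(trace: Trace, percent: int) -> bool:
    # Head-based: decided from the trace ID alone, before the outcome is known.
    return _bucket(trace.trace_id) < percent


def tail_keep(trace: Trace, percent: int, slow_ms: float = 1500) -> bool:
    # Tail-based: decided after the trace is complete, so the outcome can be used.
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return _bucket(trace.trace_id) < percent


# Synthetic traffic: 2% of traces fail, 2.5% are slow.
traces = [
    Trace(f"t{i}", has_error=(i % 50 == 0), duration_ms=2000 if i % 40 == 0 else 120)
    for i in range(10_000)
]
errors = [t for t in traces if t.has_error]

head_errors = sum(head_keep(t, 10) for t in errors)
tail_errors = sum(tail_keep(t, 10) for t in errors)
print(f"error traces: {len(errors)}, kept by head: {head_errors}, kept by tail: {tail_errors}")
```

With the same 10% baseline, the tail-based decision keeps every error trace, while the head-based decision keeps only the fraction that happened to land in the sampled buckets.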

For a team running 500–5,000 traces per second, a minimal tail-sampling configuration:

# tail_sampling processor — requires stateful collector tier
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000        # buffer capacity; size to (TPS × decision_wait × 1.5)
    expected_new_traces_per_sec: 2000
    policies:
      - name: composite-priority
        type: composite
        composite:
          max_total_spans_per_second: 5000
          policy_order: [keep-errors, keep-slow, keep-sampled-baseline]
          composite_sub_policy:     # sub-policies are declared inside the composite policy
            - name: keep-errors
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: keep-slow
              type: latency
              latency: {threshold_ms: 1500}
            - name: keep-sampled-baseline
              type: probabilistic
              probabilistic: {sampling_percentage: 10}
          rate_allocation:
            - policy: keep-errors
              percent: 60
            - policy: keep-slow
              percent: 30
            - policy: keep-sampled-baseline
              percent: 10

The num_traces value is where teams most often misconfigure tail sampling. If it is too small, the buffer fills under traffic spikes and the processor evicts the oldest traces before their decision window has elapsed, so those traces are dropped without any policy being evaluated. Calculate: expected TPS × decision_wait seconds × 1.5 safety factor. At 2,000 TPS with a 10-second decision window, you need at least 30,000 buffered traces; 100,000 provides reasonable headroom.
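The sizing rule is simple enough to keep next to the config as a helper. This is a small sketch of that arithmetic; the function name required_num_traces is hypothetical.

```python
import math


def required_num_traces(traces_per_sec: float, decision_wait_s: float,
                        safety_factor: float = 1.5) -> int:
    """Minimum tail_sampling buffer size: TPS x decision_wait x safety factor."""
    return math.ceil(traces_per_sec * decision_wait_s * safety_factor)


for tps in (500, 2000, 5000):
    print(tps, required_num_traces(tps, decision_wait_s=10))
# 500 7500
# 2000 30000
# 5000 75000
```

Even at the top of the 500–5,000 TPS range, the rule gives 75,000, which is still below the 100,000 configured above.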

Migrating from Datadog or New Relic: The Semantic Convention Gap

If you're migrating from a Datadog agent or a New Relic agent to OTEL SDKs, expect to rebuild dashboards and alerts. This is not a failure of OTEL — it's a consequence of semantic convention differences between vendor-specific attribute names and OTEL's standardized attribute names.

Datadog uses http.url; OTEL uses url.full. Datadog uses service as a tag; OTEL uses service.name as a resource attribute. Datadog's APM trace format is different from OTLP. Trace IDs differ too: OTEL uses 128-bit trace IDs, which backends such as Grafana Tempo accept as-is, while Datadog's tracers have historically used 64-bit IDs, so the two are not directly compatible.

The practical impact: every Datadog dashboard widget and alert that queries on vendor-specific attribute names will break when you switch to OTEL instrumentation. Before starting the migration, audit your existing dashboards and alert conditions for vendor-specific attribute references. Build the equivalent OTEL-semantic queries for each one before you decommission the vendor agent. Running both agents in parallel during migration (see below) gives you time to validate the new queries against live data.
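The audit itself can start as a simple text scan over exported dashboard and alert queries. The sketch below assumes queries use a key:value syntax and covers only the two attribute pairs named above; the mapping table, the query syntax, and the dashboard names are illustrative, and a real audit would need a much longer mapping.

```python
import re

# Vendor attribute name -> OTEL name. Only the two pairs named in the text.
VENDOR_TO_OTEL = {
    "http.url": "url.full",
    "service": "service.name",
}


def audit_query(query: str) -> dict[str, str]:
    """Return vendor-specific attribute names used as keys in `query`."""
    found = {}
    for vendor, otel in VENDOR_TO_OTEL.items():
        # Match the name only as a whole key followed by ':' (assumed syntax),
        # so 'service' does not match inside 'service.name'.
        pattern = rf"(?<![\w.]){re.escape(vendor)}(?![\w.])\s*:"
        if re.search(pattern, query):
            found[vendor] = otel
    return found


# Hypothetical dashboard queries, keyed by dashboard name.
dashboards = {
    "checkout-latency": "service:checkout http.url:*/api/checkout*",
    "already-migrated": "service.name:checkout url.full:*/api/checkout*",
}
for name, query in dashboards.items():
    print(name, audit_query(query))
```

Each non-empty result is a widget or alert that needs an OTEL-semantic equivalent before the vendor agent is removed.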

The Log Pipeline: Keep What You Have, Add the Trace ID Bridge

OTEL logs are the least mature component of the specification and the most disruptive to migrate. For most teams running Fluentd, Filebeat, Vector, or cloud-provider log shipping, the right answer is to keep the existing log pipeline and not migrate it to OTEL format. The OTEL Collector can receive logs in multiple formats (including Fluentd forward protocol and raw JSON), but the maturity of OTEL log instrumentation varies significantly by language.

The integration that delivers most of the value without a log pipeline migration: inject the current OTEL trace context (trace ID and span ID) into your existing structured log output. This makes every log line queryable by trace ID, enabling correlation between logs and traces without changing log format, log shipper, or log backend.

# Python: inject trace context into structlog output
import structlog
from opentelemetry import trace

def add_trace_context(logger, method, event_dict):
    span = trace.get_current_span()
    ctx = span.get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)

With this in place, every log line emitted during a traced operation carries the trace ID. Log-trace correlation is available without a log pipeline migration.
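Once the trace ID is a field in each structured log line, pulling the logs for one trace is a plain field match. This is a minimal sketch of that lookup over JSON lines; the sample records and IDs are made up, and in practice the same filter would be a query in your log backend.

```python
import json


def logs_for_trace(log_lines, trace_id: str):
    """Yield parsed log records whose trace_id field equals `trace_id`."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than failing the whole search
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


# Hypothetical log lines in the shape the structlog processor above would emit.
sample = [
    '{"event": "inventory checked", "trace_id": "0af7651916cd43dd8448eb211c80319c", "span_id": "b7ad6b7169203331"}',
    '{"event": "cache warmed"}',
    '{"event": "payment declined", "trace_id": "0af7651916cd43dd8448eb211c80319c", "span_id": "00f067aa0ba902b7"}',
    "not json",
]
matches = list(logs_for_trace(sample, "0af7651916cd43dd8448eb211c80319c"))
print([m["event"] for m in matches])
# ['inventory checked', 'payment declined']
```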

Running Parallel Instrumentation: The Risk Mitigation Pattern

The most operationally important practice during OTEL migration is running the vendor SDK and the OTEL SDK in parallel for a defined validation period. This means your service emits telemetry to both your existing vendor backend and the new OTEL pipeline simultaneously. It uses more resources and doubles your instrumentation overhead — but it ensures that a misconfiguration in OTEL doesn't leave you with zero observability coverage during an incident.

The validation checklist before decommissioning the vendor SDK for a service:

  • OTEL traces are appearing in the backend for this service with correct parent-child span relationships
  • Trace context is propagating correctly to and from all services this service calls
  • Error spans are capturing the correct exception details (not just a generic "error" status)
  • The equivalent dashboards and alert conditions based on OTEL semantic conventions are validated against at least one full week of traffic, including any weekly traffic pattern peaks
  • At least one simulated incident investigation has been run using only OTEL traces (no fallback to vendor traces) and the result was satisfactory
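
The first two checklist items can be partly automated by looking for orphan spans: spans that name a parent which never arrived in the same trace. This is a sketch under assumptions: the span dict shape (trace_id, span_id, parent_span_id, name) is hypothetical and would need adapting to whatever your backend's export or query API returns.

```python
from collections import defaultdict


def find_orphan_spans(spans):
    """Return spans whose parent_span_id is set but absent from the same trace."""
    ids_by_trace = defaultdict(set)
    for span in spans:
        ids_by_trace[span["trace_id"]].add(span["span_id"])
    return [
        span for span in spans
        if span.get("parent_span_id")
        and span["parent_span_id"] not in ids_by_trace[span["trace_id"]]
    ]


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "name": "POST /api/checkout"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "name": "checkout.validate_inventory"},
    # Parent "z" was never exported: typical of context lost at a queue boundary.
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "z", "name": "orders.consume"},
]
print([s["name"] for s in find_orphan_spans(spans)])
# ['orders.consume']
```

A steady stream of orphans from one service usually points at a propagation gap or an uninstrumented hop rather than at sampling.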

Teams that decommission the vendor SDK before running the parallel validation checklist typically discover the remaining gaps during the first real incident after the migration — not an ideal time to discover that your sampling configuration is dropping error traces or that your trace context propagation breaks at the Kafka boundary you haven't instrumented yet.

The migration is not complete when spans are flowing. It's complete when your on-call team could handle a P1 incident using only the OTEL pipeline and feel confident about the coverage. Run the last point on that checklist as a tabletop exercise, not as a real incident.