Distributed Tracing in Kubernetes: A Practical Guide for SRE Teams

The standard distributed tracing tutorial takes you from zero to Jaeger in about 20 minutes. What it doesn't cover: what happens when your trace context crosses a Kafka topic boundary, how to configure tail-based sampling without running out of collector memory under load, the difference between sidecar and SDK instrumentation when your Istio service mesh is already doing some of the work, and why span attribute cardinality, not trace ID volume, will be your first scaling bottleneck.

This is the stuff that comes up in your second week of production tracing, not your first demo.

Context Propagation: W3C TraceContext vs. B3 and Why the Header Format Matters

Before anything else, decide on a trace context propagation format and be explicit about it. The two dominant formats are W3C TraceContext (traceparent / tracestate headers) and B3 (Zipkin's format, either single-header b3 or multi-header X-B3-TraceId / X-B3-SpanId / etc.). OpenTelemetry supports both, and Envoy/Istio supports both. The problem: if service A propagates W3C and service B expects B3, the trace breaks silently at that service boundary.

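In a multi-cluster Kubernetes environment where different clusters were instrumented at different times, format mismatches are common. Propagators are an SDK setting, not a Collector setting: the Collector receives spans that have already been exported and never touches the headers on application traffic. Most SDKs read a list of propagators from the OTEL_PROPAGATORS environment variable (for example tracecontext,baggage); others, such as the Go SDK, take the list in code. Set it explicitly on every service. The Collector pipeline only needs to pass spans through intact, with memory_limiter ahead of batch:

# otel-collector-config.yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter, batch]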

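// Application SDK configuration (Go example)
// imports: go.opentelemetry.io/otel, go.opentelemetry.io/otel/propagation
// Call once at startup, before any instrumented client or server is created.
otel.SetTextMapPropagator(
  propagation.NewCompositeTextMapPropagator(
    propagation.TraceContext{}, // W3C traceparent / tracestate
    propagation.Baggage{},      // W3C baggage
  ),
)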

If you're on Istio and already have Envoy sidecars, note that Istio's default Zipkin-compatible tracer makes Envoy emit B3 headers unless you configure a different provider. Check the Istio meshConfig.defaultConfig.tracing settings (and any Telemetry API resources) before assuming propagation format consistency across the mesh.
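To make the mismatch concrete, here is a minimal Go sketch that parses a W3C traceparent value and renders the same context as a single-header b3 value. The type and function names (TraceContext, ParseTraceparent, B3Single) are illustrative, not part of any SDK, and the parser only handles version 00 of the W3C format. The point is that the two headers carry the same IDs in different layouts, so a service that only reads one format sees no context at all when handed the other.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TraceContext holds the fields carried by a W3C traceparent header.
type TraceContext struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !strings.ContainsRune("0123456789abcdef", c) {
			return false
		}
	}
	return true
}

// ParseTraceparent parses a version-00 header: "00-<trace-id>-<parent-id>-<flags>".
func ParseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceContext{}, fmt.Errorf("unsupported traceparent %q", h)
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return TraceContext{}, fmt.Errorf("malformed traceparent %q", h)
	}
	// All-zero trace or span IDs are invalid in the W3C format.
	if strings.Trim(parts[1], "0") == "" || strings.Trim(parts[2], "0") == "" {
		return TraceContext{}, fmt.Errorf("all-zero ID in traceparent %q", h)
	}
	flags, err := strconv.ParseUint(parts[3], 16, 8)
	if err != nil {
		return TraceContext{}, err
	}
	return TraceContext{TraceID: parts[1], SpanID: parts[2], Sampled: flags&1 == 1}, nil
}

// B3Single renders the context as a single-header b3 value:
// "<trace-id>-<span-id>-<sampling-state>".
func (tc TraceContext) B3Single() string {
	sampled := "0"
	if tc.Sampled {
		sampled = "1"
	}
	return tc.TraceID + "-" + tc.SpanID + "-" + sampled
}

func main() {
	tc, err := ParseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}
	fmt.Println("b3:", tc.B3Single())
	// prints "b3: 4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-1"
}
```

Feeding the b3 value back into ParseTraceparent returns an error, which is the code-level version of the silent break: the receiving side finds nothing it recognizes and starts a new trace.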

Sidecar vs. SDK Instrumentation: The Instrumentation Decision

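Envoy/Istio sidecar instrumentation gives you service-to-service request tracing for very little effort: Envoy generates spans and trace headers at the proxy layer for HTTP and gRPC traffic. One caveat: the sidecar cannot tie an inbound request to the outbound calls it triggers, so the application still has to copy the trace headers from each inbound request onto its outbound requests. Without that forwarding, every hop shows up as a separate trace. With it, the mesh covers the majority of inter-service call tracing.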

What sidecar instrumentation does not give you: span granularity inside the application, business logic context (user IDs, order IDs, feature flags), database query spans, or any instrumentation for async operations that don't go through the service mesh. A trace through Envoy shows you "order-service called payment-service and it took 847ms." It doesn't show you which database query inside payment-service accounted for 620ms of that time, or that the slow query was on the orders table and correlated with a specific order ID range.

The right architecture for most production Kubernetes environments: use Istio/Envoy sidecar instrumentation as the baseline (no instrumentation code beyond header forwarding, service graph visibility across the mesh), and add OTEL SDK instrumentation for services where internal span granularity matters for your SLIs. We're not saying sidecar-only is wrong; for many services it's sufficient. We're saying that the services on your critical path for user-facing SLIs almost always need SDK-level instrumentation to give you actionable root cause data.

What Happens When Your Trace Crosses a Kafka Boundary

This is the gap that breaks more production tracing setups than anything else. For HTTP and gRPC, the OTEL instrumentation libraries inject and extract context for you once they are wired in. Kafka has no equivalent default: message headers exist, but unless your language has a Kafka client instrumentation that you have explicitly enabled (the Java agent ships one; in Go you usually write it yourself), nothing injects or extracts trace context from them.

When order-service publishes an event to a Kafka topic and fulfillment-service consumes it, the default behavior is: the trace from the producer terminates, and a brand new root trace starts in the consumer. You lose the causal link. From the consumer's perspective, it processed a message with no tracing context. You cannot correlate a payment timeout in order-service with the fulfillment failures that followed three seconds later.

The fix requires code on both sides:

// Producer: inject trace context into Kafka message headers (Go + sarama)
propagator := otel.GetTextMapPropagator()
carrier := make(map[string]string)
propagator.Inject(ctx, propagation.MapCarrier(carrier))

msg := &sarama.ProducerMessage{
  Topic: "order-events",
  Value: sarama.StringEncoder(payload),
  Headers: headersFromMap(carrier), // inject trace context
}

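// Consumer: extract and restore trace context
// (trace is go.opentelemetry.io/otel/trace)
propagator := otel.GetTextMapPropagator()
carrier := extractCarrierFromHeaders(msg.Headers)
// Start from a fresh context: the producer's context exists only in the headers.
ctx := propagator.Extract(context.Background(), propagation.MapCarrier(carrier))
ctx, span := tracer.Start(ctx, "fulfillment.process-order-event",
  trace.WithSpanKind(trace.SpanKindConsumer))
defer span.End()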

This needs to be done for every producer/consumer pair. It's not a one-time config change — it requires touching application code. Factor this into your OTEL adoption timeline; if you have many Kafka-connected services, the async boundary instrumentation is likely the largest single work item in the migration.
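The snippets above call two helpers, headersFromMap and extractCarrierFromHeaders, without defining them. One way they might look is sketched below. To keep the sketch runnable without a Kafka client, RecordHeader here is a local stand-in with the same Key/Value byte-slice shape as sarama.RecordHeader; in real code you would use the sarama type directly. The producer side returns values and the consumer side takes pointers, which is an assumption matching how sarama exposes headers on produced and consumed messages.

```go
package main

import "fmt"

// RecordHeader is a local stand-in mirroring the Key/Value byte-slice shape
// of sarama.RecordHeader so the sketch has no external dependencies.
type RecordHeader struct {
	Key   []byte
	Value []byte
}

// headersFromMap converts an injected carrier map into Kafka record headers.
func headersFromMap(carrier map[string]string) []RecordHeader {
	headers := make([]RecordHeader, 0, len(carrier))
	for k, v := range carrier {
		headers = append(headers, RecordHeader{Key: []byte(k), Value: []byte(v)})
	}
	return headers
}

// extractCarrierFromHeaders rebuilds the carrier map on the consumer side.
// Nil entries are skipped so a malformed message cannot cause a panic.
func extractCarrierFromHeaders(headers []*RecordHeader) map[string]string {
	carrier := make(map[string]string, len(headers))
	for _, h := range headers {
		if h == nil {
			continue
		}
		carrier[string(h.Key)] = string(h.Value)
	}
	return carrier
}

func main() {
	produced := headersFromMap(map[string]string{
		"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	})

	// Simulate the broker round trip: the consumer sees pointers to headers.
	consumed := make([]*RecordHeader, len(produced))
	for i := range produced {
		consumed[i] = &produced[i]
	}

	fmt.Println(extractCarrierFromHeaders(consumed)["traceparent"])
	// prints "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
```

Putting both helpers in a shared internal package keeps the per-service change down to the few lines shown in the producer and consumer snippets.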

Attribute Cardinality: Your First Scaling Bottleneck

Trace IDs are 128-bit random values, meaning every trace has a globally unique ID. In a high-traffic system generating 10,000 traces per second, you're creating 10,000 new trace IDs per second. This isn't the cardinality problem — trace IDs are expected to be unique.

The cardinality problem comes from trace attributes (span tags). If you add a user.id attribute to spans, you now have a potentially unbounded set of unique attribute combinations: service × operation × user.id. In a Jaeger or Tempo backend, this multiplies index storage. In a Prometheus-based trace metrics pipeline (if you're using the OTEL Collector's spanmetrics processor, or the spanmetrics connector that supersedes it in newer Collector releases, to derive RED metrics from traces), user-level attributes cause a metric series explosion.

A concrete example: a payment processing service emitting http.server.duration metrics derived from traces, with dimensions http.method × http.status_code × service.name — that's maybe 5 × 6 × 1 = 30 time series. Add user.id to that metric derivation with 50,000 active users and you have 1,500,000 time series for one metric. The spanmetrics processor configuration needs explicit attribute allow-lists:

# otel-collector-config.yaml (spanmetrics processor)
processors:
  spanmetrics:
    metrics_exporter: prometheus
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: service.name
      # Do NOT include user.id, session.id, or request.id here.
      # user.id belongs in trace attributes (for trace-level queries),
      # not in derived metrics, where it would cause a cardinality explosion.
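The arithmetic behind the example is a plain product of per-dimension value counts, which is worth running against your own dimension list before you ship a spanmetrics config. A minimal Go sketch follows; Dimension and SeriesCount are illustrative names, and the value counts are the ones assumed in the example above, not measurements.

```go
package main

import "fmt"

// Dimension is a metric label and the number of distinct values it can take.
type Dimension struct {
	Name   string
	Values int
}

// SeriesCount returns the worst-case number of time series for one metric:
// the product of the distinct-value counts of all its dimensions.
func SeriesCount(dims []Dimension) int {
	total := 1
	for _, d := range dims {
		total *= d.Values
	}
	return total
}

func main() {
	base := []Dimension{
		{Name: "http.method", Values: 5},
		{Name: "http.status_code", Values: 6},
		{Name: "service.name", Values: 1},
	}
	fmt.Println(SeriesCount(base)) // prints 30

	withUser := append(base, Dimension{Name: "user.id", Values: 50000})
	fmt.Println(SeriesCount(withUser)) // prints 1500000
}
```

This is an upper bound: not every combination occurs in practice, but unbounded dimensions such as user.id grow with traffic rather than with your service count, so the bound keeps moving.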

Tail-Based Sampling vs. Head-Based: The Production Configuration

Head-based sampling decides at trace start whether to record the trace. It's simple and low-overhead. The problem is fundamental: you make the keep/drop decision before you know if the trace is interesting. A 10% head-based sample rate means you drop 90% of your error traces, your slow traces, and your unusual traces — exactly the ones you need during an incident.

Tail-based sampling buffers the spans of each trace for a fixed window (decision_wait, counted from the first span the collector sees for that trace), then evaluates policies against whatever has arrived by then:

# otel-collector-config.yaml (tail sampling processor)
processors:
  tail_sampling:
    decision_wait: 10s        # buffer window — wait for trace completion
    num_traces: 50000         # max concurrent buffered traces (size carefully)
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

The tradeoff that documentation undersells: tail sampling requires a stateful collector tier. All spans for a given trace must arrive at the same collector instance, or the decision logic can't assemble the full trace for evaluation. This means you need trace-ID-aware routing in front of your tail-sampling collector tier. The typical answer is a two-tier setup where the first tier routes spans by trace ID using consistent hashing (the contrib loadbalancing exporter does this), and the second tier runs the tail_sampling processor.

The memory sizing for num_traces is critical. At 50,000 buffered traces with an average of 20 spans per trace and ~1KB per span, you're at ~1GB of trace buffer. The number of concurrently buffered traces is roughly the arrival rate multiplied by decision_wait, so a traffic spike of 5,000 traces/second with a 10-second decision_wait fills all 50,000 slots. Go past num_traces and the processor evicts the oldest buffered traces before their sampling decision is made, which discards them regardless of policy. Plan a fallback for sustained overload, such as tightening head-based sampling upstream.
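The sizing rule above is easy to turn into a back-of-the-envelope calculator. The Go sketch below assumes the same simplifications as the text (a flat average span count and span size, no per-trace overhead); EstimateBuffer and BufferEstimate are illustrative names, and real collector memory use will be higher than the raw span bytes.

```go
package main

import "fmt"

// BufferEstimate is a rough tail-sampling buffer requirement.
type BufferEstimate struct {
	ConcurrentTraces int   // traces held while waiting for a decision
	Bytes            int64 // raw span bytes held at that concurrency
}

// EstimateBuffer multiplies arrival rate by the decision window to get the
// number of in-flight traces, then by spans per trace and bytes per span.
func EstimateBuffer(tracesPerSec, decisionWaitSec, spansPerTrace, bytesPerSpan int) BufferEstimate {
	concurrent := tracesPerSec * decisionWaitSec
	return BufferEstimate{
		ConcurrentTraces: concurrent,
		Bytes:            int64(concurrent) * int64(spansPerTrace) * int64(bytesPerSpan),
	}
}

func main() {
	const gib = 1 << 30

	steady := EstimateBuffer(1000, 10, 20, 1024)
	fmt.Printf("steady: %d traces, %.2f GiB\n", steady.ConcurrentTraces, float64(steady.Bytes)/gib)
	// prints "steady: 10000 traces, 0.19 GiB"

	spike := EstimateBuffer(5000, 10, 20, 1024)
	fmt.Printf("spike:  %d traces, %.2f GiB\n", spike.ConcurrentTraces, float64(spike.Bytes)/gib)
	// prints "spike:  50000 traces, 0.95 GiB"
}
```

Comparing the spike figure with num_traces tells you how much headroom the configuration has: with the values above, a 5x spike over expected_new_traces_per_sec uses the whole buffer.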

Multi-Cluster Namespace Boundaries: Where Traces Break

In a multi-cluster Kubernetes environment, trace propagation across cluster boundaries requires explicit configuration at the ingress and egress layers. If your cluster-to-cluster traffic goes through a gateway (AWS ALB, GCP Load Balancer, nginx ingress), verify that the gateway passes through traceparent headers. Some gateway, WAF, and proxy configurations strip or overwrite headers they don't recognize, so don't assume passthrough.

The verification test: generate a trace that crosses a cluster boundary, then check that the receiving cluster's spans carry the same trace ID as the sender and a valid parent span ID. If the parent span ID is missing or all zeros, or the receiver started a new trace ID, context propagation is being dropped at the boundary. The fix is usually a header passthrough rule in the gateway configuration: a small change, but invisible until you specifically test for it.

For service meshes that span multiple clusters (Istio multi-cluster or Linkerd multi-cluster), the service mesh handles intra-cluster propagation but the inter-cluster traffic often goes through a gateway that isn't mesh-aware. Treat cluster boundaries as explicit propagation checkpoints that need individual verification, not as locations where propagation "should just work."

Connecting Traces to Logs: The Integration That Makes Traces Useful

A trace without a log connection is a latency waterfall: useful for performance analysis, less useful for debugging application errors. The integration that makes traces genuinely actionable during incidents: inject the current trace context into your structured log output, so every log line emitted during a traced operation carries the trace ID and span ID.

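// Go: inject trace context into zap logger fields
// imports: context, go.uber.org/zap, go.opentelemetry.io/otel/trace
func loggerFromContext(ctx context.Context, base *zap.Logger) *zap.Logger {
  span := trace.SpanFromContext(ctx)
  sc := span.SpanContext()
  // No active or valid span: return the base logger rather than logging zero IDs.
  if !sc.IsValid() {
    return base
  }
  // Hex-encoded IDs match what trace backends display and accept in search.
  return base.With(
    zap.String("trace_id", sc.TraceID().String()),
    zap.String("span_id", sc.SpanID().String()),
  )
}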

With this in place, the investigation path becomes: error log → trace ID → full trace waterfall → correlated spans from all participating services. The trace gives you the structural context (which services were involved, call sequence, timing); the logs give you the application-level detail (the exact error message, the specific record that failed, the business context). Together they're substantially more useful than either signal alone.

The gap this closes isn't just convenience — it's the difference between "payment-service returned an error" and "payment-service returned an error because the downstream risk-scoring service returned a 503 on this specific transaction ID after a 2.1-second timeout." The first lives in logs or traces alone. The second requires both, connected by trace ID.