Tail sampling with the OpenTelemetry collector

Head sampling is what most SDKs do by default. The very first service in the trace flips a weighted coin; if it lands wrong, the trace is dropped at every downstream hop. It is cheap and it gets you a uniform sample of all traffic, but it cannot prefer the traces you actually care about, because at the time of the coin flip the trace has not happened yet.

Tail sampling moves the coin to the end. The tail_sampling processor in the OTel collector (contrib distribution) buffers spans by trace ID, waits a fixed decision_wait from the first span it sees for that trace (it has no way to detect that the trace has finished), then runs its policies against whatever has been assembled by then. You can keep all errors, all traces slower than 1 s, and 1 % of everything else. Head sampling can do only the last of those.
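To make the buffer-then-decide step concrete, here is a minimal Python sketch of the idea. It is not the processor's code: Span, group_by_trace, decide and the baseline_keep callback are hypothetical names used only for illustration, and the rules are the three examples from the paragraph above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    start_ms: int    # span start, milliseconds
    end_ms: int      # span end, milliseconds
    is_error: bool   # True if the span status is ERROR


def group_by_trace(spans):
    """Buffer spans by trace ID, as happens while the decision wait runs."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return traces


def decide(trace, baseline_keep, slow_ms=1000):
    """Return the reason a trace is kept, or None if it is dropped."""
    if any(s.is_error for s in trace):
        return "errors"
    # End-to-end duration: earliest start to latest end among buffered spans.
    duration = max(s.end_ms for s in trace) - min(s.start_ms for s in trace)
    if duration > slow_ms:
        return "slow"
    if baseline_keep(trace[0].trace_id):
        return "baseline"
    return None


spans = [
    Span("a", 0, 40, False), Span("a", 5, 30, True),   # contains an error
    Span("b", 0, 1500, False),                          # slower than 1 s
    Span("c", 0, 20, False),                            # fast and healthy
]
traces = group_by_trace(spans)
# baseline_keep is stubbed to "never" so only the first two rules fire here.
decisions = {tid: decide(t, baseline_keep=lambda _tid: False)
             for tid, t in traces.items()}
print(decisions)
```

The duration is computed only from spans that have arrived, which is why a trace that is still running when the wait expires can look faster than it really is.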

The price

Memory. The collector holds every in-flight trace in memory until its decision wait expires. At our peak we see a few million new traces per minute; with a 30-second wait, that works out to tens of GB resident per collector instance. The two knobs that matter are decision_wait and num_traces:

processors:
  tail_sampling:
    decision_wait: 30s          # how long to wait for the trace to finish
    num_traces: 5000000         # in-flight trace cap
    expected_new_traces_per_sec: 100000  # sizing hint for internal data structures
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

If num_traces is too low, the oldest traces are evicted before their decision wait expires and are dropped without ever being evaluated, errors included. If decision_wait is too short, slow traces that have not finished yet never get the chance to be flagged by the latency policy, and you miss the most interesting ones.
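A back-of-the-envelope check of those two knobs, as a small Python sketch. The rate, wait and cap come from the config above; the average buffered size per trace (10 KiB) is purely an assumption for illustration, so measure your own before trusting the result.

```python
def in_flight_traces(new_traces_per_sec, decision_wait_s):
    """Traces buffered at steady state: arrival rate times wait."""
    return new_traces_per_sec * decision_wait_s


def resident_bytes(in_flight, avg_trace_bytes):
    """Rough span payload held in memory (ignores per-trace overhead)."""
    return in_flight * avg_trace_bytes


NEW_TRACES_PER_SEC = 100_000   # expected_new_traces_per_sec from the config
DECISION_WAIT_S = 30           # decision_wait from the config
NUM_TRACES = 5_000_000         # num_traces from the config
AVG_TRACE_BYTES = 10 * 1024    # ASSUMPTION: 10 KiB per buffered trace

n = in_flight_traces(NEW_TRACES_PER_SEC, DECISION_WAIT_S)
print(n)                                                      # 3000000
print(f"{resident_bytes(n, AVG_TRACE_BYTES) / 2**30:.1f} GiB")  # 28.6 GiB
print(f"headroom: {NUM_TRACES / n:.2f}x")                     # headroom: 1.67x
```

If the headroom drops below 1x, num_traces is the binding limit and traces start being evicted before their decision is made.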

Sharding by trace ID

The policies have to see every span of a trace. If you scale the collector horizontally, you need a load balancer in front that hashes by trace ID; otherwise the spans of one trace land on different collector instances and none of them sees the complete trace.

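The collector's contrib distribution ships a loadbalancing exporter that does this. Put a stateless first tier in front (any OTLP receiver) and have it route by trace ID, with consistent hashing, to a second tier that runs tail_sampling; the second tier sends the kept traces to the backend. Run at least three boxes per role. Buffered traces live only in memory, so losing a tier-2 collector always loses the traces it was holding; with three or more instances that is roughly a third of the in-flight traces at worst, and the rest of the tier keeps sampling.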
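Why consistent hashing rather than a plain hash modulo N: when a tier-2 box disappears, only the trace IDs it owned should move. The Python sketch below is a simplified stand-in for that idea, not the loadbalancing exporter's actual algorithm; HashRing, the vnodes count and the tier2-* addresses are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend gets several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def backend_for(self, trace_id: str) -> str:
        # First ring point after the trace ID's hash, wrapping at the end.
        idx = bisect.bisect(self._points, _hash(trace_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["tier2-a:4317", "tier2-b:4317", "tier2-c:4317"])
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Every span of this trace is sent to the same tier-2 collector.
print(ring.backend_for(trace_id))
```

Rebuilding the ring without one backend leaves every trace ID owned by the survivors where it was. Traces already buffered on the lost box are still gone, and traces that were in flight during the change can end up split across two collectors.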

What I keep

All traces with at least one ERROR span.
All traces where end-to-end latency > p99 baseline (we recompute the threshold daily).
1 % of everything else, deterministically hashed on trace ID so two services see the same sample.

That keeps storage at a small, predictable fraction of traffic while the interesting tail stays covered.
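The deterministic 1 % can be sketched in a few lines of Python. This is an illustration of hashing on trace ID, not the hash the probabilistic policy uses internally; keep_baseline and the choice of SHA-256 are assumptions.

```python
import hashlib


def keep_baseline(trace_id: str, percentage: float = 1.0) -> bool:
    """Keep a trace if its hashed ID falls in the lowest `percentage` of
    the 64-bit hash space. Same ID in, same answer out, on any machine."""
    h = int.from_bytes(hashlib.sha256(trace_id.encode()).digest()[:8], "big")
    threshold = int(percentage / 100 * 2**64)
    return h < threshold


# Synthetic 32-hex-digit trace IDs, only to exercise the function.
ids = [f"{i:032x}" for i in range(100_000)]
kept = sum(keep_baseline(t) for t in ids)
print(f"kept {kept} of {len(ids)}")
```

Because the decision depends only on the trace ID, any process that evaluates the same trace reaches the same verdict without coordination.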