Head sampling is what most SDKs do by default. The very first service in the trace flips a weighted coin; if it lands wrong, the trace is dropped at every downstream hop. It is cheap and it gets you a uniform sample of all traffic, but it cannot prefer the traces you actually care about, because at the time of the coin flip the trace has not happened yet.
Tail sampling moves the coin to the end. The tail_sampling processor in the OTel collector buffers spans by trace ID, waits for decision_wait to elapse (counted from the first span it sees for that trace), then runs its policies against whatever it has assembled. You can keep all errors, all traces slower than 1 s, and 1 % of everything else. Head sampling can do the last of those but not the first two.
The price
Memory. The collector holds every active trace in memory until the decision wait expires. At our peak that is a few million traces per minute with a 30-second wait, which means tens of GB resident per collector instance. The two knobs that matter are decision_wait and num_traces:
    processors:
      tail_sampling:
        decision_wait: 30s        # how long to wait for the trace to finish
        num_traces: 5000000       # in-flight trace cap
        expected_new_traces_per_sec: 100000
        policies:
          - name: errors
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow
            type: latency
            latency: { threshold_ms: 1000 }
          - name: baseline
            type: probabilistic
            probabilistic: { sampling_percentage: 1 }
If num_traces is too low, traces get evicted before the decision wait expires and you sample blindly. If decision_wait is too short, slow traces are evaluated before they have finished, the latency policy never gets the chance to flag them, and you miss the most interesting ones.
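As a rough sizing check for num_traces, the number of traces buffered at any moment is the arrival rate times the decision wait (Little's law). The figures below are illustrative assumptions, not measurements: "a few million traces per minute" is taken as 3 million per minute arriving at a single instance, and the average buffered trace size of 20 KB is a placeholder chosen only to show the arithmetic.

```latex
\begin{align*}
N_{\text{in-flight}} &= \lambda \cdot W
  = \frac{3\,000\,000\ \text{traces}}{60\ \text{s}} \times 30\ \text{s}
  = 1\,500\,000\ \text{traces} \\
M_{\text{resident}} &\approx N_{\text{in-flight}} \cdot \bar{s}
  = 1\,500\,000 \times 20\ \text{KB}
  = 30\ \text{GB}
\end{align*}
```

Under these assumptions, num_traces: 5000000 leaves roughly 3x headroom over the steady-state in-flight count. Measuring the real average trace size on your own traffic would replace the placeholder.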
Sharding by trace ID
The decision policy has to see every span of the same trace. If you run the collector horizontally, you need a load balancer in front that hashes by trace ID; otherwise spans for one trace land on different collector instances and none of them sees a complete trace.
The collector ships a loadbalancing exporter that does this. Put a stateless first tier in front (any OTLP receiver), have it route to a tail-sampling second tier with trace-ID-consistent hashing, and send the kept traces to the backend from the second tier. Run at least three boxes per role: a tier-2 collector that dies takes the traces it was buffering with it, so with fewer instances a single failure costs a larger share of in-flight traces.
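A minimal sketch of what the first tier could look like, assuming the second tier is reachable through a DNS name that resolves to every tier-2 instance. The hostname is hypothetical, and the plaintext TLS setting is an assumption suitable only for a trusted internal network; adapt both to your environment.

```yaml
# Tier 1: stateless, receives OTLP and fans out by trace ID.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID          # all spans of one trace go to the same backend
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        # Hypothetical name; should resolve to all tier-2 collector IPs.
        hostname: otel-tier2.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The second tier would then run an OTLP receiver, the tail_sampling processor, and whatever exporter your backend needs. Keeping the first tier free of processors that hold state is what lets you scale or restart it without losing buffered traces.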
What I keep
- All traces with at least one ERROR span.
- All traces where end-to-end latency > p99 baseline (we recompute the threshold daily).
- 1 % of everything else, deterministically hashed on trace ID so two services see the same sample.
That keeps storage flat against traffic growth and the interesting tail covered.
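One way to read "deterministically hashed on trace ID" is as a threshold test on a hash of the ID. The formulation below is a generic sketch: h stands for any hash that spreads trace IDs uniformly, and M is an arbitrary bucket count. It is not a description of the collector's internal algorithm.

```latex
\[
\text{keep}(t) \iff h(t) \bmod M < \lfloor p \cdot M \rfloor,
\qquad
p = 0.01,\ M = 10\,000
\ \Rightarrow\
h(t) \bmod 10\,000 < 100
\]
```

Because the outcome depends only on the trace ID t, any component that evaluates the same ID with the same h reaches the same keep-or-drop decision.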