Static membership saved our rebalance budget

We have a Kafka consumer group with 40 instances reading from a 200-partition topic. Each rolling deploy used to trigger a sequence of rebalances: every restarted pod left the group, the broker reassigned partitions across the survivors, then the pod came back and another rebalance reassigned again. Eager rebalance protocol means processing stops for everyone for the duration. With 40 restarts and ~3 s per rebalance, that is roughly two minutes of stop-the-world per deploy.

Two changes brought it down to a few seconds total.

Stable instance IDs

Since KIP-345 (Kafka 2.3+), the consumer can announce a group.instance.id. The coordinator treats the consumer as a known returning member rather than a new one. As long as the consumer comes back within session.timeout.ms, no rebalance fires: the coordinator just keeps holding the partitions for it.

The trick is that group.instance.id must be stable across restarts of the same logical instance. We use the pod name from the StatefulSet for that:

// In the StatefulSet pod spec
- name: GROUP_INSTANCE_ID
  valueFrom:
    fieldRef:
      fieldPath: metadata.name      // app-consumer-0, app-consumer-1, ...

// In the consumer config
config.Consumer.Group.InstanceId = os.Getenv("GROUP_INSTANCE_ID")

Deployment becomes a Deployment-to-StatefulSet migration, which is the part that took the most calendar time.

Tuning the session timeout

Default session.timeout.ms on the broker side is 45 s, with a min of 6 s. If your pod restart takes longer than the timeout, the coordinator gives up on you and rebalances anyway. We measured P95 pod restart at about 35 s (pull image, run init container, start consumer, finish in-flight assign), so we bumped it to 60 s. Tradeoff: detection of an actually-dead consumer is slower, but for batch processing that is fine.

Co-operative rebalance protocol

If you cannot do static membership for some reason (Deployment is hard to give up, no StatefulSet, no KIP-345 client), the second-best option is the cooperative protocol. It rebalances only the moving partitions instead of revoking everything and reassigning. Sarama supports it through config.Consumer.Group.Rebalance.GroupStrategies; the Sarama docs show the exact wiring.

Result

Stop-the-world per deploy dropped from ~120 s to under 5 s, all of which is the protocol overhead of generation bumps. Throughput recovers within a single batch. The lag chart looks like a flat line with a tiny hiccup instead of a deep valley.