Consumer Lag Evaluation Rules

Srinivasulu Punuru edited this page Nov 15, 2018 · 8 revisions

The status of a consumer group in Burrow is determined by several rules evaluated against the offsets for each partition the group consumes. By establishing a set of rules that defines "good" and "bad" consumer behavior, we can check whether the consumer is performing well without setting a discrete threshold for the number of messages a consumer is allowed to be behind before alerts go off. By evaluating every partition the group consumes, we ensure that the entire consumer group is healthy, and not just the one or two topics that are being monitored. This is especially important for wildcard consumers, such as Kafka Mirror Maker.

Evaluation Window

The Storage subsystem intervals configuration determines the length of our sliding window. This specifies the number of offsets to store for each partition that a consumer group consumes. The default setting is 10 which, when combined with an offset commit interval (configured on the consumer) of 60 seconds, means we evaluate the status of a consumer group over 10 minutes. This window moves forward with each offset the consumer commits (the oldest offset is removed when the new offset is added).
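The sliding window described above can be sketched with a bounded queue. This is a minimal illustration, not Burrow's implementation: the names `INTERVALS`, `COMMIT_INTERVAL_S`, and `make_window` are hypothetical, and the values simply mirror the default of 10 and the 60-second commit interval used in the text.

```python
from collections import deque

INTERVALS = 10          # hypothetical name; mirrors the "intervals" default of 10
COMMIT_INTERVAL_S = 60  # hypothetical name; consumer commit interval in seconds


def make_window(intervals=INTERVALS):
    """Return an empty window that holds at most `intervals` entries."""
    # A deque with maxlen drops its oldest entry when a new one is appended.
    return deque(maxlen=intervals)


window = make_window()
for i in range(12):
    # Each entry is (committed offset, commit timestamp in seconds).
    window.append((10 * (i + 1), COMMIT_INTERVAL_S * i))

# After 12 commits only the newest 10 remain in the window.
oldest_offset = window[0][0]
newest_offset = window[-1][0]

# Approximate period covered by the window, as described in the text.
window_seconds = INTERVALS * COMMIT_INTERVAL_S
```

With these assumed values, the window covers 600 seconds (10 minutes), and the two oldest commits have already been dropped.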

For each consumer offset, we store the offset itself, the timestamp at which the consumer committed it, and the lag at the point Burrow received it. The lag is calculated as the difference between the HEAD offset of the broker and the consumer's offset. Because broker offsets are fetched on a fixed interval, the broker offset Burrow holds can be older than the consumer's commit, so this calculation can produce a negative number. If this happens, the stored lag value is zero.
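As a rough sketch of the stored record and the lag calculation described above (the names `StoredOffset`, `compute_lag`, and `store` are hypothetical and are not taken from Burrow's code):

```python
from typing import NamedTuple


class StoredOffset(NamedTuple):
    offset: int     # offset committed by the consumer
    timestamp: int  # time of the commit, in seconds
    lag: int        # lag at the point the commit was received


def compute_lag(broker_head_offset: int, consumer_offset: int) -> int:
    """Lag is broker HEAD minus consumer offset, never stored below zero."""
    # A stale broker offset can make the difference negative; store zero instead.
    return max(0, broker_head_offset - consumer_offset)


def store(consumer_offset: int, timestamp: int, broker_head_offset: int) -> StoredOffset:
    """Build the record kept in the window for one offset commit."""
    lag = compute_lag(broker_head_offset, consumer_offset)
    return StoredOffset(consumer_offset, timestamp, lag)


normal = store(consumer_offset=100, timestamp=540, broker_head_offset=105)
stale_broker = store(consumer_offset=100, timestamp=540, broker_head_offset=95)
```

Here `normal.lag` is 5, while `stale_broker.lag` is clamped to 0 because the broker offset was fetched before the consumer's commit.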

Evaluation Rules

The following rules are used for evaluation of a group's status for a given partition:

  1. If any lag within the window is zero, the status is considered to be OK.
  2. If the consumer offset does not change over the window, and the lag is either fixed or increasing, the consumer is in an ERROR state, and the partition is marked as STALLED. This means the consumer is still committing offsets, but the offsets are not advancing.
  3. If the consumer offsets are increasing over the window, but the lag either stays the same or increases between every pair of offsets, the consumer is in a WARNING state. This means that the consumer is slow, and is falling behind.
  4. If the difference between the time now and the time of the most recent offset is greater than the difference between the time of the most recent offset and the time of the oldest offset in the window, the consumer is in an ERROR state and the partition is marked as STOPPED. However, if the consumer offset and the current broker offset for the partition are equal, the partition is not considered to be in error. This means the consumer has not committed offsets recently.
  5. If the lag is -1, this is a special value that means we do not have a broker offset yet for that partition. This only happens when Burrow is starting up, and the status is considered to be OK.
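The rules above can be sketched as a single function. This is only an illustration under stated assumptions, not Burrow's actual evaluator: the names `StoredOffset`, `evaluate_partition`, and `make_window` are hypothetical, the checks run in the order the rules are listed (Burrow itself may order them differently), rule 5 is applied to any stored lag of -1, and the window is assumed to be full.

```python
from typing import NamedTuple, Optional, Sequence


class StoredOffset(NamedTuple):
    offset: int
    timestamp: int  # seconds
    lag: int


def evaluate_partition(window: Sequence[StoredOffset], now: int,
                       broker_offset: Optional[int] = None) -> str:
    """Return a partition status: OK, STALLED, WARNING, or STOPPED."""
    lags = [w.lag for w in window]
    offsets = [w.offset for w in window]

    # Rule 5: -1 means no broker offset is known yet (startup only).
    if any(lag == -1 for lag in lags):
        return "OK"
    # Rule 1: any zero lag within the window means the partition is OK.
    if any(lag == 0 for lag in lags):
        return "OK"

    # True when the lag stays the same or grows between every pair of entries.
    lag_not_decreasing = all(b >= a for a, b in zip(lags, lags[1:]))

    # Rule 2: offsets never change while lag is fixed or increasing.
    if all(o == offsets[0] for o in offsets) and lag_not_decreasing:
        return "STALLED"  # consumer status: ERROR
    # Rule 3: offsets advance, but the lag never decreases.
    if offsets[-1] > offsets[0] and lag_not_decreasing:
        return "WARNING"

    # Rule 4: time since the newest commit exceeds the time span of the window,
    # unless the consumer has already reached the current broker offset.
    oldest, newest = window[0], window[-1]
    since_last_commit = now - newest.timestamp
    window_span = newest.timestamp - oldest.timestamp
    if since_last_commit > window_span and newest.offset != broker_offset:
        return "STOPPED"  # consumer status: ERROR
    return "OK"


def make_window(offsets, lags, interval=60):
    """Hypothetical helper: build a window with commits `interval` seconds apart."""
    return [StoredOffset(o, i * interval, l)
            for i, (o, l) in enumerate(zip(offsets, lags))]


advancing = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
slow = make_window(advancing, [1, 1, 1, 1, 1, 2, 2, 2, 3, 3])
status = evaluate_partition(slow, now=540)
```

In this usage, `status` is `"WARNING"`: the offsets advance, but the lag never decreases anywhere in the window.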

Burrow will attempt to evaluate partitions early, before the window of offsets is full. This can lead to some false information as Burrow is starting up and picking up offset commits. The "complete" flag in the group status will specify, as a percentage, how much data Burrow has to work with.
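One plausible reading of the "complete" value is the share of the window that has been filled so far. The sketch below assumes exactly that; the function name `completeness_percent` and the formula are assumptions for illustration, not taken from Burrow.

```python
def completeness_percent(stored_count: int, intervals: int = 10) -> float:
    """Assumed formula: percentage of the window filled with stored offsets."""
    if intervals <= 0:
        raise ValueError("intervals must be positive")
    # Never report more than a full window.
    filled = min(stored_count, intervals)
    return 100.0 * filled / intervals


early = completeness_percent(4)   # shortly after startup
full = completeness_percent(10)   # window is full
```

Under this assumption, a group with 4 of 10 offsets stored would be reported as 40 percent complete, and its status should be read with that in mind.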

Examples

For the following examples, the header (W1, W2, W3, etc.) indicates the stored offset information within the window, from oldest to newest. The offset and lag are the committed offset and calculated lag for that interval, and the timestamp is a relative number of seconds from an arbitrary time (T+60 is our base time plus 60 seconds).

Example 1

For partition 0 of topicA, our group has the following stored offsets:

           W1  W2    W3     W4     W5     W6     W7     W8     W9     W10
Offset     10  20    30     40     50     60     70     80     90     100
Lag        0   0     0      0      0      0      1      3      5      5
Timestamp  T   T+60  T+120  T+180  T+240  T+300  T+360  T+420  T+480  T+540

This partition is considered to be OK based on evaluation of rule #1 (the lag was zero at some point during the window). If the lag continues to stay the same or increase, this partition could move into a WARNING state after 6 more offset commits (once the zeros clear out).
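The count of 6 commits can be checked with a small calculation. This sketch assumes that every future commit carries a non-zero lag and that each new commit evicts the oldest entry; the function name `commits_until_zeros_clear` is hypothetical.

```python
def commits_until_zeros_clear(lags):
    """Return how many more commits are needed before no zero lag remains."""
    zero_positions = [i for i, lag in enumerate(lags) if lag == 0]
    if not zero_positions:
        return 0
    # Each commit evicts one entry from the front of the window, so the
    # newest zero (at index k) is gone after k + 1 commits.
    return zero_positions[-1] + 1


example_1_lags = [0, 0, 0, 0, 0, 0, 1, 3, 5, 5]
remaining = commits_until_zeros_clear(example_1_lags)
```

For the Example 1 window, `remaining` is 6, which matches the text: the last zero sits in W6, so six more commits push it out of the window.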

Example 2

For partition 0 of topicA, our group has the following stored offsets:

           W1  W2    W3     W4     W5     W6     W7     W8     W9     W10
Offset     10  10    10     10     10     10     10     10     10     10
Lag        1   1     1      1      1      2      2      2      3      3
Timestamp  T   T+60  T+120  T+180  T+240  T+300  T+360  T+420  T+480  T+540

This partition is considered to be in a STALLED state, and the consumer status will be marked as ERROR, based on evaluation of rule #2. The consumer is committing offsets, but they are not advancing and there is lag that is persisting. The consumer is running, but it is not consuming messages on this partition.

Example 3

For partition 0 of topicA, our group has the following stored offsets:

           W1  W2    W3     W4     W5     W6     W7     W8     W9     W10
Offset     10  20    30     40     50     60     70     80     90     100
Lag        1   1     1      1      1      2      2      2      3      3
Timestamp  T   T+60  T+120  T+180  T+240  T+300  T+360  T+420  T+480  T+540

This partition is considered to be in a WARNING state, based on evaluation of rule #3. The consumer is moving forwards, but the lag never decreases anywhere in the window. This means the consumer is falling behind. This is only considered to be a warning because the consumer has not failed; it is just slow.

Example 4

For partition 0 of topicA, our group has the following stored offsets:

           W1  W2    W3     W4     W5     W6     W7     W8     W9     W10
Offset     10  20    30     40     50     60     70     80     90     100
Lag        5   3     5      2      1      1      2      1      4      6
Timestamp  T   T+60  T+120  T+180  T+240  T+300  T+360  T+420  T+480  T+540

This partition is considered to be in an OK state, because it does not match rule #3. Even though the lag has increased between the first and last offsets stored, the lag decreased between at least two points (such as W3 and W4). This means that the consumer is moving forwards and was catching up during part of the window. This pattern is common for topics that are busy.

Example 5

The partition offset information is exactly the same as in example 4, above. The difference is that it is now T+1200 when the status evaluation is performed.

The partition is now considered to be in a STOPPED state, and the consumer status will be marked as ERROR, based on evaluation of rule #4. The difference in time between the first offset stored and the last offset stored is 540 seconds, and the difference in time between now and the last offset stored is 660 seconds. The consumer has stopped committing offsets, which means it has failed or has been stopped.

Offset Expiration

The Storage subsystem expire-group configuration specifies when a consumer should be considered to have gone away permanently. The newest offset for each partition is checked to see if it was received longer than the specified number of seconds ago, and if so, the partition is removed for the consumer. If all partitions in a topic are removed, the topic is removed as well. If all topics for the consumer group are removed, the group is removed from Burrow. It's recommended that this setting be long enough that you know the consumer has fallen off the beginning of the partition (that is, the last offset committed is older than the oldest offset the broker has). If the consumer begins committing offsets again, monitoring will resume from that point.
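The cascading removal described above can be sketched as follows. The nested dictionary layout and the name `expire` are hypothetical, chosen only to illustrate how partitions, topics, and groups drop out in turn; the numbers in the usage are made up.

```python
def expire(groups, now, expire_group_s):
    """Return a copy of `groups` with expired partitions, topics, and groups removed.

    `groups` maps group -> topic -> partition -> the time (in seconds) at
    which the newest offset for that partition was received.
    """
    result = {}
    for group, topics in groups.items():
        kept_topics = {}
        for topic, partitions in topics.items():
            # Keep a partition only if its newest offset is recent enough.
            kept = {p: ts for p, ts in partitions.items()
                    if now - ts <= expire_group_s}
            if kept:  # a topic with no partitions left is removed
                kept_topics[topic] = kept
        if kept_topics:  # a group with no topics left is removed
            result[group] = kept_topics
    return result


groups = {
    "g1": {"topicA": {0: 100, 1: 900}, "topicB": {0: 50}},
    "g2": {"topicC": {0: 10}},
}
remaining = expire(groups, now=1000, expire_group_s=500)
```

With these made-up values, only partition 1 of topicA in group g1 survives; topicB and the whole of g2 are removed because all of their partitions have expired.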