This repository has been archived by the owner on Jan 8, 2020. It is now read-only.

Delivering disordered messages during shutdown(rebalance), causing message loss #255

Closed
longquanzheng opened this issue Aug 9, 2018 · 19 comments

Comments

@longquanzheng

longquanzheng commented Aug 9, 2018

We are experiencing message loss when restarting workers; it happens consistently.
But when I added these logs, the message loss disappears.

longquanzheng@0c61134

I think this is because the logging slows down the CommitOffsets() function. There must be some race condition with it.
Do you have any idea?

@dim
Member

dim commented Aug 10, 2018

@longquanzheng I am currently working on implementing cluster functionality into sarama directly - IBM/sarama#1099. Only a few small bits are missing at this point and I aim to finish those next week.

@dim
Member

dim commented Aug 10, 2018

@longquanzheng I am nevertheless happy to fix the race if you can identify how it happens. We have quite a "fuzzy" test in our suite (see https://github.com/bsm/sarama-cluster/blob/master/consumer_test.go#L284) which seems to pass without losing any messages.

@longquanzheng
Author

longquanzheng commented Aug 10, 2018

@dim I found that sarama-cluster delivers messages out of order from the moment Close() is called to shut down the library.

Here is an example: worker A owns the partition and consumes 1, 2, 3, ..., 9, 10; everything is fine. Then we call Close() to shut down, and it starts to receive lots of out-of-order messages, for example 100 (jumping over 11~99).
If worker A commits any of these, then worker B will have to start from 101, which means we miss 11~99. However, if worker A doesn't commit 100, then worker B can still continue from 11 and we won't miss any messages.

To mitigate this issue, I make our consuming process sleep for 2 seconds before shutting down sarama-cluster, and it works (uber/cadence@ec218d7).
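A minimal sketch of that workaround (shutdownConsumer and stopProcessing are hypothetical names standing in for our own code, not the actual cadence change):

package worker

import (
	"time"

	cluster "github.com/bsm/sarama-cluster"
)

// shutdownConsumer pauses before Close() so in-flight messages and pending
// CommitOffsets() calls can settle before sarama-cluster tears down.
func shutdownConsumer(consumer *cluster.Consumer, stopProcessing func()) error {
	stopProcessing()            // stop dispatching messages to our workers
	time.Sleep(2 * time.Second) // give pending commits time to finish
	return consumer.Close()     // then shut down sarama-cluster
}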

But this is not a final solution. We want to understand why sarama-cluster starts delivering out of order when we call Close().

Note that this issue only reproduces in our production environment when hosts are busy running lots of processes concurrently. We are not able to reproduce it on a laptop or idle hardware.

longquanzheng changed the title from Race condition in CommitOffsets() to Delivering disordered messages during shutdown on Aug 10, 2018
longquanzheng changed the title from Delivering disordered messages during shutdown to Delivering disordered messages during shutdown(rebalance) on Aug 10, 2018
longquanzheng changed the title from Delivering disordered messages during shutdown(rebalance) to Delivering disordered messages during shutdown(rebalance), causing message loss on Aug 10, 2018
@dim
Member

dim commented Aug 13, 2018

@longquanzheng from what I know, messages are only ordered per topic/partition. Also, Close() doesn't do anything specific, i.e. we are just closing the consumer and waiting for exit. Finally, the test I mentioned above exercises exactly this case: it starts consumers, shuts them down, and starts new ones, and no messages are missed, neither locally nor on CI. I am not sure how to debug this, as it doesn't seem to happen to us, nor has it been reported by anyone else before.

@longquanzheng
Author

@dim we finally found that the bug is here: georgeteo/sarama@9618a79

@georgeteo
Contributor

Actually, the issue is with the sarama/sarama-cluster contract. Specifically, in https://github.com/bsm/sarama-cluster/blob/master/partitions.go#L89, you use partitionConsumer.Close(), which drains the Messages channel.

We use the PartitionConsumer abstraction from sarama-cluster, and on Close or rebalance we continue reading from the Messages channel. But because Sarama is also reading from that channel, we end up with holes in our stream (e.g., we get msg 100, the sarama drain takes msg 101, and so on). Then, due to a race condition in the shutdown procedure, we may commit msg 102 and lose messages.

The proposed fix would be to use AsyncClose in sarama-cluster partitions.go:89.
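As a toy illustration of the problem (plain Go, not sarama code): two readers on a single channel each take a share of the remaining messages, so the "application" reader is left with gaps:

package main

import (
	"fmt"
	"sync"
)

func main() {
	msgs := make(chan int, 10)
	for i := 1; i <= 10; i++ {
		msgs <- i
	}
	close(msgs)

	var (
		mu  sync.Mutex
		app []int
		wg  sync.WaitGroup
	)
	wg.Add(2)
	go func() { // the "application" reader
		defer wg.Done()
		for m := range msgs {
			mu.Lock()
			app = append(app, m)
			mu.Unlock()
		}
	}()
	go func() { // the "drain" reader, like the one Close() starts internally
		defer wg.Done()
		for range msgs {
		}
	}()
	wg.Wait()
	fmt.Println("application saw:", app) // typically a subset with holes, e.g. [1 2 4 5 7 ...]
}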

@georgeteo
Contributor

This is the PR with the fix for this: #258.

@georgeteo
Contributor

Thanks for accepting the PR. Can you tag a new release as well?

@dim
Member

dim commented Aug 16, 2018

@georgeteo looks like you didn't run the tests 😄

@danpmx

danpmx commented Aug 16, 2018

I tried the #258 fix today and reverted it.
Something looks wrong: the consumer is not consuming messages.
When listening on the Notifications channel, I receive infinite rebalance error messages.

@jpiper

jpiper commented Aug 16, 2018

By the way, #258 worked for me with kafkaConfig.Group.Mode = cluster.ConsumerModePartitions; it was only the default mode that crashed for me.

@georgeteo
Contributor

@danpmx, @jpiper, I'm unable to reproduce the crash. Can you post your consumer configuration?

When running the following non-partition-mode consumer code, I don't see either a crash or an infinite rebalance:

package main

import (
	"log"

	"github.com/Shopify/sarama"
	cluster "github.com/bsm/sarama-cluster"
)

func main() {
	config := cluster.NewConfig()
	config.Group.Return.Notifications = true
	config.Consumer.Return.Errors = true
	config.Consumer.Offsets.Initial = sarama.OffsetOldest
	consumer, err := cluster.NewConsumer([]string{"localhost:9092"}, "cg", []string{"test01"}, config)
	if err != nil {
		log.Fatal("unable to start consumer", err)
	}

	for {
		select {
		case msg, ok := <-consumer.Messages():
			if !ok {
				log.Println("ERROR: message channel is closed")
				continue
			}
			log.Printf("INFO: received offset %d from %s-%d\n", msg.Offset, msg.Topic, msg.Partition)
		case err, ok := <-consumer.Errors():
			if !ok {
				log.Println("ERROR: error channel is closed")
				continue
			}
			log.Printf("INFO: received error %s from sarama-cluster", err.Error())
		case ntf, ok := <-consumer.Notifications():
			if !ok {
				log.Println("ERROR: notification channel is closed")
				continue
			}
			log.Printf("INFO: received notification %s from sarama-cluster", ntf.Type.String())
		}
	}
}

Two consumer workers:

Worker 1

 ./sarama-test
2018/08/17 10:22:43 INFO: received notification rebalance start from sarama-cluster
2018/08/17 10:22:47 INFO: received notification rebalance OK from sarama-cluster
2018/08/17 10:22:48 INFO: received offset 19 from test01-0
2018/08/17 10:22:48 INFO: received offset 20 from test01-0
2018/08/17 10:22:48 INFO: received offset 21 from test01-0
# Worker 2 joins here
2018/08/17 10:22:51 INFO: received notification rebalance start from sarama-cluster
2018/08/17 10:22:51 INFO: received notification rebalance OK from sarama-cluster
2018/08/17 10:23:00 INFO: received offset 22 from test01-0
2018/08/17 10:23:02 INFO: received offset 23 from test01-0
2018/08/17 10:23:03 INFO: received offset 24 from test01-0
2018/08/17 10:23:04 INFO: received offset 25 from test01-0
# Worker 2 leaves here.
2018/08/17 10:23:12 INFO: received notification rebalance start from sarama-cluster
# Worker 1 leaves. 

Worker 2

# worker 1 leaves
./sarama-test
2018/08/17 10:23:10 INFO: received notification rebalance start from sarama-cluster
2018/08/17 10:24:06 INFO: received notification rebalance error from sarama-cluster
2018/08/17 10:24:06 INFO: received notification rebalance start from sarama-cluster
2018/08/17 10:24:09 INFO: received notification rebalance OK from sarama-cluster
2018/08/17 10:24:09 INFO: received offset 26 from test01-0
2018/08/17 10:24:09 INFO: received offset 27 from test01-0
2018/08/17 10:24:09 INFO: received offset 28 from test01-0
2018/08/17 10:24:09 INFO: received offset 29 from test01-0
2018/08/17 10:24:27 INFO: received notification rebalance start from sarama-cluster
2018/08/17 10:24:27 INFO: received notification rebalance OK from sarama-cluster

@dim: do you have any clues why the non-partition consumer mode might be broken with my recent change?

@jpiper

jpiper commented Aug 17, 2018

@georgeteo I’m using this config

kafkaConfig := cluster.NewConfig()
sarama.MaxResponseSize = 104857600
sarama.MaxRequestSize = 104857600

kafkaConfig.Version = sarama.V1_1_0_0 // Sarama will default to 0.8
kafkaConfig.Group.PartitionStrategy = cluster.StrategyRoundRobin
kafkaConfig.Consumer.Return.Errors = true
kafkaConfig.Group.Return.Notifications = true
kafkaConfig.ChannelBufferSize = 1000

@venkat1109

venkat1109 commented Aug 17, 2018

I am able to reproduce behavior identical to what @danpmx reported. I dug into this and the root cause appears to be a deadlock in the underlying sarama library (which existed even before the fix added by george). But the new fix causes this deadlock to manifest differently: after the fix, it shows up as a bunch of rebalance errors; before the fix, the deadlock led to the consumer not receiving any messages at all. Here is the potential bug I discovered:

  • Setup

    • Topic with two partitions
    • Each partition has about 10k messages
    • Partition ChannelBufferSize=32, dwellTimer=1s and maxProcessingTime=250ms
    • Consumer started in multiplexed mode; it sleeps 1ms after processing each message and sleeps for a second before consuming the first message
    • I trigger rebalances every 3-4s or so
  • Deadlock

    • Both partition consumers are started by the sarama library
    • A partition consumer blocks for a while and abandons its subscription (link_1 link_2) because the upstream is slow
    • While it's in this abandoned state, a rebalance is triggered
    • As part of the rebalance, sarama-cluster's consumer.nextTick() calls release() to release all subscriptions
    • Prior to the fix added by george (current master), this release() call blocks forever
    • After the fix added by george, Close() is replaced by AsyncClose(), so release() finishes, but then the subscribe() call in nextTick() fails with this error. This goes on in a loop and results in infinite rebalance errors

So, I see two issues now:

  • the message loss originally reported by this issue
  • the deadlock, which existed even before this fix

Whoever owns sarama: please verify whether my analysis above is valid and update this ticket.

@danpmx

danpmx commented Aug 19, 2018

@georgeteo our configuration:

conf := cluster.NewConfig()
conf.ChannelBufferSize = 1000
conf.Consumer.Return.Errors = true
conf.Group.Return.Notifications = true
conf.Group.Mode = cluster.ConsumerModePartitions

@imjustfly
Contributor

@venkat1109 The release() call will not block, because sarama.PartitionConsumer.Close() starts a goroutine to drain the messages channel. This is why a rebalance will lead to data loss.

rebalance -> release -> PartitionConsumer.Close() -> drain messages -> data loss

@venkat1109

@imjustfly - sarama.PartitionConsumer.Close() will block because of this for-range loop (and not because of the drain, which runs in its own goroutine):
https://github.com/Shopify/sarama/blob/master/consumer.go#L431

i.e. sarama.PartitionConsumer.Close() will return only after the child.errors channel is closed.

@imjustfly
Contributor

@venkat1109 child.errors is closed:

Close() -> AsyncClose() -> close(dying) -> close(trigger) -> close(feeder) -> close(errors)

@venkat1109

@imjustfly please see the sequence of steps I described above.

  • You are right that the AsyncClose() code path is supposed to close feeder
  • But when the upstream is slow, the responseFeeder goroutine will block here (because of expiryTicker), so feeder will never get closed
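To make the shape concrete, here is a toy model in plain Go (not sarama code, just the same channel pattern): "Close()" waits for an errors channel that is only closed after the feeder exits, while the feeder is stuck delivering to a consumer that is no longer reading:

package main

import (
	"fmt"
	"time"
)

func main() {
	messages := make(chan int) // unbuffered; nobody reads after the "rebalance"
	errs := make(chan error)

	// Stand-in for the feeder: it blocks delivering to a slow consumer,
	// so it never reaches the point where it would close(errs).
	go func() {
		messages <- 42 // blocks forever: the consumer is "slow"
		close(errs)
	}()

	// Stand-in for Close(): it only returns once errs is closed.
	done := make(chan struct{})
	go func() {
		for range errs {
		}
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("Close() returned")
	case <-time.After(2 * time.Second):
		fmt.Println("Close() is still blocked: the feeder never closed the errors channel")
	}
}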

vprithvi added a commit to jaegertracing/jaeger that referenced this issue Aug 23, 2018
- Shutdown all partitions before shutting down the sarama consumer. This sidesteps bsm/sarama-cluster#255 and ensures that the shutdown completes in a reasonable timeframe.
- Wait for PartitionConsumer shutdown before consuming messages
- Use Sarama's PartitionConsumer mock instead of relying on our own because it is richer and well tested.
isaachier pushed a commit to isaachier/jaeger that referenced this issue Sep 3, 2018
- Shutdown all partitions before shutting down the sarama consumer. This sidesteps bsm/sarama-cluster#255 and ensures that the shutdown completes in a reasonable timeframe.
- Wait for PartitionConsumer shutdown before consuming messages
- Use Sarama's PartitionConsumer mock instead of relying on our own because it is richer and well tested.
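A rough sketch of that shutdown order (not the jaeger code; it assumes the application uses ConsumerModePartitions and keeps the cluster.PartitionConsumer values it received from consumer.Partitions()):

package shutdown

import (
	cluster "github.com/bsm/sarama-cluster"
)

// closeAll shuts down every partition consumer first, drains what remains,
// and only then closes the parent consumer.
func closeAll(consumer *cluster.Consumer, partitions []cluster.PartitionConsumer) error {
	for _, pc := range partitions {
		pc.AsyncClose()
		// Drain the remaining messages ourselves so no other goroutine
		// races us for them; the channel is closed once shutdown completes.
		for range pc.Messages() {
		}
	}
	return consumer.Close()
}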