
Fix "broker received out of order sequence" when brokers die #1661

Merged
merged 2 commits into master from ktsanaktsidis/fix_seq_out_of_order on May 4, 2020

Conversation

@KJTsanaktsidis (Contributor) commented Apr 9, 2020

When the following three conditions are satisfied, the producer code can skip message sequence numbers and cause the broker to complain that the sequences are out of order:

  • config.Producer.Idempotent is set
  • The producer loses, and then regains, its connection to a broker
  • The client code continues to attempt to produce messages whilst the broker is unavailable.

For every message the client attempted to send while the broker was unavailable, the transaction manager's sequence number is incremented; however, these messages eventually fail and return an error to the caller. When the broker reappears and another message is published, its sequence number is higher than the last one the broker remembered, because the values attempted while the broker was down were never seen. Thus, from the broker's perspective, it's seeing out-of-order sequence numbers.
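For concreteness, here is a minimal sketch (not the author's gist linked below) of a producer configured to satisfy these conditions; the broker address and topic are assumed values:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama" // Sarama's import path at the time of this PR
)

func main() {
	config := sarama.NewConfig()
	config.Version = sarama.V0_11_0_0                // idempotence needs Kafka >= 0.11
	config.Producer.Idempotent = true                // condition 1 above
	config.Producer.RequiredAcks = sarama.WaitForAll // required when Idempotent is set
	config.Net.MaxOpenRequests = 1                   // required when Idempotent is set
	config.Producer.Return.Successes = true          // required by SyncProducer

	// Assumed broker address; point this at a broker you can cut off.
	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Keep producing through the outage (condition 3). Pre-fix, each send
	// that exhausts its retries returns an error here, but has already
	// consumed a sequence number inside the transaction manager.
	for {
		_, _, err := producer.SendMessage(&sarama.ProducerMessage{
			Topic: "test-topic", // assumed topic
			Value: sarama.StringEncoder("payload"),
		})
		if err != nil {
			log.Printf("produce failed: %v", err)
		}
	}
}
```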

The fix to this has a few parts (a rough sketch of the epoch handling follows the list):

  • Don't obtain a sequence number from the transaction manager until we're sure we want to try publishing the message
  • Affix the producer ID and epoch to the message once the sequence is generated
  • Increment the transaction manager epoch (and reset all sequence numbers to zero) when we permanently fail to publish a message. That represents a sequence that the broker will never see, so the only safe thing to do is to roll over the epoch number.
  • Ensure we don't publish message sets that contain messages from multiple transaction manager epochs.
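To illustrate the third part, here is a hypothetical sketch of the epoch roll-over; the type and method names are illustrative, not Sarama's actual internals:

```go
package producer

import (
	"fmt"
	"sync"
)

// txnManager is a hypothetical stand-in for Sarama's transaction manager.
type txnManager struct {
	mu            sync.Mutex
	producerEpoch int16
	sequences     map[string]int32 // per topic/partition sequence counters
}

func newTxnManager() *txnManager {
	return &txnManager{sequences: make(map[string]int32)}
}

// nextSequence hands out a sequence number together with the epoch it was
// generated under, so the producer can affix both to the message only once
// it is sure it wants to try publishing (parts 1 and 2 of the fix).
func (t *txnManager) nextSequence(topic string, partition int32) (int32, int16) {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := fmt.Sprintf("%s-%d", topic, partition)
	seq := t.sequences[key]
	t.sequences[key]++
	return seq, t.producerEpoch
}

// bumpEpoch is called when a message permanently fails after consuming a
// sequence number: the broker will never see that sequence, so the only
// safe thing to do is roll the epoch and restart every sequence from zero.
func (t *txnManager) bumpEpoch() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.producerEpoch++
	for k := range t.sequences {
		t.sequences[k] = 0
	}
}
```

Because each message carries the epoch it was sequenced under, the fourth part (never mixing epochs within one message set) reduces to a simple comparison when batches are assembled.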

I think this should be a fix for #1430. I tested it with this test harness: https://gist.github.com/KJTsanaktsidis/12a33a9e6e864857b91f639947567ac3 and tried a few things while it was running:

  • Drop broker traffic with iptables, wait for messages to be retried, and then re-enable broker traffic: this should not bump the epoch, because the messages are successfully published after retrying
  • Drop broker traffic with iptables, wait for retries to be exhausted and for Sarama to start reporting errors to the client code, and then re-enable traffic. This should bump the epoch, because publishes returned errors to Sarama's caller after having caused a sequence number increment
  • Cleanly stop and start the kafka broker while the test is running

Seemed to all come out OK.

This is, however, very much still a WIP PR. In particular:
* ~~It needs tests~~ I added a test case that covers this.
* ~~It needs to be used in anger in one of our production systems~~ We've now deployed this to production at Zendesk over the last week.
* ~~I couldn't get the e2e tests running on my machine~~ I provided some suggestions about how we could make the e2e tests work more easily on dev machines.
* I very much would love your eyeballs on this to see if this looks like the right solution to you!

@KJTsanaktsidis requested a review from bai as a code owner on April 9, 2020
@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch 4 times, most recently from 9c6582c to bb8dd43, on April 14, 2020
@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch from bb8dd43 to 9df3038 on April 14, 2020
@KJTsanaktsidis (Contributor, Author)

Think " CI / Go 1.14.x with Kafka 2.4.0 on Ubuntu" job just needs to be poked, looks like it failed before it even got to running the test suite

@dnwe (Collaborator) commented Apr 14, 2020

@KJTsanaktsidis thanks for debugging this through and coming up with a solution. The changes that you propose here look good to me. It would be good if we could replicate this in the functional tests with toxiproxy, but that doesn't necessarily have to be done as part of this PR or before merging.

I've noticed a few people now saying they've been unable to get the functional tests to run locally. I wonder if we should investigate making them simpler to run by using docker or k3s for the brokers rather than the existing vagrant-based setup (/cc @bai)

@KJTsanaktsidis (Contributor, Author) commented Apr 14, 2020

FWIW, I didn't manage to get the functional tests running properly either! (EDIT: just realised I said this in my PR description, which is why you brought it up.)

In theory it should be relatively simple to set up a functional test case for this: all that's needed is to publish a bunch of messages, use toxiproxy to blackhole the broker for a while, and then bring the broker back and check that messages can still be published.
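A rough shape for that test case, using the toxiproxy Go client; the proxy name, addresses, and topic here are assumptions about the test environment rather than values from this PR:

```go
package functional

import (
	"testing"
	"time"

	"github.com/Shopify/sarama"
	toxiproxy "github.com/Shopify/toxiproxy/client"
)

func TestIdempotentProducerSurvivesBrokerOutage(t *testing.T) {
	toxi := toxiproxy.NewClient("localhost:8474")
	proxy, err := toxi.Proxy("kafka1") // assumed proxy name
	if err != nil {
		t.Fatal(err)
	}

	config := sarama.NewConfig()
	config.Version = sarama.V0_11_0_0
	config.Producer.Idempotent = true
	config.Producer.RequiredAcks = sarama.WaitForAll
	config.Net.MaxOpenRequests = 1
	config.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"localhost:29091"}, config)
	if err != nil {
		t.Fatal(err)
	}
	defer producer.Close()

	send := func() error {
		_, _, err := producer.SendMessage(&sarama.ProducerMessage{
			Topic: "test.1", Value: sarama.StringEncoder("x"), // assumed topic
		})
		return err
	}

	// Publish a bunch of messages while the broker is healthy.
	for i := 0; i < 10; i++ {
		if err := send(); err != nil {
			t.Fatal(err)
		}
	}

	// Blackhole the broker for a while; these sends are expected to
	// exhaust their retries and fail.
	if err := proxy.Disable(); err != nil {
		t.Fatal(err)
	}
	for i := 0; i < 10; i++ {
		send() // errors expected here
	}

	// Bring the broker back and check we can still publish.
	if err := proxy.Enable(); err != nil {
		t.Fatal(err)
	}
	time.Sleep(5 * time.Second) // crude: allow reconnection
	if err := send(); err != nil {
		t.Fatalf("produce after recovery failed: %v", err)
	}
}
```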

If I get a moment this week I'll see if I can wrestle with vagrant enough to get an integration test going. We're currently working out how to get enough confidence in this change to ship it to production, so integration tests could help us with that anyway.

@bai (Contributor) commented Apr 14, 2020

Yeah, I've had that on my TODO list but haven't had time to attack it, to be honest. I've been looking into either docker-compose or k3s, but I'm personally leaning towards the latter.

I think the current vagrant setup is somewhat outdated.

@KJTsanaktsidis (Contributor, Author)

I think the Vagrantfile hasn't worked since it was bumped from trusty to bionic in b8c5f7c: setup_services.sh tries to set up classic upstart/sysv-init style jobs for toxiproxy/zookeeper/kafka.

This could probably be fixed by turning them into systemd units, but you're right that a more wholesale fix to use containers for this could simplify things a lot.

In any case, I'll bodge something together on my machine to try and get the tests to work and submit a functional test to this PR; actually fixing the local functional test runner is probably a job for a different PR.

@KJTsanaktsidis (Contributor, Author)

@bai @dnwe it was certainly an ordeal, but I got the integration tests running on my machine. What I ended up doing was making a docker-compose file with kafka/zookeeper/toxiproxy, and moving all of the topic creation/seeding stuff to the golang test code itself. I did have a look at k3s for this, but it doesn't seem to have a good macOS story.

This is it: zendesk@0650324

So the tests can be run like this:

docker-compose up -d
# wait for everything to calm down
TOXIPROXY_ADDR="http://localhost:8474" KAFKA_PEERS="localhost:29091,localhost:29092,localhost:29093,localhost:29094,localhost:29095" KAFKA_VERSION="2.4.1" CI="true" DEBUG="true"  make test

Could probably replace the CI runner with the docker compose file with a bit of tweaking too.

In any case, after all that, I did write a functional test and add it to this PR.

@KJTsanaktsidis force-pushed the ktsanaktsidis/fix_seq_out_of_order branch from 955b7df to ca14191 on April 16, 2020
@@ -96,6 +97,83 @@ func TestFuncProducingToInvalidTopic(t *testing.T) {
safeClose(t, producer)
}

func TestFuncProducingIdempotentWithBrokerFailure(t *testing.T) {
💯

@KJTsanaktsidis (Contributor, Author)

@bai @dnwe Just as a heads-up, we've been shipping this change to production at Zendesk slowly over the last week. It's gone off without a hitch, so I think we can say this has gotten some usage in anger now!

@dnwe (Collaborator) commented Apr 24, 2020

@KJTsanaktsidis thanks for the update. Great news. Are you running against kafka 2.3/2.4 in the backend?

@KJTsanaktsidis (Contributor, Author)

Ha, unfortunately we're using kafka 2.1.1 in production (and 2.2.2 in our staging environment).

@dnwe (Collaborator) left a review comment

@KJTsanaktsidis thanks for this fix 👍

@dnwe merged commit f7b64cf into IBM:master on May 4, 2020
@dnwe (Collaborator) commented May 4, 2020

@KJTsanaktsidis did you want to also propose a PR for your vagrant --> docker-compose migration for the functional tests?

@KJTsanaktsidis (Contributor, Author) commented May 5, 2020

@dnwe yeah, if you want to go in that direction I think I can put together a PR that has:

  • A docker-compose file that works on Mac, Linux & Windows for running the functional tests
  • Tweaks to the GitHub Actions CI runners to use compose to run the tests
  • Removal of the vagrant stuff, since it no longer works

That sounds like a job for my next lab day, which is at the end of next week. Sound like a plan?

@dnwe (Collaborator) commented May 5, 2020

@KJTsanaktsidis that sounds great to me. I think making functional tests against a real Kafka easier to run, and hence easier to write, will be extremely valuable.

@bai / @d1egoaz any thoughts?

@bai (Contributor) commented May 5, 2020

I'm 👍 on getting rid of vagrant and simplifying this setup, and I think docker-compose is a good candidate for it. I'd also suggest k3s, but I assume docker-compose up is better in terms of simplicity, so we don't have to kubectl apply things. Thoughts?

@dnwe (Collaborator) commented May 5, 2020

Yeah, so we use k3s internally in our CI builds, but in this scenario it doesn't really buy us much above and beyond docker-compose: we'd invariably be running it within docker (for non-Linux devs), so it would just add an additional layer, and we'd have the complexities of statefulset YAML or of adopting something like strimzi. All of that increases the complexity of things for devs to run and understand, so I think compose is a reasonable solution.

The alternative would be to use the docker Go bindings and drive the start and stop of containers directly from the tests themselves, without having any "bring up the cluster" step outside of Go at all.
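A sketch of what that alternative might look like with the Docker Go bindings (API as of 2020); the container name is an assumed compose-generated name:

```go
package functional

import (
	"context"
	"testing"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// restartKafka stops and restarts a broker container straight from the
// test, with no "bring up the cluster" step outside of Go.
func restartKafka(t *testing.T) {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		t.Fatal(err)
	}

	const kafkaContainer = "sarama_kafka_1" // assumed container name

	timeout := 10 * time.Second
	if err := cli.ContainerStop(ctx, kafkaContainer, &timeout); err != nil {
		t.Fatal(err)
	}
	if err := cli.ContainerStart(ctx, kafkaContainer, types.ContainerStartOptions{}); err != nil {
		t.Fatal(err)
	}
}
```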

@bai (Contributor) commented May 5, 2020

Makes sense, thanks. So my vote goes for docker-compose then 😄
