
Fix brokers continually allocating new Session IDs #1644

Merged
merged 1 commit into IBM:master from ktsanaktsidis/fix_session_id_thrashing on Apr 9, 2020

Conversation

KJTsanaktsidis
Contributor

KIP-227 says that if a client provides a session ID of 0 and a session epoch of 0 in a FetchRequest, the broker should allocate a new sessionID. Then, the consumer can use this ID to avoid repeating the list of partitions to consume on every call. I figured this out from the table in KIP-227 here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability#KIP-227:IntroduceIncrementalFetchRequeststoIncreasePartitionScalability-FetchRequestMetadatameaning (let me know if I got this wrong!)

Sarama is currently sending 0 for both values (I validated this with a packet capture) but totally ignoring the session ID generated by the broker. This means that the broker is allocating a new session ID on every request, filling its session ID cache and leading to FETCH_SESSION_ID_NOT_FOUND logs being written by the broker because other actual session IDs (from other consumers) are evicted. These look like this:

[ReplicaFetcher replicaId=6, leaderId=2, fetcherId=9] Node 2 was unable to process the fetch request with (sessionId=1965743757, epoch=601): FETCH_SESSION_ID_NOT_FOUND.

Instead, we should set the epoch to -1, which instructs the broker not to allocate a new session ID.

I'm 90% sure I got the versioning right (using request.Version = 7 for broker versions >= 1.1.0) - I got the info for this from here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75957546
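
For illustration, here is a minimal Go sketch of what the fix boils down to. The type and field names (fetchRequest, SessionID, SessionEpoch) are illustrative stand-ins for this sketch, not Sarama's exact internals:

```go
package main

import "fmt"

// Minimal sketch of the KIP-227 session fields carried by Fetch v7+.
// The type and field names here are illustrative, not Sarama's exact code.
type fetchRequest struct {
	Version      int16
	SessionID    int32
	SessionEpoch int32
}

// newSessionlessFetchRequest builds a v7+ fetch request that opts out of
// incremental fetch sessions: SessionID=0 together with SessionEpoch=-1
// tells the broker not to allocate (and cache) a new session ID.
func newSessionlessFetchRequest(version int16) fetchRequest {
	req := fetchRequest{Version: version}
	if version >= 7 {
		req.SessionID = 0
		req.SessionEpoch = -1 // previously 0, which asked the broker for a new session on every request
	}
	return req
}

func main() {
	req := newSessionlessFetchRequest(7)
	fmt.Printf("sessionID=%d epoch=%d\n", req.SessionID, req.SessionEpoch)
}
```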

@ghost ghost added the cla-needed label Mar 18, 2020
@KJTsanaktsidis
Contributor Author

I've sent a request to our legal department to get the CLA signed.

@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_session_id_thrashing branch from d2343ed to 0c4ec76 on March 18, 2020 06:29
@bai
Contributor

bai commented Mar 18, 2020

Thanks for contributing. I'm going to need to wait for the CLA thing to pass before proceeding, but LGTM.

@bai bai requested a review from dnwe March 18, 2020 06:54
@dnwe
Collaborator

dnwe left a comment

@KJTsanaktsidis thanks for finding this mistake and submitting the PR 👍

The changes look good to me. I'll raise a separate issue to track fully implementing incremental fetching (KIP-227) in the future. FYI, the Kafka APIs page maintained by our own @mimaison is a good source of info when trying to determine which version of each API was supported by which Kafka version.

Note, I had also recently seen apache/kafka#7640 in this area, which you might be interested in. I don't believe it has been backported from the (unreleased) 2.5 on trunk yet, though.

@KJTsanaktsidis
Contributor Author

Ah cool - yeah, the Kafka APIs page says that 1.1 has fetch API v7, so I think I've got it right in this PR.
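
As an aside, here is a minimal usage sketch of how a client ends up on the newer protocol: declaring a broker version of at least 1.1.0 via config.Version is what lets Sarama send Fetch v7 requests with these session fields. The broker address below is a placeholder, and the import path is the repository's current module path, so adjust to your setup:

```go
package main

import (
	"log"

	"github.com/IBM/sarama" // was github.com/Shopify/sarama when this PR was merged
)

func main() {
	cfg := sarama.NewConfig()
	// Declaring a broker version of at least 1.1.0 lets Sarama use Fetch v7,
	// which carries the sessionID/sessionEpoch fields from KIP-227.
	cfg.Version = sarama.V1_1_0_0

	consumer, err := sarama.NewConsumer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	defer consumer.Close()
}
```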

apache/kafka#7640 is interesting - I believe it would make the effect of this bug worse, because it evicts the oldest sessions (opened by correctly-behaving consumers) in preference to the least-recently-used ones (sessions created by the bug fixed in this PR would never actually be used again, so we'd want those to be evicted first).

We've compared our broker logs with & without this patch, and this change definitely makes pretty much all of the FETCH_SESSION_ID_NOT_FOUND errors go away, so I don't think KAFKA-9137 is a huge problem for us except for how it interacts with this bug.

Thanks for the reviews! I'll let you know when we've executed the CLA.

@KJTsanaktsidis
Contributor Author

So, bad news guys - I spoke to our legal people and they said that there's a bit of a process to go through, and COVID-19 is definitely slowing things down, so they reckon an ETA of about a month to get the CLA signed.

I don't know if you guys want to just wait, or maybe independently fix the problem based on my PR description? It's hard to see how you could actually fix it in a different way from what I did, but as long as the person who fixes it doesn't look at my code, it shouldn't matter if they come up with the same solution, right?

Again, really sorry about the legal troubles here.

@KJTsanaktsidis
Contributor Author

Just an update on the CLA process - our legal people have OK'd it, just needs to be signed by leadership in the next couple of days and then we should be good to go!

@bai
Contributor

bai commented Apr 8, 2020

Thanks for expediting this, looking forward to getting this merged 🎉

KIP-227 says that if a client provides a session ID of 0 and a session
epoch of 0 in a FetchRequest, the broker should allocate a new
sessionID. Then, the consumer can use this ID to avoid repeating the
list of partitions to consume on every call.

Sarama is currently sending 0 for both values, but totally ignoring the
session ID generated by the broker. This means that the broker is
allocating a new session ID on every request, filling its session ID
cache and leading to FETCH_SESSION_ID_NOT_FOUND logs being written by
the broker because other _actual_ session IDs are evicted.

Instead, we should set the epoch to -1, which instructs the broker
not to allocate a new session ID that's just going to be ignored anyway.
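
To make the epoch semantics the commit message relies on concrete, here is a small Go sketch mirroring the FetchRequest metadata table in KIP-227; the constant and function names are made up for this sketch, not taken from Sarama or Kafka:

```go
package main

import "fmt"

// Illustrative constants mirroring KIP-227's FetchRequest metadata table;
// the names are invented for this sketch, not taken from Sarama or Kafka.
const (
	sessionEpochSessionless = -1 // broker must not create or use a fetch session
	sessionEpochNewSession  = 0  // with sessionID=0: broker allocates a new session
)

// describeFetchMetadata explains how a broker interprets the session fields.
func describeFetchMetadata(sessionID, epoch int32) string {
	switch {
	case sessionID == 0 && epoch == sessionEpochSessionless:
		return "sessionless full fetch request (what this commit sends)"
	case sessionID == 0 && epoch == sessionEpochNewSession:
		return "full fetch request; broker allocates a new session ID (the old behaviour)"
	case epoch == sessionEpochSessionless:
		return "full fetch request that closes the existing session"
	default:
		return "incremental fetch request against an existing session"
	}
}

func main() {
	fmt.Println(describeFetchMetadata(0, -1))
	fmt.Println(describeFetchMetadata(0, 0))
	fmt.Println(describeFetchMetadata(1965743757, 601))
}
```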
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/fix_session_id_thrashing branch from 0c4ec76 to b9b3d09 on April 8, 2020 23:07
@ghost ghost removed the cla-needed label Apr 8, 2020
@KJTsanaktsidis
Contributor Author

KJTsanaktsidis commented Apr 8, 2020

🎂 @bai we got the CLA signed, this change should be ready to rock! Thanks for your patience. I amended the commit to tickle CI into running the CLA check again.

@bai bai merged commit 6159078 into IBM:master Apr 9, 2020
@domano

domano commented Apr 17, 2020

Hey, will there be a bugfix release with this included soon? This is currently causing quite a few problems for our platform, and I prefer referencing SemVer releases over commit hashes.

Thanks :)

@mimaison
Contributor

mimaison commented May 6, 2020

@bai This is also causing us issues. Is it possible to get a new release out?
