Ensure checkPendingRequests is run when the throttle delay is marginal #1532

Merged
merged 1 commit on Feb 27, 2023

Conversation

ekanth (Contributor) commented Feb 17, 2023

We have observed 10-minute pauses on consumers during fetch, which recover only after the Kafka broker resets the connection. Here are the logs from the client:

1:12:48.629 PM	
     correlationId: 67644
     expectResponse: true
     logger: kafkajs
     message: Request Fetch(key: 1, version: 11)
     namespace: Connection
     size: 143
1:12:48.630 PM	
     correlationId: 67644
     logger: kafkajs
     message: Request enqueued
     namespace: RequestQueue
1:22:48.632 PM	
     logger: kafkajs
     message: Kafka server has closed connection
     namespace: Connection

On debugging this further and reading the code, we observed that if the throttle delay is marginal (say 1ms, as in this case), the request is not sent immediately in push. When scheduleCheckPendingRequests is called, it computes this.throttledUntil - Date.now() again. If Date.now() has progressed by 1ms in the meantime (a race condition), this.throttledUntil - Date.now() is zero and checkPendingRequests() is not scheduled at all. The connection then goes into limbo until the broker kills it based on its connections.max.idle.ms configuration (which defaults to 10 minutes).
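For illustration, a minimal, runnable sketch of the race (a simplified model, not the actual kafkajs source; the helper bodies below are hypothetical, only the function names mirror the ones discussed):

// Hypothetical, simplified model of the pre-fix scheduling race.
const pending = []
let throttledUntil = -1

function maybeThrottle(ms) {
  // The broker asked us to back off for `ms` milliseconds (here: a marginal 1ms).
  throttledUntil = Date.now() + ms
}

function checkPendingRequests() {
  // Sends the parked requests once the throttle has expired.
  while (pending.length > 0) pending.shift().send()
}

function scheduleCheckPendingRequests() {
  // The delay is recomputed from Date.now() here; by now the marginal 1ms
  // may already have elapsed, so scheduleAt can be zero or negative.
  const scheduleAt = throttledUntil - Date.now()
  if (scheduleAt > 0) {
    setTimeout(checkPendingRequests, scheduleAt)
  }
  // Otherwise nothing is scheduled: the request stays in `pending` until the
  // broker closes the idle connection (connections.max.idle.ms, 10 minutes by default).
}

maybeThrottle(1) // marginal 1ms throttle
setTimeout(() => {
  pending.push({ send: () => console.log('Fetch sent') }) // push() parks the request
  scheduleCheckPendingRequests() // the 1ms has already passed: no check is ever scheduled
}, 2)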

This fixes the root cause of the same issue called out in #660.

The code changes:

  • Ensure checkPendingRequests is run even when the throttle delay is marginal. When there are pending requests, we keep scheduling checkPendingRequests() periodically. I am open to changing the default interval if there are concerns.
  • Ensure maybeThrottle accounts for negative values. This is extra protection, and the Java client does the same.
  • (Not specifically related to this bug, but noticed while reading the code.) Move the lock acquisition out of the try block so that we only release the lock (in the finally) after it has been acquired successfully.
  • Add a test that exposes the problem and passes with the fix.

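The hunk under review, from the request queue's scheduling logic (scheduleCheckPendingRequests):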
let scheduleAt = this.throttledUntil - Date.now()
if (!this.throttleCheckTimeoutId) {
  if (this.pending.length > 0) {
    scheduleAt = scheduleAt > 0 ? scheduleAt : CHECK_PENDING_REQUESTS_INTERVAL
Nevon (Collaborator):

Why is it that we need a minimum timeout here? I'm thinking that if the throttle time was marginal, say 1ms, and 1ms has elapsed by the time we check it, then this.throttledUntil - Date.now() will be either zero or negative. At that point, couldn't we just set the timeout value to Math.max(scheduleAt, 0)? That would just result in the pending check request happening on the next turn around the event loop, which is what we want in this case since the throttle period has already ended and we don't want to delay the requests further.

The only problem I see with that is that we need to make sure this.throttledUntil is cleared after we've scheduled the check, so that we don't end up always doing setTimeout(..., 0) as soon as we've been throttled once.
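Roughly what I have in mind (untested sketch; the callback shape and the -1 "not throttled" sentinel are assumptions, the field names come from the hunk above):

// Untested sketch of the alternative: schedule on the next tick when the
// throttle has already elapsed, then clear the throttle marker.
const scheduleAt = Math.max(this.throttledUntil - Date.now(), 0)

if (!this.throttleCheckTimeoutId) {
  this.throttleCheckTimeoutId = setTimeout(() => {
    this.throttleCheckTimeoutId = null
    this.checkPendingRequests()
  }, scheduleAt)
  // Clear the throttle once the check has been scheduled, so a throttle in
  // the past doesn't turn every later call into a setTimeout(..., 0).
  this.throttledUntil = -1
}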

Does that make sense?

ekanth (Contributor, author):

Thanks @Nevon for the quick review!

Doing Math.max(scheduleAt, 0) has exactly the problem you mentioned: checkPendingRequests() and scheduleCheckPendingRequests() keep calling each other in a loop without an exit criterion. Given that pending tracks requests held back both by throttling and by the max in-flight limit, I felt the simplest implementation was a minimum timeout that kicks in while pending is non-empty and stops otherwise.

ekanth (Contributor, author):

@Nevon - would love to hear if you have any follow-up thoughts on the approach for this fix.

Nevon (Collaborator):

I'm not sure I'm sold on the idea of introducing up to a 10ms delay whenever there's been marginal throttling, but it's still an improvement over the current situation. This can always be improved if it turns out that it actually happens with some amount of frequency.

ekanth (Contributor, author) commented Feb 27, 2023

@Nevon - I'd appreciate another look and help in merging this PR. This issue has been affecting us, and we would love to get it merged and into an official release that fixes it.

ekanth requested a review from Nevon on February 27, 2023 at 07:38