Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KREST-2746 Reflect slow responses from Kafka back to the http client #1043

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ehumber
Copy link
Member

@ehumber ehumber commented Jul 18, 2022

This PR adds code to reflect Kafka rate limiting back to the http client.

While the produce api response from Kafka does indicate that the client is being throttled to the Kafka Produce java client, this information is not exposed to the end user of that client (ie Kafka REST), so we have to infer the throttling another way.

Kafka throttles by delaying its response back to the client. So we can assume that if we are getting a growing backlog of requests waiting to be sent to kafka then kafka is throttling REST.

When this happens (or we have a backlog of calls to Kafka for some other reason) we are obliged to send responses back to kafka in the same order as the requests arrived with us, so all we can do is add 429 responses, once we hit the "throttle from this point "queue depth, to the end of the response queue. The requests for these are not sent to kafka, and so should reduce the traffic reaching kafka, possibly enough to bring it back under the throttle limit.

If the queue depth doesn't shrink sufficiently, and instead grows to the max limit, then the connection is chopped after the grace period in my previous (disconnect clients nicely) PR has expired.

@ehumber
Copy link
Member Author

ehumber commented Jul 18, 2022

@dimitarndimitrov The failing unit test I mentioned is testWriteToChunkedOutputAfterTimeout. (although the new 429 test fails in the same way too, I've not been testing that one)

@ehumber ehumber force-pushed the KREST-2746-reflect-kafka-slow-back-to-clients branch from baaf2ea to d3ef908 Compare July 18, 2022 13:40
@ehumber
Copy link
Member Author

ehumber commented Jul 18, 2022

@dimitarndimitrov ignore the first commit, I'd not tidied up properly. This one shows the hang behaviour from easymock where the close on the mappingIterator never returns:

logs look like

** has next delayed response
THREAD EXECUTOR TRIGGERSclass io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
THREAD EXECUTOR closeAll from thread executor call through to a sub close class io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
CLOSE could be from either THREAD or FINALLY ->  calls through to inputStreaming closeclass io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
CLOSE could be either from THREAD or FINALLY -> calls through to inputStreaming closeclass io.confluent.kafkarest.response.StreamingResponse$InputStreamingResponse
CLOSE in json stream which calls mapping iterator close
delegate not nullclass com.fasterxml.jackson.databind.MappingIterator$$EnhancerByCGLIB$$4a178d4f

Process finished with exit code 130 (interrupted by signal 2: SIGINT)

@ehumber
Copy link
Member Author

ehumber commented Jul 18, 2022

and with the debugger running it looks like this, and passes the test.

There is a difference (after I said there wasn't) :)

the debug line writing to sink.

This only writes out if I have a breakpoint in the async writing to sink method (eg line 471), otherwise I see the same behaviour with and without the debugger

THREAD EXECUTOR TRIGGERSclass io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
THREAD EXECUTOR closeAll from thread executor call through to a sub close class io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
CLOSE could be from either THREAD or FINALLY ->  calls through to inputStreaming closeclass io.confluent.kafkarest.response.StreamingResponse$ComposingStreamingResponse
CLOSE could be either from THREAD or FINALLY -> calls through to inputStreaming closeclass io.confluent.kafkarest.response.StreamingResponse$InputStreamingResponse
[2022-07-18 14:51:09,785] DEBUG Writing to sink (io.confluent.kafkarest.response.StreamingResponse:477)
CLOSE in json stream which calls mapping iterator close
delegate not nullclass com.fasterxml.jackson.databind.MappingIterator$$EnhancerByCGLIB$$4a178d4f
delegate now closed
THREAD EXECUTOR closeAll from thread executor calling responsequeue.close
CLOSE response queue from FINALLY or THREAD class io.confluent.kafkarest.response.StreamingResponse$AsyncResponseQueue

So that's a nice big clue I'm going to go investigate :)

@ehumber ehumber force-pushed the KREST-2746-reflect-kafka-slow-back-to-clients branch 5 times, most recently from 700f956 to 3c4fe45 Compare July 19, 2022 21:33
@ehumber ehumber force-pushed the KREST-2746-reflect-kafka-slow-back-to-clients branch from 3c4fe45 to 9a95641 Compare July 19, 2022 21:41
@ehumber ehumber marked this pull request as ready for review July 19, 2022 21:50
@ehumber ehumber requested a review from a team as a code owner July 19, 2022 21:50
@ehumber
Copy link
Member Author

ehumber commented Oct 14, 2022

@dimitarndimitrov @AndrewJSchofield This PR could do with merging before mid-November if possible, so that we can deal with kafka back pressure before increasing (or removing) any byte based rate limits for produce.

It would be great if I could get some feedback by the end of October.

@@ -196,12 +216,24 @@ public final void resume(AsyncResponse asyncResponse) {
CompletableFuture.completedFuture(
ResultOrError.error(EXCEPTION_MAPPER.toErrorResponse(e))));
} finally {
// if there are still outstanding response to send back, for example hasNext has returned
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the MAX_CLOSE_RETRIES

private void triggerDelayedClose(
ScheduledExecutorService executorService, AsyncResponseQueue responseQueue) {
if (executorService == null) {
executorService = Executors.newSingleThreadScheduledExecutor();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is null, so it's not an object reference that we can overwrite/edit at this point.

Need to eg return the executor service and set the original object to that value

@@ -347,14 +438,16 @@ private void asyncResume(AsyncResponse asyncResponse) {
asyncResponse.resume(Response.ok(sink).build());
}

private volatile boolean sinkClosed = false;
volatile boolean sinkClosed = false;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this no longer private?

private volatile boolean sinkClosed = false;
volatile boolean sinkClosed = false;

private volatile AtomicInteger queueDepth = new AtomicInteger(0);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be volatile

@@ -839,7 +841,7 @@ private static ProduceAction getProduceAction(
replay(produceControllerProvider, produceController);

StreamingResponseFactory streamingResponseFactory =
new StreamingResponseFactory(chunkedOutputFactory, FIVE_SECONDS_MS, FIVE_SECONDS_MS);
new StreamingResponseFactory(chunkedOutputFactory, FIVE_SECONDS_MS, FIVE_MS, DEPTH, DEPTH);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename these to be eg DURATION or GRACE_DURATION etc, need to go look up the values atm

// no third message
expect(requestsMappingIterator.hasNext()).andReturn(false);

requestsMappingIterator.close(); // call from thread executor
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't do an expect with something that returns a void, need to do the playback of the calls one by one :'(

Add a comment explaining this, because this looks bonkers.

expect(mockedChunkedOutput.isClosed()).andReturn(false);
mockedChunkedOutput.write(sucessResult);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two c's in success.

Copy link
Member

@AndrewJSchofield AndrewJSchofield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code with @ehumber and am happy to approve. I do not think this is a sensible way to handle streamed REST produce requests which overrun the KafkaProducer request limits, but this does at least prevent heap exhaustion when it happens. I think there's a significant refactor due in this code at some point.

@ehumber
Copy link
Member Author

ehumber commented Jan 6, 2023

Sorry I didn't manage to get this tidied up in time :(.

I made a PR from a branch in confluentinc/kafka-rest, so if my original branch from my fork goes missing when I leave, hopefully you still have the code

#1098

For this PR I recommend you don't merge it until you need to. That will most probably be when you look at rate limiting around REST produce, and potentially remove it in REST and rely on backpressure from Kafka to keep the requests at a sensible limit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants