
Bugfix for race condition in outbound flow control #59

Closed
wants to merge 5 commits into from

Conversation

blachris

@blachris blachris commented Aug 1, 2019

A bidi stream call with a high throughput of messages can hang.

This occurred reliably in a throughput test where 100,000 messages of 1 kB were transmitted over an in-process channel to a server and echoed back to the caller. With kroto-plus, the message transfer simply gets stuck and never completes.

The root cause is a race condition in applyOutboundFlowControl() in FlowControl.kt: reading streamObserver.isReady and setting isOutboundJobRunning are not atomic, so the outBoundJob can terminate yet miss being relaunched by an onReady event.

Here I propose a minimally invasive fix, though I also considered refactoring the coroutine to be long-running: suspend while not ready and resume on ready. Please consider addressing this, because the hangup is a deal breaker for my use case and most likely for others as well.
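
For illustration, here is a minimal sketch of the racy pattern described above. Names and structure are simplified and hypothetical, not the actual FlowControl.kt source:

import io.grpc.stub.CallStreamObserver
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch

// Simplified, hypothetical reconstruction of the check-then-set pattern.
internal fun <T> CoroutineScope.applyOutboundFlowControlSketch(
    streamObserver: CallStreamObserver<T>,
    targetChannel: Channel<T>
) {
    val isOutboundJobRunning = AtomicBoolean(false)

    streamObserver.setOnReadyHandler {
        // Relaunch the outbound job only if none is currently marked as running.
        if (streamObserver.isReady && isOutboundJobRunning.compareAndSet(false, true)) {
            launch {
                try {
                    val iterator = targetChannel.iterator()
                    while (streamObserver.isReady && iterator.hasNext()) {
                        streamObserver.onNext(iterator.next())
                    }
                } finally {
                    // Race window: if onReady fires after the loop observes
                    // isReady == false but before this flag is cleared, the
                    // handler above still sees "running" and skips the relaunch.
                    // No coroutine is left to drain the channel and the call hangs.
                    isOutboundJobRunning.set(false)
                }
            }
        }
    }
}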

@marcoferrer
Owner

marcoferrer commented Aug 1, 2019

Thanks for the PR. I'm actually interested in your second approach as well. I like the idea of not having to re-launch the observer consumer.

@marcoferrer marcoferrer closed this Aug 1, 2019
@marcoferrer marcoferrer reopened this Aug 1, 2019
@marcoferrer
Owner

Sorry about that. Didn't mean to close this out.

@blachris
Author

blachris commented Aug 2, 2019

Thanks for your interest. I added and activated the second approach in the pull request, but I am struggling a bit to get the corner cases right and the tests passing.

@marcoferrer
Owner

Do you have a sample project I can use to reproduce the deadlock? I wanted to see what it would take to add it as part of the integration tests. I also wanted to test an approach that uses a mutex during stream consumption to resolve the issue.
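
For reference, a minimal sketch of what a mutex-guarded consumption step might look like; the helper name drainWhileReady and its shape are assumptions for illustration, not code from this repository:

import io.grpc.stub.CallStreamObserver
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical helper: serialize the drain loop behind a Mutex so that
// overlapping onReady callbacks cannot start competing consumers of targetChannel.
internal suspend fun <T> drainWhileReady(
    streamObserver: CallStreamObserver<T>,
    targetChannel: Channel<T>,
    consumptionMutex: Mutex
) {
    consumptionMutex.withLock {
        val iterator = targetChannel.iterator()
        while (streamObserver.isReady && !targetChannel.isClosedForReceive && iterator.hasNext()) {
            streamObserver.onNext(iterator.next())
        }
    }
}

The mutex here only serializes the drain loop itself; how the onReady handler and channel completion would interact with it is left open.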

@blachris
Author

blachris commented Aug 3, 2019

Unfortunately, I haven't succeeded yet in creating a minimal test. I encounter the problem in a complex setup where three gRPC calls are chained together in-process. My theory is that the problem occurs when this flow control coroutine frequently exits and restarts. With a little delay in there, my setup hangs almost immediately:

} finally { // FlowControl.kt line 75
  // Put a little delay/sleep/println here. 
  // If the onReady event fires between here and the next line, 
  // the coroutine exits and no new coroutine is launched.
  isOutboundJobRunning.set(false)
}

@marcoferrer
Owner

marcoferrer commented Aug 4, 2019

Although I can't reproduce the issue by adding a delay there, I can see that it is definitely possible for a race condition to manifest at that point.

I've had some mild success with a solution similar to the following:

// Imports assumed for this snippet (added for clarity); completeSafely refers to
// kroto-plus's internal stream observer helper.
import io.grpc.stub.CallStreamObserver
import kotlinx.coroutines.CoroutineExceptionHandler
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.CoroutineStart
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.actor
import kotlinx.coroutines.channels.consumeEach
import kotlinx.coroutines.channels.sendBlocking

private typealias MessageHandler = suspend CoroutineScope.() -> Unit

internal fun <T> CoroutineScope.applyOutboundFlowControl(
    streamObserver: CallStreamObserver<T>,
    targetChannel: Channel<T>
){

    val channelIterator = targetChannel.iterator()
    val messageHandlerBlock: MessageHandler = {
        while(
            streamObserver.isReady &&
            !targetChannel.isClosedForReceive &&
            channelIterator.hasNext()
        ){
            val value = channelIterator.next()
            streamObserver.onNext(value)
        }
        if(targetChannel.isClosedForReceive){
            streamObserver.onCompleted()
        }
    }

    val messageHandlerActor = actor<MessageHandler>(
        start = CoroutineStart.LAZY,
        capacity = Channel.CONFLATED,
        context = Dispatchers.Unconfined + CoroutineExceptionHandler { _, e ->
            streamObserver.completeSafely(e)
            targetChannel.close(e)
        }
    ) {
        consumeEach {
            if(streamObserver.isReady && !targetChannel.isClosedForReceive){
                it.invoke(this)
            }else{
                streamObserver.completeSafely()
            }
        }
    }

    streamObserver.setOnReadyHandler {
        if(targetChannel.isClosedForReceive){
            streamObserver.completeSafely()
        }else if(
            streamObserver.isReady &&
            !targetChannel.isClosedForReceive
        ){
            // Using sendBlocking here is safe since we're using a conflated channel
            messageHandlerActor.sendBlocking(messageHandlerBlock)
        }
    }
}

With this approach there are still tests that fail, but it looks like those failures are due to the actor inheriting the event loop dispatcher from the runBlocking coroutine in the test. Trying out different solutions at the moment.
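
As a standalone illustration of that inheritance behaviour (not taken from the kroto-plus tests): a builder that is not given an explicit dispatcher inherits runBlocking's single-threaded event loop, while passing a dispatcher overrides it.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val inherited = launch {
        // No dispatcher given: runs on runBlocking's event loop thread.
        println("inherited context: $coroutineContext")
    }
    val explicit = launch(Dispatchers.Default) {
        // Explicit dispatcher: runs on the Default thread pool instead.
        println("explicit context: $coroutineContext")
    }
    inherited.join()
    explicit.join()
}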

@blachris
Author

blachris commented Aug 4, 2019

Thank you for working on this issue. Since my fix attempts just seemed to make things worse, I tried to reproduce the problem and I might have been successful.
In the pull request right now I added a rough unit test that sends 100k messages to a simple server that echoes the messages back. When I run :kroto-plus-coroutines:test on my machine with the 0.4.0 flow control, the test hangs hard. With the flow control plus minimally added synchronization blocks (sketched below), the new test completes in under 20 seconds, but I get occasional test failures in ClientStreamingBackPressureTests > Client send suspends until server invokes receive.
What do you think about the new test?
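
A rough, hypothetical sketch of the locking idea behind "minimally added synchronization blocks"; this is an illustration, not the actual commit in this PR:

// Hypothetical illustration: serialize the relaunch decision in the onReady
// handler and the exit path of the outbound job behind one lock, so the
// "job exits, onReady fires, flag is cleared too late" interleaving is gone.
class OutboundJobGate {
    private val lock = Any()
    private var isOutboundJobRunning = false

    // Called from the onReady handler; true means the caller should launch a new job.
    fun tryClaim(isReady: Boolean): Boolean = synchronized(lock) {
        if (isReady && !isOutboundJobRunning) {
            isOutboundJobRunning = true
            true
        } else {
            false
        }
    }

    // Called when the outbound job is about to exit; true means the stream became
    // ready again, so the job should keep draining instead of terminating.
    fun releaseOrContinue(isReady: Boolean): Boolean = synchronized(lock) {
        if (isReady) {
            true
        } else {
            isOutboundJobRunning = false
            false
        }
    }
}

Because both decisions happen under the same lock, either the exiting job sees the new readiness and keeps draining, or the onReady handler sees the cleared flag and relaunches; the wakeup can no longer be lost.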

@blachris
Author

blachris commented Aug 5, 2019

This pull request is no longer necessary because the issue is being addressed in #61.

@blachris blachris closed this Aug 5, 2019