Fatal ClassCastException in CoroutineDispatcher#releaseInterceptedContinuation #3773

Closed
Tmpod opened this issue Jun 5, 2023 · 11 comments
@Tmpod

Tmpod commented Jun 5, 2023

Describe the bug

I've been getting the following exception from a third-party library using Ktor, seemingly at random:

Exception in thread "DefaultDispatcher-worker-11" kotlinx.coroutines.CoroutinesInternalError: Fatal exception in coroutines machinery for DispatchedContinuation[Dispatchers.Default, Continuation at io.ktor.client.HttpClient.execute$ktor_client_core(HttpClient.kt:191)@4b98bc73]. Please read KDoc to 'handleFatalException' method and report this incident to maintainers
    at kotlinx.coroutines.DispatchedTask.handleFatalException(DispatchedTask.kt:144)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:115)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
    Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [CoroutineId(546), "coroutine#546":StandaloneCoroutine{Completed}@73f0a21a, Dispatchers.Default]
Caused by: java.lang.ClassCastException: class kotlin.coroutines.jvm.internal.CompletedContinuation cannot be cast to class kotlinx.coroutines.internal.DispatchedContinuation (kotlin.coroutines.jvm.internal.CompletedContinuation and kotlinx.coroutines.internal.DispatchedContinuation are in unnamed module of loader 'app')
    at kotlinx.coroutines.CoroutineDispatcher.releaseInterceptedContinuation(CoroutineDispatcher.kt:166)
    at kotlin.coroutines.jvm.internal.ContinuationImpl.releaseIntercepted(ContinuationImpl.kt:118)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:39)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
    ... 4 more

(obtained with coroutine debug mode enabled)
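For anyone trying to collect a similar trace, here is a minimal sketch of switching debug mode on from code. It assumes the `kotlinx.coroutines.debug` system property is set before any kotlinx.coroutines class is loaded. The helper name `enableCoroutineDebugMode` is made up for this example. Passing `-Dkotlinx.coroutines.debug=on` as a JVM argument should have the same effect.

```kotlin
// Hypothetical helper, not part of kotlinx.coroutines.
// It only sets the system property that the library reads at class-load time.
fun enableCoroutineDebugMode() {
    System.setProperty("kotlinx.coroutines.debug", "on")
}

fun main() {
    // Call this first, before anything touches kotlinx.coroutines.
    enableCoroutineDebugMode()
    println(System.getProperty("kotlinx.coroutines.debug")) // prints "on"
}
```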

After looking at the kotlinx.coroutines code, I found that ContinuationImpl#releaseIntercepted must be getting called twice. The second call passes a CompletedContinuation to CoroutineDispatcher#releaseInterceptedContinuation, where the cast to DispatchedContinuation fails with a ClassCastException. According to the linked comment, this should only happen in case of a compiler bug.
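To illustrate the suspected mechanism, here is a simplified, self-contained model of a double release. The class names only mirror the frames in the stack trace; none of this is the actual library code. It assumes that the first release leaves a sentinel behind, as the ClassCastException message suggests.

```kotlin
// Simplified stand-ins; names are hypothetical and only echo the stack trace.
interface Cont

// Plays the role of DispatchedContinuation.
class DispatchedCont : Cont {
    var released = false
    fun release() { released = true }
}

// Plays the role of CompletedContinuation (a sentinel object).
object CompletedCont : Cont

// Plays the role of CoroutineDispatcher.
class Dispatcher {
    fun interceptContinuation(): Cont = DispatchedCont()

    fun releaseInterceptedContinuation(c: Cont) {
        // Unconditional cast: only safe if each continuation is released once.
        (c as DispatchedCont).release()
    }
}

// Plays the role of ContinuationImpl.
class ContImpl(private val dispatcher: Dispatcher) {
    private var intercepted: Cont? = dispatcher.interceptContinuation()

    fun releaseIntercepted() {
        val current = intercepted
        if (current != null) dispatcher.releaseInterceptedContinuation(current)
        intercepted = CompletedCont // sentinel left behind after a release
    }
}

// Returns true if the second release throws a ClassCastException.
fun secondReleaseFails(): Boolean {
    val cont = ContImpl(Dispatcher())
    cont.releaseIntercepted() // first release: the cast succeeds
    return try {
        cont.releaseIntercepted() // second release: the sentinel hits the cast
        false
    } catch (e: ClassCastException) {
        true
    }
}

fun main() {
    println(secondReleaseFails()) // prints "true"
}
```

In this model the cast itself is fine; the failure only appears when something calls the release path a second time, which matches the "compiler bug" wording of the comment.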

I'm using Kotlin 1.8.20, kotlinx.coroutines 1.7.1 and Ktor 2.3.0.

Provide a Reproducer

Due to the seemingly random nature of this bug's appearances, and the fact that my codebase is quite large and doesn't even interact with Ktor directly (but rather through an API wrapper), I can't spot the culprit or provide a minimal working reproducer. I've also asked around in that wrapper's support channels and nobody has seen a similar error, so I doubt the issue stems from it.


PS: I have submitted a bug report on Kotlin's YouTrack board because the comment said it was a compiler bug. However, someone advised me to post it on this tracker as well: https://youtrack.jetbrains.com/issue/KT-59090/Fatal-exception-in-coroutines-machinery-for-DispatchedContinuation-in-Ktor-HttpClientexecute

@Tmpod Tmpod added the bug label Jun 5, 2023
@qwwdfsad
Member

qwwdfsad commented Jun 6, 2023

Is it possible to pinpoint the exact place in the API wrapper that fails? Maybe an entry point?

@Tmpod
Author

Tmpod commented Jun 6, 2023

It seemed to be happening on REST requests at random. I will stress test it more later today and report back anything I can find.

@Tmpod
Author

Tmpod commented Jun 6, 2023

Actually, I had a bit of time to test a function that had seemed particularly problematic to me the other day.

The function is basically this:

context(Foo)
suspend fun Bar.close() {
    // these behaviours come from Kord, the third-party library
    // "user" and "channel" are fields of Bar.
    val userBehavior = UserBehavior(user, kord)
    val channelBehavior = MessageChannelBehavior(channel, kord)

    // does a POST request
    userBehavior.getDmChannel().createEmbed { ... }

    // does another POST request
    // "config" and "kord" come from Foo. The latter is a Kord object.
    kord.rest.channel.createMessage(config.stuff) { ... }

    // does a DELETE request
    channelBehavior.delete()

    // does an update on a CoroutineCollection from KMongo
    // "collection" comes from Foo
    collection.updateOne(...)
}

The exception seems to be thrown fairly consistently here. What's even more curious is that delete doesn't seem to run (although it has run a very few times). Everything else works -- the messages are sent and the Mongo document is updated.

I tried putting a breakpoint on each of those four calls to see exactly which one triggered the exception. It turned out that it appeared after I continued past the KMongo call and the function exited.
I also tried reordering some of those calls, and the exception always appeared at the end of the function.

Any suggestions on how to further debug this?

@qwwdfsad
Member

Could you please see if the following workaround helps?
#2930 (comment)

@Tmpod
Author

Tmpod commented Jun 23, 2023

I've tried some tests.
Running with it hasn't triggered anything, but neither has running without it... It's quite unreliable. When it starts happening it typically happens quite often in the same place, however, if I leave it and come back a while later, the issue vanishes without explanation and will likely appear somewhere else later on.
Makes it very hard to debug :/

@Tmpod
Author

Tmpod commented Jun 23, 2023

Poked around some more and eventually got the bugger to appear again (in a completely different spot, as expected). Unfortunately, setting that property didn't help. The error persists both with and without Ktor's SFG.

@qwwdfsad
Member

qwwdfsad commented Jun 23, 2023

Thank you, we'll start looking into it more extensively.

The biggest problem here is the scope -- ClassCastException may indicate one of two things:

  • We have a bug in the library somewhere, most likely around parallel execution (thus, hard to reproduce). This one we can usually tackle, and with a stable reproducer we can typically address the problem in a reasonable timeframe
  • Compiler bug. This one is much harder, because staring at the code in search of a bug is not enough: the code might be perfectly correct! It's the particular call site that is miscompiled, and with coroutines miscompilations are especially hard to trigger: a very specific interleaving of suspend, inline, crossinline etc. might be required to reproduce the problem

I'm not sure which one it is here. Based on the anecdotal evidence (i.e. the stream of reports) I have regarding this very CCE, if it reproduces without SFG (which is, on its own, quite an unpleasant beast), the problem might be with context receivers: they are a preview feature and might have tricky problems.

I suggest doing the following:

  1. Get rid of the context receiver in this function and replace it with boilerplate code to see if it helps

  2. For the function you reported (Bar.close), it would be nice to have the following information:

  • For each line, whether it is a suspend call or not
  • For each line, whether there are any inline functions
  • For each line, whether it mentions a value class
  • For each line, whether the function calls have a context receiver as well
  • For each line with a lambda, whether the lambda is noinline/crossinline/regular.

Sorry for such a long list, though! With this information, we might be able to hand-craft a similar "reproducer" and study the bytecode for potential miscompilations

@Tmpod
Author

Tmpod commented Jun 24, 2023

Thank you for your response! Will try to get all that info as soon as I'm able to.

In the meantime, I forgot to mention a small detail regarding yesterday's testing: the exception, although very similar, has changed slightly:

Exception in thread "DefaultDispatcher-worker-9" kotlinx.coroutines.CoroutinesInternalError: Fatal exception in coroutines machinery for AwaitContinuation(DispatchedContinuation[Dispatchers.Default, Continuation at io.ktor.client.engine.HttpClientEngine$DefaultImpls.executeWithinCallContext(HttpClientEngine.kt:100)@4d48f43b]){Cancelled}@ae90b22. Please read KDoc to 'handleFatalException' method and report this incident to maintainers
	at kotlinx.coroutines.DispatchedTask.handleFatalException(DispatchedTask.kt:144)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:115)
	at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
	at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
	Suppressed: kotlinx.coroutines.internal.DiagnosticCoroutineContextException: [CoroutineId(335), "coroutine#335":StandaloneCoroutine{Completed}@2dbef12b, Dispatchers.Default]
Caused by: java.lang.ClassCastException: class kotlin.coroutines.jvm.internal.CompletedContinuation cannot be cast to class kotlinx.coroutines.internal.DispatchedContinuation (kotlin.coroutines.jvm.internal.CompletedContinuation and kotlinx.coroutines.internal.DispatchedContinuation are in unnamed module of loader 'app')
	at kotlinx.coroutines.CoroutineDispatcher.releaseInterceptedContinuation(CoroutineDispatcher.kt:166)
	at kotlin.coroutines.jvm.internal.ContinuationImpl.releaseIntercepted(ContinuationImpl.kt:118)
	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:39)
	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:104)
	... 4 more

I'm unsure if this helps trace the error in any way.
This change was likely caused by testing this on a recent Kord snapshot, which uses Kotlin 1.8.21 instead of 1.8.20.

@Tmpod
Author

Tmpod commented Jun 24, 2023

First of all, with a bit more poking around, I managed to get the bug to appear on that close function again, while also manifesting itself on the new location, quite consistently.

  1. Get rid of the context receiver in this function and replace it with boilerplate code to see if it helps

I hacked together a context receiver free thingy, and indeed it completely fixed the problem on both locations.
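A context-receiver-free version of such a function might look roughly like the sketch below. `Foo` and `Bar` are stand-ins with made-up fields, not the real classes, and the original context-receiver form is shown only as a comment because it needs the `-Xcontext-receivers` compiler flag.

```kotlin
// Hypothetical stand-ins for the types in this thread; the fields are made up.
class Foo(val configName: String)
class Bar(val channel: String)

// Before (preview feature, compiled with -Xcontext-receivers):
//
//     context(Foo)
//     suspend fun Bar.close(): String = "$configName: closing $channel"

// After: Foo is an ordinary parameter, so no context receiver is involved.
suspend fun Bar.close(foo: Foo): String = "${foo.configName}: closing $channel"

// A plain `suspend fun main` needs only the standard library.
suspend fun main() {
    val foo = Foo("support")
    println(Bar("ticket-1").close(foo)) // prints "support: closing ticket-1"
}
```

The cost is that every caller now has to pass `foo` explicitly, which is the boilerplate the context receiver was meant to remove.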

  2. For the function you reported (Bar.close), it would be nice to have the following information:

    • For each line, whether it is a suspend call or not
    • For each line, whether there are any inline functions
    • For each line, whether it mentions a value class
    • For each line, whether the function calls have a context receiver as well
    • For each line with a lambda, whether the lambda is noinline/crossinline/regular.

Here's a more detailed view of that function:

context(Foo)
suspend fun Bar.close() {
    val userBehavior = UserBehavior(user, kord)                  // constructor
    val channelBehavior = MessageChannelBehavior(channel, kord)  // constructor

    runCatching {
        userBehavior.getDmChannel().createEmbed {                // suspend inline
            // it's just setting some properties
        }
    }

    // tries getting from cache, might do a REST call
    val user = userBehavior.asUserOrNull()                       // suspend

    config.stuff.createMessage {                                 // suspend inline;
                                                                 // extension function, "alias" for kord.rest.channel.createMessage, also a suspend inline fun;
                                                                 // context receiver is Foo

        channelBehavior.someExtensionFunc()                      // suspend
        // some more property setting
    }

    channelBehavior.delete()                                     // suspend

    collection.updateByChannel(...)                              // suspend; extension function on CoroutineCollection that does some base filtering
}

There are no noinline/crossinline parameters and no value classes.

As for the second location, it's a simple function that does a REST call and sets up a callback. Nothing in it gets run, the exception is always thrown right at the beginning.

@qwwdfsad
Member

qwwdfsad commented Jun 26, 2023

Thanks!

It's definitely a compiler bug. I've checked the YT tracker, and this seems to be the one: https://youtrack.jetbrains.com/issue/KT-53551
I'll close this issue then; you can keep track of the fix in the YT ticket.

@Tmpod
Author

Tmpod commented Jun 26, 2023

Alright, thank you! In the meantime, I suppose I can fix this by avoiding context receivers on suspend functions.
