Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handleCoroutineException() prevents crashing JVM threads on fatal errors #2552

Closed
yorickhenning opened this issue Feb 25, 2021 · 10 comments
Closed
Assignees
Labels

Comments

@yorickhenning
Copy link
Contributor

In a JVM, a coroutine should be able to handle a fatal error by crashing its Thread.

In the JVM default, handleCoroutineExceptionImpl() calls a ServiceLoader in the error handling execution path. That's not safe, because the call to the ServiceLoader can itself throw.

This prompted investigating coroutine crashes, which showed some Thread crashes, but not for all fatal errors. The only Thread crashes had the coroutine's fatal error hidden by the ServiceLoader call throwing during error handling - either because the first error was an OutOfMemoryError, or because the ServiceLoader’s (internally lazy) iterator crashed by reading from disk during error handling, on a Thread that’s prevented from doing I/O.

Error handling is customizable with CoroutineExceptionHandler. I thought we'd be able to avoid the ServiceLoader problem with a custom handler, and pass fatal exceptions to the Executor just by throwing. Unfortunately, if a CoroutineExceptionHandler tries to crash, handleCoroutineException() catches and starts executing the default handleCoroutineExceptionImpl()!

So, I see two issues:

  1. By falling-back-to-default on a crash, handleCoroutineException() prevents a CoroutineExceptionHandler from crashing its Thread if it would like to do that
  2. The JVM implementation of handleCoroutineExceptionImpl() can throw more exceptions when trying to handle an existing one

To fix the first issue, I'd like to propose this implementation of handleCoroutineException():

@InternalCoroutinesApi
public fun handleCoroutineException(context: CoroutineContext, exception: Throwable) {
    // Invoke an exception handler from the context if one is present. Otherwise, fall back
    // to the default Exception handler.

    val handler = context[CoroutineExceptionHandler]
    if (handler != null) {
      handler.handleException(context, exception)
    } else {
      handleCoroutineExceptionImpl(context, exception)
    }
}

This would make a CoroutineExceptionHandler a proper override.

In existing (big) JVM applications, crash handlers are configured for Threads or Executors. Letting coroutines crash would let the one set of loggers, monitoring, test instrumentation, etc. deal with any fatal errors consistently. By preventing coroutines from crashing, unhandled errors in coroutines need to be configured with their own entirely separate error handlers, in addition to any the program has at the Executor or Thread level.

@qwwdfsad qwwdfsad self-assigned this Mar 23, 2021
@qwwdfsad
Copy link
Member

qwwdfsad commented May 21, 2021

I am a bit concerned about the proposed solution.
CoroutineExceptionHandler is a user-supplied class and it may throw for any reason, e.g. a bug in the user implementation. In that case, we still want to deliver it in any available way to avoid ignored exceptions.

handleCoroutineException(context: CoroutineContext, exception: Throwable)(context: CoroutineContext, exception: Throwable) itself should be nothrow at all.
It is invoked from various parts of our infrastructure and exceptions are not expected there, meaning that throwing handleCoroutineException may deadlock an arbitrary coroutine hierarchy.

Yet the problem with throwing ServiceLoader in <clinit> that prevents any further exception handling is a serious one.
I've tried to prototype more robust handling here: #2724
Could you please tell whether such an approach fixes your concerns or there is anything else?

@yorickhenning
Copy link
Contributor Author

yorickhenning commented Jun 9, 2021

CoroutineExceptionHandler is a user-supplied class and it may throw for any reason, e.g. a bug in the user implementation.

It can, yes. That's Exceptions for you. And user-defined functions.

When a coroutine throws a fatal exception, I'd like for the error handling path not to execute statements I didn't write into the program. It's a fatal error, so the JVM's process will halt and post-mortem after my error handler executes. Given that, I'm cool with other coroutines in that JVM process halting. Best we can hope is that they're undisturbed for a post-mortem heap snapshot.

CoroutineExceptionHandler is equivalent to a catch around the body of main(). It should be able to crash the JVM process if it itself experiences a fatal error. The safest way to do that is to rethrow to the Thread.run() stack frame.

It is invoked from various parts of our infrastructure and exceptions are not expected there, meaning that throwing handleCoroutineException may deadlock an arbitrary coroutine hierarchy.

Arbitrary? Or just the coroutine that experienced an unhandled Exception, with all the expected failure to advance for any coroutine waiting on the completion of a fatally-excepted coroutine that is no longer live?

JVMs continue program execution after effectively-unrecoverable errors, for better or for (mostly) worse. Errors that inherits from VirtualMachineError.

Those errors are what I'm trying to deliver to monitoring when an expression in a coroutine abruptly completes and throws one. It's (usually) possible to get a VirtualMachineError through to monitoring before the JVM deadlocks or exits, but it requires care. It's interesting to do so when the process that forked to start the JVM process itself isn't controlled by the programmer - like on Android - but it means avoiding memory allocation, and avoiding the possibility of additional abrupt completions in statements executed by the Thread after the first abrupt completion has been caught.

Coroutines effectively wraps all coroutines in this try / catch:

try {
  myBusinessLogicCoroutine()
} catch (e: Throwable) {
  handleCoroutineExceptionImpl(e)
}

My understanding of how this works is approximate, but here goes.

When an expression abruptly completes ((JLS 15.6) and throws a VirtualMachineError, modern JVMs reliably execute the first catch {} for that error. The JVM preallocates memory for the backtrace.

If a catch {} directly re-throw {}s that first VirtualMachineError, execution resumes at the next wrapping catch {}. That's reliably repeatable.

However, if the executing catch {} for a VirtualMachineError abruptly completes an expression again, the JVM drifts into undefined behaviour: the JVM may not successfully allocate the second Error. Even if it does, and sets the prior VirtualMachineError as the cause, the JVM may not successfully resume execution at the next catch {}. Some JVMs have a margin of error for repeated failure, but at some point they'll give up.

Today, handleCoroutineExceptionImpl() is an un-overridable catch {} that can abruptly complete an expression after catching a VirtualMachineError. Its body evaluates lazy expressions and does service loading, either of which will allocate memory after having caught an OutOfMemoryError and abruptly complete a second time. That's a problem.

PR 2724 adds locking an intrinsic lock, which is (technically) a new kind of thing that can abruptly complete an expression after the first VirtualMachineError has been thrown and caught. The comment in the patch hints that it's an unsolved recursive case whenever the "last-ditch" error handler throws:

        // TODO report error from loadHandlers as well?
        val loadedHandlers = handlers ?: loadHandlers() ?: emptyList()

loadHandlers() allocates, so it may abruptly complete. The expression that may throw isn't inside a catch {}, so if the JVM successfully allocates its second Error it'll still crash the Thread and halt coroutines, it'll just have lost the first Error's context while doing so.

The best thing to do on a fatal error is to proceed to halting the JVM by crashing the thread as graciously as possible. Abstractions on top of that crash path make achieving what's still possible within a crashing-and-burning JVM significantly harder to achieve.

  1. The JVM implementation of handleCoroutineExceptionImpl() can throw more exceptions when trying to handle an existing one

It's not possible to avoid this when an error handler executes arbitrary statements. If the user-defined last-ditch error handler isn't exception-safe, the program can crash again and potentially lose state.

That's just the risk when overriding CoroutineExceptionHandler, Thread.UncaughtExceptionHandler, wrapping all tasks to Executor.execute {} in a catch {}, or whatnot.

At the moment, and with the proposed PR 2724, the Coroutines library adds exception-unsafe statements to the critical path of all errors, which makes exception-safe error handling in a program using the library much more difficult to implement.

I've tried to prototype more robust handling here: #2724
Could you please tell whether such an approach fixes your concerns or there is anything else?

It doesn't address my concerns, and I hope I've provided enough context as to why not.

I'm happy to improve the default exception handler. However, my requirement isn't a safer exception handler, it's to supply my own exception handler, and to live with the consequences. I'm trying to get from an uncaught exception in a coroutine to an exited JVM process as expeditiously as possible. I'd prefer if Coroutines didn't add catch statements that call exception-unsafe default code around all my functions, because it's quite difficult to work around.

If Coroutines tries to safely recover liveness when the JVM is in the middle of blowing up with a VirtualMachineError, coroutines gets in the way of in-process monitoring we might otherwise successfully execute before process exit.

The liveness of coroutines isn't recoverable after certain JVM errors have happened in a Thread. Even trying to recover creates an unresolved recursive problem of what-happens-if-the-last-ditch-error-handler-throws.

@qwwdfsad
Copy link
Member

Hi,

Regarding throwing service loader and class loading on the exception handling path -- I've reworked the initial solution here 906eae1.

First of all, thanks for the detailed explanation and detailed reasoning!
I understand what you are trying to achieve, but the proposed solution is even more fragile than the current approach.
Currently, it is known that the coroutine exception handler (CEH) cannot crash the thread and any exception will be handled.
It explicitly forces users to deliver such exceptions to monitoring in a robust way from CEH and not relying on the fact that the exception thrown from CEH will reach Thread.run boundary.

Coroutines effectively wraps all coroutines in this try / catch:

It's partially true. It's indeed try-catch, but it doesn't imply that throwing from the catch block is an equivalent of doing the same for the regular thread.

Coroutines are the complex framework that is written in a particular manner: lock-free state machines, plenty of guards for internal invariants, and so on. It does not expect exceptions to be thrown from arbitrary places and handlers. It indeed cannot halt an arbitrary hierarchy, but only the one that is reachable somehow -- via join, invokeOnCompletion, parent-child relationship etc., but it is not the biggest problem.

Consider your change being implemented and then the following example:

@Test
fun repro() = runBlocking<Unit> {
    val deferred = GlobalScope.async {
        GlobalScope.launch(Dispatchers.Unconfined + CoroutineExceptionHandler { _, e ->
            throw e
        }) {
            throw StackOverflowError("Application code never catches VME, so it will reach Thread.run boundary")
        }
        println("Done")
    }
    try {
        println("Awaiting")
        deferred.await()
    } catch (e: Exception) {
        println("Unexpected error during business operation: " + e)
    }
}

Even though the coroutine crashed with VirtualMachineError, the whole program didn't -- it swallowed it! Because of the internal machinery -- Unconfined coroutine has rethrown an exception to the coroutine that does not expect and arbitrary exception being thrown from one of the handlers, leading to a completely lost VirtualMachineError.
At the end of the day, rethrowing any exception from CEH may work in 90% of the cases for 90% of the users, but opens the way for much worse incidents.

Apart from that, there are other issues in a way of exceptions -- CoroutineDispatcher that also somehow handle exceptions thrown right from the coroutines Runnable's, custom politics of uncaught exception handling for various executors, and so on.

It seems like the most robust solution for your use-case is to consistently deliver (and crash, if necessary) exceptions from CoroutineExceptionHandler to the system-wide monitoring.

I suspect that users may re-define CEH for their own purposes and then miss the delivery of fatal exceptions. For that, it seems better to install system CEH via service loader that will be invoked if user-supplied CEH throws

qwwdfsad added a commit that referenced this issue Oct 12, 2021
…#2957)

* Eagerly load CoroutineExceptionHandler and load the corresponding service

Partially addresses #2552
yorickhenning pushed a commit to yorickhenning/kotlinx.coroutines that referenced this issue Oct 14, 2021
…Kotlin#2957)

* Eagerly load CoroutineExceptionHandler and load the corresponding service

Partially addresses Kotlin#2552
@qwwdfsad
Copy link
Member

qwwdfsad commented Dec 7, 2021

Considering it fixed in 1.6.0-RC.

Feel free to continue the discussion if the implemented solution is not enough

@qwwdfsad qwwdfsad closed this as completed Dec 7, 2021
@nreid260
Copy link

I'd like to propose reverting 0b65246 That change means that using coroutines triggers a read from disk (I think on whatever thread happens to create the first job).

The particular reason I care is that we have error detection checks on which threads are doing IO and these reads now trigger violations.

@dkhalanskyjb
Copy link
Collaborator

@nreid260, the only thing the change does is access a class early. If your system is unhappy with that, it probably means that it's unhappy in general with classes being loaded.

@nreid260
Copy link

Do we agree that it may change which thread does the loading?

The fact is that we're seeing violations that we didn't before. Maybe those violations just weren't being reported, and now they are, but something has changed.

I also wonder why coroutines needs to use ServiceLoader at all. It seems kind of strange to me that a foundational library like this would dynamically load code in this way. If there were a way to disable that behaviour, it would address my concerns.

@dkhalanskyjb
Copy link
Collaborator

If the moment at which a class is loaded is important to you, then the fix is to explicitly ensure it, for example, by accessing the class in a safe-for-class-loading thread before anything happens with the coroutines.

@radixdev
Copy link

radixdev commented Feb 9, 2022

@nreid260 Agreed. I filed #3180

@yorickhenning
Copy link
Contributor Author

Even though the coroutine crashed with VirtualMachineError, the whole program didn't -- it swallowed it!

I don't think it does.

In my debugging, the exception gets caught by the async {} block, which fails its returned Deferred. deferred.await() then unpacks the exception and rethrows it. The catch {} catches, and executes the expression println("Unexpected error during business operation: " + e), which prints the exception to STDOUT.

launch {} executed by adding to async {}'s call stack, because of Dispatchers.Unconfined. launch {} threw an unhandled exception. Unhandled exceptions propagate by unwinding the call stack.

That's roughly as expected in a JVM?

Because of the internal machinery -- Unconfined coroutine has rethrown an exception to the coroutine that does not expect and arbitrary exception being thrown from one of the handlers, leading to a completely lost VirtualMachineError.

To reiterate, I don't think the error gets lost.

Coroutines invoked the last-ditch error handler that the user specified for a coroutine. That last-ditch error handler threw an Exception.

Putting aside calling a ServiceLoader, calling Thread. uncaughtExceptionHandler is possibly even more troubling.

Coroutines is running as an abstraction on Executor.execute() continuations, which are running as abstractions on java.util.Thread.run().

Here's the Javadoc for Thread.UncaughtExceptionHandler:

     * Interface for handlers invoked when a {@code Thread} abruptly
     * terminates due to an uncaught exception.

The thread isn't abruptly terminating, but Coroutines is calling the handler anyway.

Here's the Javadoc for uncaughtException():

   * Method invoked when the given thread terminates due to the
   * given uncaught exception.
   *
   * <p>Any exception thrown by this method will be ignored by the
   * Java Virtual Machine.

However, an exception thrown when Coroutines calls this method isn't ignored by the Java Virtual Machine as the handler's interface specifies.

Instead, the exception propagates through the still-running Thread. That Exception will show up somewhere, likely failing an async {}. If not, probably by virtue of Unconfined or UNDISPATCHED, that exception probably still crashes the Thread? By causing the Thread.uncaughtExceptionHandler to execute to handle an exception that the Thread.uncaughtExceptionHandler threw?

Continuing to use a thread after its exception handler has executed is risky. By interface contract, uncaughtException() can do whatever it likes to thread local memory, because the thread halts afterward. Ignoring thread groups for the sake of argument.

It'd be easier and safer to have Coroutines share pool Executors with other libraries if Coroutines didn't break the contract of Thread's uncaught exception handler when a programming error or misconfiguration of a coroutine results in an unhandled error.

It'd be easier to integrate Coroutines into hybrid programs if didn't try to hide a fatal exception from either the Executor.execute() that Coroutines used to execute the coroutine continuation, or the Thread underlying.

It'd be safer to integrate Coroutines into hybrid programs if it weren't programmed so it might return a Thread to a pool Executor after having called that Thread's uncaught exception handler.

If Coroutines isn't willing to drop an exception thrown when the UncaughtExceptionHandler defined for a coroutine crashes, Coroutines should probably just crash the running thread, whatever thread that happens to be.

If that isn't Coroutines' default behaviour, it'd be nice to be able to configure to work that way.

And hopefully also so that it doesn't call a ServiceLoader at class init time.

pablobaxter pushed a commit to pablobaxter/kotlinx.coroutines that referenced this issue Sep 14, 2022
…Kotlin#2957)

* Eagerly load CoroutineExceptionHandler and load the corresponding service

Partially addresses Kotlin#2552
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants