Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompletionHandlerException caused by IllegalStateException: Mutex is not locked #2683

Closed
brewin opened this issue Apr 29, 2021 · 10 comments
Closed
Labels

Comments

@brewin
Copy link

brewin commented Apr 29, 2021

I'm getting reports of a CompletionHandlerException using Coroutines 1.4.3 and Coroutines 1.5.0-RC. I have no reports of this with Coroutines 1.4.2. I don't know whether that's just a coincidence or not. The exception occurs fairly rarely.

kotlinx.coroutines.CompletionHandlerException: 
  at kotlinx.coroutines.CancellableContinuationImpl.callOnCancellation (CancellableContinuationImpl.java:225)
  at kotlinx.coroutines.CompletedContinuation.invokeHandlers (CompletedContinuation.java:591)
  at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core (CancellableContinuationImpl.java:149)
  at kotlinx.coroutines.DispatchedTask.run (DispatchedTask.java:100)
  at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely (CoroutineScheduler.java:571)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask (CoroutineScheduler.java:750)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker (CoroutineScheduler.java:678)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run (CoroutineScheduler.java:665)
Caused by: java.lang.IllegalStateException: Mutex is not locked
  at kotlinx.coroutines.sync.MutexImpl.unlock (MutexImpl.java:308)
  at kotlinx.coroutines.sync.MutexImpl$LockCont$tryResumeLockWaiter$1.invoke (MutexImpl.java:366)
  at kotlinx.coroutines.sync.MutexImpl$LockCont$tryResumeLockWaiter$1.invoke (MutexImpl.java:364)
  at kotlinx.coroutines.CancellableContinuationImpl.callOnCancellation (CancellableContinuationImpl.java:220)
  at kotlinx.coroutines.CompletedContinuation.invokeHandlers (CompletedContinuation.java:591)
  at kotlinx.coroutines.CancellableContinuationImpl.cancelCompletedResult$kotlinx_coroutines_core (CancellableContinuationImpl.java:149)
  at kotlinx.coroutines.DispatchedTask.run (DispatchedTask.java:100)
  at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely (CoroutineScheduler.java:571)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask (CoroutineScheduler.java:750)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker (CoroutineScheduler.java:678)
  at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run (CoroutineScheduler.java:665)
@dkhalanskyjb
Copy link
Collaborator

It does look like you were affected by a breaking change in 1.4.3: #1937. In particular, the change affects mutexes: before, lock either acquired a lock successfully or it did not; now, it is possible for it to acquire a lock, then observe that the coroutine was cancelled, and then call unlock. In the meantime, the mutex would be locked.

Thus, if it was observed that Mutex.lock was successfully cancelled, in 1.4.2 and earlier it would mean that the lock was not acquired by the call to lock at any point; maybe your code relies on such an assumption? Nowadays, there is an entirely different guarantee, called "prompt cancellation"; see https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.sync/-mutex/lock.html.

CC @elizarov.

@brewin
Copy link
Author

brewin commented Apr 30, 2021

I don't use a mutex in my code, but I do use Ktor which uses one in Logging. If I understand correctly, it is unsafe to call unlock here in a finally since lock might have been cancelled and the mutex automatically unlocked?

Would it make sense to have a tryUnlock?

@dkhalanskyjb
Copy link
Collaborator

Would it make sense to have a tryUnlock?

It would not. Code should only call unlock if it knows that it "owns" the lock, which is only possible if the lock is locked. Calling unlock when someone else owns the lock is generally incorrect, and tryUnlock implies that the caller doesn't know if the lock is locked.

Looking at the code, it does look suspicious: if scope gets cancelled during feature.beginLogging(), then mutex.unlock() gets called on a non-locked mutex. I haven't worked with Ktor and am not sure if this could realistically happen. @cy6erGn0m, could you please take a look?

@cy6erGn0m
Copy link
Contributor

cy6erGn0m commented May 11, 2021

Would it make sense to have a tryUnlock?

It would not. Code should only call unlock if it knows that it "owns" the lock, which is only possible if the lock is locked. Calling unlock when someone else owns the lock is generally incorrect, and tryUnlock implies that the caller doesn't know if the lock is locked.

Looking at the code, it does look suspicious: if scope gets cancelled during feature.beginLogging(), then mutex.unlock() gets called on a non-locked mutex. I haven't worked with Ktor and am not sure if this could realistically happen. @cy6erGn0m, could you please take a look?

Thanks for pointing out to the problem. It looks like it's possible but since the stacktrace doesn't contain any ktor routines, there is a chance that the particular issue is not caused by ktor itself. Opened KTOR-2646

@brewin
Copy link
Author

brewin commented May 11, 2021

Thanks, guys. For now, I'll disable Ktor Logging and hope that prevents the crashes. I wonder if the exception could be improved to better explain what might have caused it. It is odd that the stacktrace doesn't point to my code or any library I'm using.

@brewin brewin closed this as completed May 11, 2021
@cy6erGn0m
Copy link
Contributor

The most important part of the stacktrace is:

Caused by: java.lang.IllegalStateException: Mutex is not locked
  at kotlinx.coroutines.sync.MutexImpl.unlock (MutexImpl.java:308)
  at kotlinx.coroutines.sync.MutexImpl$LockCont$tryResumeLockWaiter$1.invoke (MutexImpl.java:366)
  at kotlinx.coroutines.sync.MutexImpl$LockCont$tryResumeLockWaiter$1.invoke (MutexImpl.java:364)
  at kotlinx.coroutines.CancellableContinuationImpl.callOnCancellation (CancellableContinuationImpl.java:220)

Looking at this part I clearly see that the attempt to unlock is made from a lock continuation. So there is a suspended lock invocation that is cancelled, and its cancellation causes unlock while the lock is not yet done. So in spite of that, there is a similar bug in ktor, this particular stack trace proves that there is also a bug in the mutex implementation. This is why I am reopening this.

@cy6erGn0m cy6erGn0m reopened this May 12, 2021
@dkhalanskyjb
Copy link
Collaborator

dkhalanskyjb commented May 12, 2021

@cy6erGn0m I don't think it's a bug in the mutex. Consider the following valid interleaving that could lead to this:

  1. beginLogging is called, but doesn't manage to acquire the lock.
  2. Cancellation happens.
  3. beginLogging manages to acquire the lock. Then, it observes that cancellation happened, throws CancellationException, and plans to concurrently release the lock. (This behavior was implemented in 1.4.3, in Breaking: Get rid of atomic cancellation and provide a replacement #1937; before, if the lock was successfully acquired, the call to lock wasn't considered cancelled, didn't throw and didn't release the lock).
  4. On the Ktor's side, beginLogging throws, and the finally block is executed, releasing the lock acquired by beginLogging.
  5. At last, beginLogging releases the lock in its cancellation handler.

There is a race between steps 4 and 5, and if 4 happens earlier than 5, then 5 will throw with the stacktrace that we observe here. I don't know what happens if 5 happens earlier than 4: then 4 must throw, but we don't observe it, so maybe the exception is getting eaten somewhere.

@cy6erGn0m
Copy link
Contributor

What i am saying is that there are two bugs in both ktor and mutex.

@dkhalanskyjb
Copy link
Collaborator

Why do you think there's a bug in the mutex?

@dkhalanskyjb
Copy link
Collaborator

After an internal discussion, we concluded that this stacktrace could be caused purely by the bug in Ktor, and also that it doesn't look like there is an obvious bug in the mutex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants