Multiple subscribeOn calls with a BoundedElasticScheduler can cause deadlock #1992
Comments
Each Having two separate Generally speaking, the goal of the
Thanks for looking into this; I have a few more questions though: how does this differ from
Although this may not be the typical intended use case, it is problematic that it behaves differently from other schedulers. Let me provide a bit more context: this issue surfaced for us as we were passing a thread count as an argument to a class which in turn creates the bounded scheduler. In order to debug an issue we created a unit test, which restricted the thread count to 1 to simplify analysis. The composition of our code is such that it is possible and expected to have multiple Even if the
With I believe The Maybe we need to revisit it so as not to assume long-lived tasks, or more explicitly detail the risk of deadlock in the javadoc; I'm not sure yet.
Possibly related to the same issue, my colleague @jvwilge stumbled upon the following:

```java
import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

public class SchedulerComparison {

    public static void main(String[] args) {
        testScheduler("par-1", Schedulers.newParallel("par"));
        testScheduler("nbe-2-1", Schedulers.newBoundedElastic(2, 1, "nbe-2-1"));
        testScheduler("nbe-1-1", Schedulers.newBoundedElastic(1, 1, "nbe-1-1"));
    }

    private static void testScheduler(String name, Scheduler scheduler) {
        try {
            Flux
                .interval(Duration.ofSeconds(1), scheduler)
                .doOnNext(ignored -> System.out.println(name + " emitted"))
                .publishOn(scheduler)
                .doOnNext(ignored -> System.out.println(name + " published"))
                .blockFirst(Duration.ofSeconds(2));
            System.out.println(name + " completed");
        } catch (Exception error) {
            System.err.println(name + " failed: " + error);
        } finally {
            scheduler.dispose();
            System.out.println(name + " disposed");
        }
    }
}

/*
par-1 emitted
par-1 published
par-1 completed
par-1 disposed
nbe-2-1 emitted
nbe-2-1 published
nbe-2-1 completed
nbe-2-1 disposed
nbe-1-1 failed: java.lang.IllegalStateException: Timeout on blocking read for 2000 MILLISECONDS
nbe-1-1 disposed
*/
```
Each operator that takes the The consequence is that submitting a blocking task to a parallel Downside is that in the case of non-blocking tasks, we don't have re-entrancy: providing a bounded One avenue of mitigation would be to impose a minimum to
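The pinning effect discussed above can be reproduced with a plain JDK thread pool, independently of Reactor. This is an illustrative sketch (not Reactor code): each level of nesting holds a thread while blocking on the level below, so a chain of depth n deadlocks on a pool of fewer than n threads. A timeout is used here so the demo terminates instead of hanging forever.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class NestedSubmitDemo {

    // Each level submits the next level to the same pool and blocks
    // waiting for its result, holding its own thread the whole time.
    static String nested(ExecutorService pool, int depth) throws Exception {
        if (depth == 0) {
            return "ok";
        }
        Future<String> inner = pool.submit(() -> nested(pool, depth - 1));
        return inner.get(500, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        // n levels of nesting need n pool threads: with 3 threads it completes...
        ExecutorService three = Executors.newFixedThreadPool(3);
        System.out.println("3 threads, depth 3: " + nested(three, 3));
        three.shutdownNow();

        // ...but with 2 threads the innermost task can never be scheduled,
        // so without the timeout the whole chain would hang forever.
        ExecutorService two = Executors.newFixedThreadPool(2);
        try {
            System.out.println("2 threads, depth 3: " + nested(two, 3));
        } catch (Exception expected) {
            System.out.println("2 threads, depth 3: timed out");
        } finally {
            two.shutdownNow();
        }
    }
}
```

This mirrors the reported n `subscribeOn` / n-1 `threadCap` observation: the outermost worker pins a thread for the duration of the subscription, starving the inner one.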
My personal experience with Reactor's BoundedElasticScheduler has been nothing but suffering [1]. I am not happy with packaging something that should have been provided by Reactor, but this single file is still way simpler, more maintainable, *and* without any catches compared to the BoundedElasticScheduler.

[1] reactor/reactor-core#1992
@vy I saw your disappointed rollback to your BoundedScheduledThreadPoolExecutor. I'll try to improve the situation with the BoundedElasticScheduler, or see if I can come up with a different implementation... In the meantime, keep in mind that your solution is likely to fall apart with undefined behavior in case the executor is configured with more than one backing thread, as onNext serialization wouldn't be guaranteed anymore.
Hey @simonbasle! It is very kind of you to pay such minute attention to the feature's community implications. I deeply appreciate this, thank you. As I tried to briefly explain in the reactor-pubsub README, after years of experience using RxJava and Reactor to implement reactive solutions, in particular ones that (unfortunately) need to deal with blocking calls (e.g., JDBC), backpressure surfaces as a major issue for systems under load. The available unbounded scheduler solutions are the perfect disguise for this issue while your system is collapsing in production. Just introducing a simple bound on the task queue (as in
Once
To conclude the briefing about my frustration with
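The bounded-task-queue idea mentioned above is straightforward to demonstrate with a plain JDK `ThreadPoolExecutor` (a sketch under stated assumptions, not reactor-pubsub's actual executor): with one worker thread and a queue capacity of one, a third submission is rejected immediately, surfacing the overload instead of hiding it in an unbounded queue.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedQueueDemo {

    public static void main(String[] args) throws Exception {
        CountDownLatch release = new CountDownLatch(1);
        // One worker thread, at most one queued task, and fail fast beyond that.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());

        pool.submit(() -> { release.await(); return null; }); // occupies the only thread
        pool.submit(() -> null);                              // fills the queue
        try {
            pool.submit(() -> null);                          // exceeds the bound
            System.out.println("unexpectedly accepted");
        } catch (RejectedExecutionException e) {
            // Overload surfaces immediately as a rejection, i.e. backpressure.
            System.out.println("third task rejected: queue bound reached");
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }
}
```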
I had guessed that, but could not find any indication of it in the (java)docs. Am I missing something?
The general idea is to abandon the facade Worker and instead always submit tasks to an executor-backed worker. In order of preference, when an operator requests a Worker:

- if the thread cap is not reached, create and pick a new worker
- else, if there are idle workers, pick an idle worker
- else, pick a busy worker

This implies a behavior under contention that is closer to parallel(), but with a pool that is expected to be quite a bit larger than the typical parallel pool. The drawback is that once we get to picking a busy worker, there's no telling when its tasks (typically blocking tasks, for a BoundedElasticScheduler) will finish. So even though another executor might become idle in the meantime, the operator's tasks will be pinned to the (potentially still busy) executor initially picked. To counter that effect a bit, we use a priority queue for the busy executors, favoring executors that are tied to fewer Workers (and thus fewer operators). We don't yet go as far as factoring in the task queue of each executor.

Finally, one noticeable change is that the second int parameter in the API, maxPendingTask, now influences EACH executor's queue instead of being a shared counter. It should be safe in the sense that a number set with the previous version in mind is bound to be over-dimensioned for the new version, but it would be recommended for users to reconsider that number.

Reviewed-in: #2040
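The selection order described above can be sketched in plain Java. All names here are illustrative, not Reactor's actual internals, and the sketch ignores thread safety and worker disposal (in the real scheduler, an executor's Worker count would drop back toward zero as operators dispose their Workers, making it "idle" again):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class WorkerPicker {

    static final class BackedExecutor implements Comparable<BackedExecutor> {
        final int id;
        int workerCount; // number of Workers (operators) pinned to this executor

        BackedExecutor(int id) { this.id = id; }

        @Override
        public int compareTo(BackedExecutor other) {
            // Fewer pinned Workers means less contention: prefer those first.
            return Integer.compare(workerCount, other.workerCount);
        }

        @Override
        public String toString() { return "executor-" + id + "(workers=" + workerCount + ")"; }
    }

    final int threadCap;
    final List<BackedExecutor> all = new ArrayList<>();

    WorkerPicker(int threadCap) { this.threadCap = threadCap; }

    BackedExecutor pick() {
        if (all.size() < threadCap) {         // 1. cap not reached: create a new executor
            BackedExecutor fresh = new BackedExecutor(all.size());
            all.add(fresh);
            fresh.workerCount++;
            return fresh;
        }
        // 2./3. an idle executor has workerCount 0 and naturally sorts first;
        // otherwise this yields the least-loaded busy executor.
        PriorityQueue<BackedExecutor> byLoad = new PriorityQueue<>(all);
        BackedExecutor least = byLoad.poll();
        least.workerCount++;
        return least;
    }

    public static void main(String[] args) {
        WorkerPicker picker = new WorkerPicker(2);
        System.out.println(picker.pick()); // executor-0(workers=1)
        System.out.println(picker.pick()); // executor-1(workers=1)
        System.out.println(picker.pick()); // least-loaded of the two, now workers=2
    }
}
```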
@vy I've modified the implementation of
The following code, containing two `subscribeOn` calls, never completes. The issue is unique to `BoundedElasticScheduler`: other bounded schedulers do not show this behaviour. If one `subscribeOn` is removed, then it does run to completion.

Expected Behavior
As per https://projectreactor.io/docs/core/release/reference/#_the_subscribeon_method, the code above should therefore be equivalent to having a single `subscribeOn`.

Actual Behavior
The code never terminates. A stack dump shows it is awaiting a `CountDownLatch`.

Steps to Reproduce
The following unit test illustrates two bounded schedulers which do work as expected (`fromExecutor`, `fromSingle`), and the one which doesn't (`newBoundedElastic`).

Possible Solution
Not researched in detail; however, it appears that having n `subscribeOn` calls in a chain, subscribing with a scheduler whose `threadCap` is n-1, triggers the issue.

Your Environment
- Reactor version(s) used: io.projectreactor:reactor-core:3.3.1.RELEASE
- JVM version (`java -version`):
  openjdk version "11.0.2" 2019-01-15
  OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
  OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
- OS and version (e.g. `uname -a`): macOS