Non Blocking ReservedThreadExecutor #6495 #6496
Conversation
@gregw
@sbordet yeah, seen that; I've already fixed it in my 10 branch (which uses AtomicBiInteger), but I'm not putting that forward until we know more.
@sbordet @lorban @lachlan-roberts I've updated the safety net so that it waits 1s for the reserved thread to arrive, after which time it will abort it, so that if it ever does start then it will eventually exit.
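The timed safety net described above could be sketched roughly as follows. This is a hypothetical illustration, not Jetty's actual implementation: the class name, fields, and the use of SynchronousQueue are all invented for the sketch. The offering side waits up to 1s for the reserved thread to arrive at the rendezvous; on timeout it flags the thread as aborted, so a late-starting reserved thread sees the flag and exits instead of hanging forever.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a timed rendezvous with an abort safety net.
public class TimedRendezvous
{
    private final SynchronousQueue<Runnable> exchange = new SynchronousQueue<>();
    private final AtomicBoolean aborted = new AtomicBoolean();

    // Called by the offering thread: wait up to 1s for the reserved
    // thread to arrive, otherwise mark it aborted and give up.
    public boolean offer(Runnable task)
    {
        try
        {
            if (exchange.offer(task, 1, TimeUnit.SECONDS))
                return true;
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        aborted.set(true); // safety net: the reserved thread never arrived
        return false;
    }

    // Body of the reserved thread: if it ever does start after being
    // aborted, it notices the flag and eventually exits.
    public void run()
    {
        try
        {
            while (!aborted.get())
            {
                Runnable task = exchange.poll(100, TimeUnit.MILLISECONDS);
                if (task != null)
                    task.run();
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
    }
}
```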
Never block waiting for reserved thread. Dump reserved thread stack if it does not make the rendezvous
@lorban @sbordet I don't think we will find out more about this issue just by waiting. So I propose that we merge this defensive fix, which will at least log some information if we hit the situation.
@gregw This branch should be rebased against the latest 9.4.x branch to make sure results of the stress test are as comparable as possible, as currently building this branch generates a
…95-ReservedThreadExecutor
@lorban done
fix warnings
@sbordet thoughts?
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@gregw no idea. I tried to run the branch with @lorban's perf test, but no dice: it always passed. Spinning only 100 times before giving up is too little; I got many failures that went away completely when I slept 1 second. I think we can record the max and wait double that, with a cap, until we find the real reason for the failure.
avoid waiting 1s
@sbordet not keen on blocking for 1s as that may have a significant impact on the offering thread. How about the alternative I just pushed:
If a reserved thread does not rendezvous, put it back on the bottom of the stack.
Even more paranoid checks: + test if a missed offer was for a thread that removed itself from the stack + test if a reserved thread is already on the stack before adding it.
No false positives in this run: https://jenkins.webtide.net/job/experiments/job/jetty-perf-9.4.x/20/consoleFull
Even more paranoid checks: + try to poll after remove to check for race in remove
No false positives on a longer run: https://jenkins.webtide.net/job/experiments/job/jetty-perf-9.4.x/21/consoleFull
This is a new approach to #6495. A call to offer must never block, nor even yield, since to do so gives an opportunity for the allocated CPU core to change, defeating the whole purpose of the class. There is also a reasonable level of diagnostic warnings if a reserved thread misses too many offers consecutively, based on tracking the state of the reserved thread.
@lorban @sbordet @lachlan-roberts I have changed this PR significantly. I think we went wrong by allowing an offer to a reserved thread to ever block or even yield. The whole reason for the reserved thread pool is to allow the EPC strategy to work, but if we block or yield the strategy thread calling tryExecute, then there is a reasonable chance that when it continues execution it will do so on a different CPU core, thus defeating the whole point of EPC and RTE! The offer must never block or yield: either there is a reserved thread there ready to go, or there is not. No waiting or spinning!

Instead, this PR now counts consecutive misses in offer, and if the reserved thread misses more than 10 times in a row we will kill it and generate a warning. The PR also keeps track of the state of the reserved thread, so it can give some better warnings if strange cases are seen (without going overboard with defensive code). Finally, I have also fixed the race between size and pending by keeping them in a single atomic.

I think this is a good improvement on the current RTE and we should review it for merging into the next releases, as it fixes some potential issues; has a bit more diagnostics; will never block forever; and doesn't thwart the intent of EPC.
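The non-blocking offer with a consecutive-miss counter could be sketched as below. This is an illustrative sketch only: the class, field names, and the SynchronousQueue-based rendezvous are invented for the example and are not Jetty's actual code. The key property is that tryExecute never waits or spins, so the EPC strategy thread is never taken off its CPU core; a reserved thread that misses too many rendezvous in a row is simply told to exit.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a strictly non-blocking offer that counts
// consecutive misses and kills the reserved thread after too many.
public class NonBlockingOffer
{
    private static final int MAX_MISSES = 10;

    private final SynchronousQueue<Runnable> exchange = new SynchronousQueue<>();
    private final AtomicInteger consecutiveMisses = new AtomicInteger();
    private volatile boolean stopReservedThread;

    // Called by the strategy thread. SynchronousQueue.offer(E) without a
    // timeout never waits: it succeeds only if a reserved thread is
    // already parked in poll(), otherwise it returns false immediately.
    public boolean tryExecute(Runnable task)
    {
        if (exchange.offer(task))
        {
            consecutiveMisses.set(0);
            return true;
        }
        if (consecutiveMisses.incrementAndGet() > MAX_MISSES)
            stopReservedThread = true; // kill the misbehaving reserved thread
        return false;
    }

    // Checked by the reserved thread's loop to decide whether to exit.
    public boolean shouldStop()
    {
        return stopReservedThread;
    }
}
```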
These changes add a lot of complexity, and I have a feeling that we should take a bit more time to try to think of a simpler solution.
For instance, I'm starting to doubt the usefulness of the pending and size counters and wonder if we could remove them entirely.
@lorban can you give me specific feedback? We really need to be careful with the size and pending count because correct operation of RTE is vital to avoid deadlocks in the nested scheduling strategies. Going over quota by just 1 thread can be fatal if there are no more threads. I don't think it is that complex keeping pending and size in a single atomic. It needs careful review, but ultimately it is not rocket science.

The other important change here is to never block or yield the offering thread. That is a simplification of the current code. The extra complexity is then just in tracking some state of the reserved thread, so we can make a bit better judgement about what to do if we miss a rendezvous. It is not vital code, as meeting the rendezvous is just an optimisation to make EPC work. If there is a bug in the state tracking code, then at worst we give ourselves some confusing diagnostic information.
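The "pending and size in a single atomic" idea can be sketched with two ints packed into one AtomicLong, so both counters are read and updated together with a single CAS. Jetty has AtomicBiInteger for exactly this; the hand-rolled class below is a minimal illustrative equivalent, with invented names, not the PR's actual code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of an AtomicBiInteger-style counter: the pending
// count lives in the high 32 bits and the size in the low 32 bits, so
// the pair can never be observed or updated inconsistently.
public class BiCounter
{
    private final AtomicLong count = new AtomicLong();

    private static long encode(int hi, int lo)
    {
        return ((long)hi << 32) | (lo & 0xFFFF_FFFFL);
    }

    public int getPending()
    {
        return (int)(count.get() >> 32);
    }

    public int getSize()
    {
        return (int)count.get();
    }

    // Atomically add deltas to both halves in one CAS loop, e.g.
    // add(-1, 1) when a pending thread becomes a reserved one.
    public void add(int deltaHi, int deltaLo)
    {
        long current, next;
        do
        {
            current = count.get();
            int hi = (int)(current >> 32);
            int lo = (int)current;
            next = encode(hi + deltaHi, lo + deltaLo);
        }
        while (!count.compareAndSet(current, next));
    }
}
```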
@@ -224,7 +227,7 @@ public boolean tryExecute(Runnable task)
        return false;
    }

-   int size = _size.decrementAndGet();
+   int size = _count.addAndGetLo(-1);
This can make the size counter go negative. When this happens, the reserved threads will stop executing any further tasks, as ReservedThread.run() has the following check:

if (size >= 0 && size < _capacity)
    size++;
else
    exit = true;

so there's no way to make the counter go positive ever again, and any value < 0 is interpreted as "the pool is stopping".
// We failed to offer the task, so put this thread back on the stack in last position to give it time to arrive.
if (LOG.isDebugEnabled())
    LOG.debug("{} offer missed {} {}", this, task, ReservedThreadExecutor.this);
_count.add(0, -1);
This can also make size go below 0 and trigger the "stop executing anything on the ReservedThreads" bug.
Here is a test that reproduces the problem I spotted:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThan;
import static org.hamcrest.Matchers.is;

public class ReservedThreadExecutorStressTest
{
    private static final int THREADS = 2;
    private static final int EXECUTIONS = 100;

    private ExecutorService _testExecutor;
    private ExecutorService _poolExecutor;
    private ReservedThreadExecutor _reservedExecutor;

    @BeforeEach
    public void before() throws Exception
    {
        _testExecutor = Executors.newFixedThreadPool(THREADS);
        _poolExecutor = Executors.newCachedThreadPool();
        _reservedExecutor = new ReservedThreadExecutor(_poolExecutor, 10);
        _reservedExecutor.start();
    }

    @AfterEach
    public void after() throws Exception
    {
        _reservedExecutor.stop();
        _poolExecutor.shutdownNow();
        _testExecutor.shutdownNow();
    }

    @Test
    public void testTryExecuteDoesNotStopWorking() throws Exception
    {
        for (int i = 0; i < 100; i++)
        {
            long executions = loop(EXECUTIONS);
            assertThat(executions, greaterThan(0L));
        }
    }

    private long loop(int executions) throws InterruptedException
    {
        AtomicLong execCount = new AtomicLong();
        AtomicLong noExecCount = new AtomicLong();
        CountDownLatch latch = new CountDownLatch(THREADS * executions);
        for (int t = 0; t < THREADS; t++)
        {
            _testExecutor.submit(() ->
            {
                for (int i = 0; i < executions; i++)
                {
                    if (!_reservedExecutor.tryExecute(() ->
                    {
                        execCount.incrementAndGet();
                        latch.countDown();
                    }))
                    {
                        noExecCount.incrementAndGet();
                        latch.countDown();
                    }
                }
            });
        }
        assertThat(latch.await(10, TimeUnit.SECONDS), is(true));
        System.err.println("executions: " + execCount + " rejections: " + noExecCount);
        return execCount.get();
    }
}
Use MAX_VALUE rather than -1 as the stopped marker value.
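The reasoning behind the commit message above can be made concrete with a small sketch (hypothetical code, not the actual change): racing decrements can legitimately drive the size counter below zero for a moment, so a negative value is ambiguous between "transient race" and "stopping", while Integer.MAX_VALUE can never be produced by normal bounded increments and so is unambiguous as a stopped marker.

```java
// Hypothetical illustration of the stopped-marker choice.
public class StoppedMarker
{
    static final int STOPPED = Integer.MAX_VALUE;

    // Unambiguous: normal operation never reaches MAX_VALUE.
    static boolean isStopped(int size)
    {
        return size == STOPPED;
    }

    // Ambiguous old scheme: a transient negative size produced by a racy
    // decrement is indistinguishable from the -1 "stopping" marker.
    static boolean looksStoppedWithNegativeMarker(int size)
    {
        return size < 0;
    }
}
```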
I've come up with a significant simplification of this class that kind of avoids the whole blocking problem entirely. See #6535.
closing in favor of #6535.
Defensive changes for #6495. Avoid blocking forever.