
Non Blocking ReservedThreadExecutor #6495 #6496

Closed
wants to merge 19 commits

Conversation

gregw (Contributor) commented Jul 6, 2021

Defensive changes for #6495. Avoid blocking forever.

gregw linked an issue Jul 6, 2021 that may be closed by this pull request
sbordet (Contributor) commented Jul 6, 2021

@gregw startReservedThread() catches RejectedExecutionException but it does not decrement _pending.

gregw (Contributor Author) commented Jul 6, 2021

@sbordet yeah, I've seen that; I've already fixed it in my Jetty 10 branch (which uses AtomicBiInteger), but I'm not putting that forward until we know more.
I will fix the pending issue in this branch.

gregw (Contributor Author) commented Jul 7, 2021

@sbordet @lorban @lachlan-roberts I've updated the safety net so that it waits 1s for the reserved thread to arrive at the rendezvous; after that it aborts the thread, so that if it ever does start it will eventually exit.
Pretty horrid all in all, but the best I can think of unless we can see the actual problem.
I still think it must be a bug in ConcurrentLinkedDeque that is allowing the same entry to be both removed and polled.
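
(For illustration only: a minimal sketch of the "wait briefly, then abort" safety net described above, assuming a SynchronousQueue rendezvous and an abort flag. The BoundedRendezvous class and its method names are hypothetical, not the PR's actual code.)

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a bounded rendezvous with an abort flag; not the PR's code.
class BoundedRendezvous
{
    private final SynchronousQueue<Runnable> _rendezvous = new SynchronousQueue<>();
    private final AtomicBoolean _aborted = new AtomicBoolean();

    /** Offering side: wait at most 1s for the reserved thread to arrive. */
    boolean offer(Runnable task) throws InterruptedException
    {
        if (_rendezvous.offer(task, 1, TimeUnit.SECONDS))
            return true;
        // The reserved thread never arrived: mark it aborted so that if it does
        // eventually start, it notices the flag and exits instead of blocking forever.
        _aborted.set(true);
        return false;
    }

    /** Reserved-thread side: poll for work, checking the abort flag periodically. */
    void reservedThreadLoop() throws InterruptedException
    {
        while (!_aborted.get())
        {
            Runnable task = _rendezvous.poll(100, TimeUnit.MILLISECONDS);
            if (task != null)
                task.run();
        }
    }
}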

Never block waiting for reserved thread.
Dump reserved thread stack if it does not make the rendezvous
@gregw gregw marked this pull request as ready for review July 12, 2021 05:28
gregw (Contributor Author) commented Jul 12, 2021

@lorban @sbordet I don't think we will find out more about this issue just by waiting. So I propose that we merge this defensive fix, which will at least log some information if we hit the situation.
I don't think there is a performance impact, but there could be some false positives, so it would be good to test in a stress-test environment. @lorban can you run this branch in your load testing and see if any warnings are produced?

gregw changed the title from "Thought bubble on ReservedThreadExecutor #6495" to "Non Blocking ReservedThreadExecutor #6495" on Jul 12, 2021
lorban (Contributor) commented Jul 12, 2021

@gregw This branch should be rebased against the latest 9.4.x branch to make sure results of the stress test are as comparable as possible, as currently building this branch generates a 9.4.43-SNAPSHOT version instead of a 9.4.44-SNAPSHOT one.

gregw (Contributor Author) commented Jul 13, 2021

@lorban done

gregw (Contributor Author) commented Jul 13, 2021

@sbordet thoughts?

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
sbordet (Contributor) commented Jul 13, 2021

@gregw no idea. I tried to run the branch with @lorban's perf test, but no dice: it always passed.

Spinning only 100 times before giving up is too little: I got many failures that went away completely when I slept for 1 second.

I think we can record the maximum observed wait and then wait double that, with a cap, until we find the real reason for the failure.

avoid waiting 1s
gregw (Contributor Author) commented Jul 14, 2021

@sbordet I'm not keen on blocking for 1s, as that may have a significant impact on the offering thread.

How about the alternative I just pushed (a rough sketch of the idea follows below):

  • we spin 1000 times trying to rendezvous
  • if we fail to rendezvous, then initially we are just put back on the stack (hmmm, maybe we should go onto the bottom of the stack?)
  • if we fail to rendezvous 10 times, then we are taken out of action
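
(Illustrative sketch only, not the PR's code: the miss-counting idea described in the list above, with a hypothetical ReservedThread shape, a SynchronousQueue standing in for the rendezvous, and illustrative constants.)

import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.SynchronousQueue;

class MissCountingSketch
{
    static final int SPIN_LIMIT = 1000;
    static final int MAX_CONSECUTIVE_MISSES = 10;

    static class ReservedThread
    {
        // The reserved thread's run loop would poll this queue and exit once stopped is set.
        final SynchronousQueue<Runnable> rendezvous = new SynchronousQueue<>();
        int consecutiveMisses;
        volatile boolean stopped;
    }

    final Deque<ReservedThread> stack = new ConcurrentLinkedDeque<>();

    boolean tryExecute(Runnable task)
    {
        ReservedThread thread = stack.pollFirst();
        if (thread == null)
            return false;

        // Spin a bounded number of times trying to hand the task over; never block or yield.
        for (int i = 0; i < SPIN_LIMIT; i++)
        {
            if (thread.rendezvous.offer(task))
            {
                thread.consecutiveMisses = 0;
                return true;
            }
        }

        // Missed the rendezvous: retire the thread after too many misses in a row,
        // otherwise put it back on the bottom of the stack so it has time to arrive.
        if (++thread.consecutiveMisses >= MAX_CONSECUTIVE_MISSES)
            thread.stopped = true;
        else
            stack.offerLast(thread);
        return false;
    }
}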

If a reserved thread does not rendezvous, put it back on the bottom of the stack.
gregw (Contributor Author) commented Jul 14, 2021

@sbordet @lorban I changed it so that a reserved thread that misses an offer is put back onto the bottom of the stack, so it may have more time to be picked again.
@lorban can you stress test this one?
I'm going to have to adjust my working hours so we can overlap more, as I'm currently missing both the morning and evening overlaps!

Even more paranoid checks:
 + test if a missed offer was for a thread that removed itself from the stack
 + test if a reserved thread is already on the stack before adding it.
gregw (Contributor Author) commented Jul 16, 2021

Even more paranoid checks:
 + try to poll after remove to check for race in remove
gregw (Contributor Author) commented Jul 16, 2021

No false positives on a longer run: https://jenkins.webtide.net/job/experiments/job/jetty-perf-9.4.x/21/consoleFull

Even more paranoid checks:
 + try to poll after remove to check for race in remove
This is a new approach to #6495. A call to offer must never block, nor even yield, since doing so gives an opportunity for the allocated CPU core to change, defeating the whole purpose of the class.

There is also a reasonable level of diagnostic warnings if a reserved thread misses too many offers consecutively, based on tracking the state of the reserved thread.
gregw (Contributor Author) commented Jul 22, 2021

@lorban @sbordet @lachlan-roberts I have changed this PR significantly.

I think we went wrong by allowing an offer to a reserved thread to ever block or even yield. The whole reason for the reserved thread pool is to allow the EPC strategy to work, but if we block or yield the strategy thread calling tryExecute, then there is a reasonable chance that when it continues execution it will do so on a different CPU core, thus defeating the whole point of EPC and RTE!!!

The offer must never block or yield. Either there is a reserved thread there ready to go, or there is not. No waiting or spinning!!!

Instead, this PR now counts consecutive misses in offer, and if the reserved thread misses more than 10 times in a row we kill it and generate a warning. The PR also keeps track of the state of the reserved thread, so it can give some better warnings if strange cases are seen (without going overboard with defensive code).

Finally, I have also fixed the race between size and pending by using an AtomicBiInteger.

I think this is a good improvement on the current RTE and we should review it for merging into the next releases, as it fixes some potential issues; has a bit more diagnostics; will never block forever; and doesn't thwart the intent of EPC.
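
(For illustration: the general technique behind AtomicBiInteger — packing two int counters into the halves of one AtomicLong so that pending and size can be read and updated together atomically. This sketch shows the idea only; it is not Jetty's AtomicBiInteger nor the PR's code, and all names are hypothetical.)

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch: "pending" and "size" packed into the hi and lo halves of one long,
// so the two counters can never race against each other.
class PackedCounters
{
    private final AtomicLong _count = new AtomicLong(); // hi = pending, lo = size

    static long encode(int pending, int size)
    {
        return ((long)pending << 32) | (size & 0xFFFF_FFFFL);
    }

    static int pending(long encoded) { return (int)(encoded >> 32); }
    static int size(long encoded)    { return (int)encoded; }

    /** Atomically move one thread from pending to size (it has started and become reserved). */
    void pendingStarted()
    {
        while (true)
        {
            long encoded = _count.get();
            long updated = encode(pending(encoded) - 1, size(encoded) + 1);
            if (_count.compareAndSet(encoded, updated))
                return;
        }
    }
}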

lorban (Contributor) left a comment

These changes add a lot of complexity, and I have a feeling that we should take a bit more time to try to think of a simpler solution.

For instance, I'm starting to doubt the usefulness of the pending and size counters and wonder if we could remove them entirely.

gregw (Contributor Author) commented Jul 22, 2021

@lorban can you give me specific feedback?

We really need to be careful with the size and pending count because correct operation of RTE is vital to avoid deadlocks in the nested scheduling strategies. Going over quota by just 1 thread can be fatal if there are no more threads.

I don't think keeping pending and size in a single atomic is that complex. It needs careful review, but ultimately it is not rocket science.

The other important change here is to never block or yield the offering thread. That is a simplification of the current code. The extra complexity is then just in tracking some state of the reserved thread, so we can make a somewhat better judgement about what to do if we miss a rendezvous. It is not vital code, as meeting the rendezvous is just an optimisation to make EPC work. If there is a bug in the state-tracking code, then at worst we give ourselves some confusing diagnostic information.

@@ -224,7 +227,7 @@ public boolean tryExecute(Runnable task)
             return false;
         }

-        int size = _size.decrementAndGet();
+        int size = _count.addAndGetLo(-1);
lorban (Contributor) commented:

This can make the size counter go negative. When this happens, the reserved threads will stop executing any further tasks, as ReservedThread.run() has the following check:

if (size >= 0 && size < _capacity)
    size++;
else
    exit = true;

so there's no way to make the counter go positive ever again and any value < 0 is interpreted as "the pool is stopping".
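
(One possible guard against the negative-size problem described above, shown for illustration only; it is not the fix the PR adopted, which instead used MAX_VALUE as the stopped marker. The GuardedSize helper is hypothetical.)

import java.util.concurrent.atomic.AtomicInteger;

// Decrement via a CAS loop that refuses to take the size counter below zero,
// so a lost race cannot be misread as "the pool is stopping".
final class GuardedSize
{
    static int decrementIfPositive(AtomicInteger size)
    {
        while (true)
        {
            int current = size.get();
            if (current <= 0)
                return current;                  // refuse to go negative; caller treats it as a miss
            if (size.compareAndSet(current, current - 1))
                return current - 1;
        }
    }
}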

// We failed to offer the task, so put this thread back on the stack in last position to give it time to arrive.
if (LOG.isDebugEnabled())
    LOG.debug("{} offer missed {}", this, task, ReservedThreadExecutor.this);
_count.add(0, -1);
lorban (Contributor) commented Jul 22, 2021

This can also make size go below 0 and trigger the "stop executing anything on the ReservedThreads" bug.

lorban (Contributor) commented Jul 22, 2021

Here is a test that reproduces the problem I spotted:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.eclipse.jetty.util.thread.ReservedThreadExecutor;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThan;
import static org.hamcrest.Matchers.is;

public class ReservedThreadExecutorStressTest
{
    private static final int THREADS = 2;
    private static final int EXECUTIONS = 100;

    private ExecutorService _testExecutor;
    private ExecutorService _poolExecutor;
    private ReservedThreadExecutor _reservedExecutor;

    @AfterEach
    public void after() throws Exception
    {
        _reservedExecutor.stop();
        _poolExecutor.shutdownNow();
        _testExecutor.shutdownNow();
    }

    @BeforeEach
    public void before() throws Exception
    {
        _testExecutor = Executors.newFixedThreadPool(THREADS);
        _poolExecutor = Executors.newCachedThreadPool();
        _reservedExecutor = new ReservedThreadExecutor(_poolExecutor, 10);
        _reservedExecutor.start();
    }

    @Test
    public void testTryExecuteDoesNotStopWorking() throws Exception
    {
        for (int i = 0; i < 100; i++)
        {
            long executions = loop(EXECUTIONS);
            assertThat(executions, greaterThan(0L));
        }
    }

    private long loop(int executions) throws InterruptedException
    {
        AtomicLong execCount = new AtomicLong();
        AtomicLong noExecCount = new AtomicLong();
        CountDownLatch latch = new CountDownLatch(THREADS * executions);

        for (int t = 0; t < THREADS; t++)
        {
            _testExecutor.submit(() ->
            {
                for (int i = 0; i < executions; i++)
                {
                    if (!_reservedExecutor.tryExecute(() ->
                    {
                        execCount.incrementAndGet();
                        latch.countDown();
                    }))
                    {
                        noExecCount.incrementAndGet();
                        latch.countDown();
                    }
                }
            });
        }

        assertThat(latch.await(10, TimeUnit.SECONDS), is(true));
        System.err.println("executions: " + execCount + " rejections: " + noExecCount);

        return execCount.get();
    }
}

Use MAX_VALUE rather than -1 as the stopped marker value.
gregw (Contributor Author) commented Jul 23, 2021

I've come up with a significant simplification of this class that avoids the whole blocking problem entirely. See #6535.
I won't close this one until I get some feedback.

gregw (Contributor Author) commented Jul 24, 2021

Closing in favor of #6535.

@gregw gregw closed this Jul 24, 2021
@joakime joakime deleted the jetty-9.4.x-6495-ReservedThreadExecutor branch August 2, 2021 14:51
Successfully merging this pull request may close these issues:
ReservedThreadExecutor blocked in tryExecute