
Non Blocking ReservedThreadExecutor #6495 #6496

Closed
wants to merge 19 commits

Conversation

gregw (Contributor) commented Jul 6, 2021

Defensive changes for #6495. Avoid blocking forever.

gregw linked an issue Jul 6, 2021 that may be closed by this pull request
sbordet (Contributor) commented Jul 6, 2021

@gregw startReservedThread() catches RejectedExecutionException but it does not decrement _pending.

gregw (Contributor Author) commented Jul 6, 2021

@sbordet yeah, I've seen that; I've already fixed it in my Jetty 10 branch (which uses AtomicBiInteger), but I'm not putting that forward until we know more.
I will fix the pending issue in this branch.

gregw (Contributor Author) commented Jul 7, 2021

@sbordet @lorban @lachlan-roberts I've updated the safety net so that it waits 1s for the reserved thread to arrive at the rendezvous; after that it aborts the thread, so that if it ever does start it will eventually exit.
Pretty horrid all in all, but the best I can think of unless we can see the actual problem.
I still think it must be a bug in ConcurrentLinkedDeque that is allowing the same entry to be both removed and polled.
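
(For illustration only: a minimal sketch of the "wait briefly, then abort" safety net described above, assuming a SynchronousQueue rendezvous and an abort flag. The BoundedRendezvous class and its method names are hypothetical, not the PR's actual code.)

import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a bounded rendezvous with an abort flag; not the PR's code.
class BoundedRendezvous
{
    private final SynchronousQueue<Runnable> _rendezvous = new SynchronousQueue<>();
    private final AtomicBoolean _aborted = new AtomicBoolean();

    /** Offering side: wait at most 1s for the reserved thread to arrive. */
    boolean offer(Runnable task) throws InterruptedException
    {
        if (_rendezvous.offer(task, 1, TimeUnit.SECONDS))
            return true;
        // The reserved thread never arrived: mark it aborted so that if it does
        // eventually start, it notices the flag and exits instead of blocking forever.
        _aborted.set(true);
        return false;
    }

    /** Reserved-thread side: poll for work, checking the abort flag periodically. */
    void reservedThreadLoop() throws InterruptedException
    {
        while (!_aborted.get())
        {
            Runnable task = _rendezvous.poll(100, TimeUnit.MILLISECONDS);
            if (task != null)
                task.run();
        }
    }
}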

Never block waiting for reserved thread.
Dump reserved thread stack if it does not make the rendezvous
@gregw gregw marked this pull request as ready for review July 12, 2021 05:28
gregw (Contributor Author) commented Jul 12, 2021

@lorban @sbordet I don't think we will find out more about this issue just by waiting. So I propose that we merge this defensive fix, which will at least log some information if we hit the situation.
I don't think there is a performance impact, but there could be some false positives, so it would be good to test in a stress-test environment. @lorban can you run this branch in your load testing and see if any warnings are produced?

gregw changed the title from "Thought bubble on ReservedThreadExecutor #6495" to "Non Blocking ReservedThreadExecutor #6495" on Jul 12, 2021
lorban (Contributor) commented Jul 12, 2021

@gregw This branch should be rebased against the latest 9.4.x branch to make sure results of the stress test are as comparable as possible, as currently building this branch generates a 9.4.43-SNAPSHOT version instead of a 9.4.44-SNAPSHOT one.

gregw (Contributor Author) commented Jul 13, 2021

@lorban done

gregw (Contributor Author) commented Jul 13, 2021

@sbordet thoughts?

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
sbordet (Contributor) commented Jul 13, 2021

@gregw no idea. I tried to run the branch with @lorban's perf test, but no dice: it always passed.

Spinning only 100 times before giving up is too little: I got many failures that went away completely when I slept for 1 second.

I think we can record the maximum observed wait and then wait double that, with a cap, until we find the real reason for the failure.

avoid waiting 1s
gregw (Contributor Author) commented Jul 14, 2021

@sbordet I'm not keen on blocking for 1s, as that may have a significant impact on the offering thread.

How about the alternative I just pushed (a rough sketch of the idea follows below):

  • we spin 1000 times trying to rendezvous
  • if we fail to rendezvous, then initially we are just put back on the stack (hmmm, maybe we should go onto the bottom of the stack?)
  • if we fail to rendezvous 10 times, then we are taken out of action
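
(Illustrative sketch only, not the PR's code: the miss-counting idea described in the list above, with a hypothetical ReservedThread shape, a SynchronousQueue standing in for the rendezvous, and illustrative constants.)

import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.SynchronousQueue;

class MissCountingSketch
{
    static final int SPIN_LIMIT = 1000;
    static final int MAX_CONSECUTIVE_MISSES = 10;

    static class ReservedThread
    {
        // The reserved thread's run loop would poll this queue and exit once stopped is set.
        final SynchronousQueue<Runnable> rendezvous = new SynchronousQueue<>();
        int consecutiveMisses;
        volatile boolean stopped;
    }

    final Deque<ReservedThread> stack = new ConcurrentLinkedDeque<>();

    boolean tryExecute(Runnable task)
    {
        ReservedThread thread = stack.pollFirst();
        if (thread == null)
            return false;

        // Spin a bounded number of times trying to hand the task over; never block or yield.
        for (int i = 0; i < SPIN_LIMIT; i++)
        {
            if (thread.rendezvous.offer(task))
            {
                thread.consecutiveMisses = 0;
                return true;
            }
        }

        // Missed the rendezvous: retire the thread after too many misses in a row,
        // otherwise put it back on the bottom of the stack so it has time to arrive.
        if (++thread.consecutiveMisses >= MAX_CONSECUTIVE_MISSES)
            thread.stopped = true;
        else
            stack.offerLast(thread);
        return false;
    }
}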

If a reserved thread does not rendezvous, put it back on the bottom of the stack.
gregw (Contributor Author) commented Jul 14, 2021

@sbordet @lorban I changed it so that a reserved thread that misses an offer is put back onto the bottom of the stack, so it may have more time to be picked again.
@lorban can you stress test this one?
I'm going to have to adjust my working hours so we can overlap more, as I'm currently missing both the morning and evening overlaps!

Even more paranoid checks:
 + test if a missed offer was for a thread that removed itself from the stack
 + test if a reserved thread is already on the stack before adding it.
gregw (Contributor Author) commented Jul 16, 2021

Even more paranoid checks:
 + try to poll after remove to check for race in remove
gregw (Contributor Author) commented Jul 16, 2021

No false positives on a longer run: https://jenkins.webtide.net/job/experiments/job/jetty-perf-9.4.x/21/consoleFull

Even more paranoid checks:
 + try to poll after remove to check for race in remove
This is a new approach to #6495. A call to offer must never block, nor even yield, since doing so gives an opportunity for the allocated CPU core to change, defeating the whole purpose of the class.

There is also a reasonable level of diagnostic warnings if a reserved thread misses too many offers consecutively, based on tracking the state of the reserved thread.
gregw (Contributor Author) commented Jul 22, 2021

@lorban @sbordet @lachlan-roberts I have changed this PR significantly.

I think we went wrong by allowing an offer to a reserved thread to ever block or even yield. The whole reason for the reserved thread pool is to allow the EPC strategy to work, but if we block or yield the strategy thread calling tryExecute, then there is a reasonable chance that when it continues execution it will do so on a different CPU core, thus defeating the whole point of EPC and RTE!!!

The offer must never block or yield. Either there is a reserved thread there ready to go, or there is not. No waiting or spinning!!!

Instead, this PR now counts consecutive misses in offer, and if the reserved thread misses more than 10 times in a row we kill it and generate a warning. The PR also keeps track of the state of the reserved thread, so it can give some better warnings if strange cases are seen (without going overboard with defensive code).

Finally, I have also fixed the race between size and pending by using an AtomicBiInteger.

I think this is a good improvement on the current RTE and we should review it for merging into the next releases, as it fixes some potential issues; has a bit more diagnostics; will never block forever; and doesn't thwart the intent of EPC.
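
(For illustration: the general technique behind AtomicBiInteger — packing two int counters into the halves of one AtomicLong so that pending and size can be read and updated together atomically. This sketch shows the idea only; it is not Jetty's AtomicBiInteger nor the PR's code, and all names are hypothetical.)

import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch: "pending" and "size" packed into the hi and lo halves of one long,
// so the two counters can never race against each other.
class PackedCounters
{
    private final AtomicLong _count = new AtomicLong(); // hi = pending, lo = size

    static long encode(int pending, int size)
    {
        return ((long)pending << 32) | (size & 0xFFFF_FFFFL);
    }

    static int pending(long encoded) { return (int)(encoded >> 32); }
    static int size(long encoded)    { return (int)encoded; }

    /** Atomically move one thread from pending to size (it has started and become reserved). */
    void pendingStarted()
    {
        while (true)
        {
            long encoded = _count.get();
            long updated = encode(pending(encoded) - 1, size(encoded) + 1);
            if (_count.compareAndSet(encoded, updated))
                return;
        }
    }
}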

lorban (Contributor) left a comment

These changes add a lot of complexity, and I have a feeling that we should take a bit more time to try to think of a simpler solution.

For instance, I'm starting to doubt the usefulness of the pending and size counters and wonder if we could remove them entirely.

gregw (Contributor Author) commented Jul 22, 2021

@lorban can you give me specific feedback?

We really need to be careful with the size and pending count because correct operation of RTE is vital to avoid deadlocks in the nested scheduling strategies. Going over quota by just 1 thread can be fatal if there are no more threads.

I don't think keeping pending and size in a single atomic is that complex. It needs careful review, but ultimately it is not rocket science.

The other important change here is to never block or yield the offering thread. That is a simplification of the current code. The extra complexity is then just in tracking some state of the reserved thread, so we can make a somewhat better judgement about what to do if we miss a rendezvous. It is not vital code, as meeting the rendezvous is just an optimisation to make EPC work. If there is a bug in the state-tracking code, then at worst we give ourselves some confusing diagnostic information.

@@ -224,7 +227,7 @@ public boolean tryExecute(Runnable task)
             return false;
         }

-        int size = _size.decrementAndGet();
+        int size = _count.addAndGetLo(-1);
lorban (Contributor) commented:

This can make the size counter go negative. When this happens, the reserved threads will stop executing any further tasks, as ReservedThread.run() has the following check:

if (size >= 0 && size < _capacity)
    size++;
else
    exit = true;

so there's no way to make the counter go positive ever again and any value < 0 is interpreted as "the pool is stopping".
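
(One possible guard against the negative-size problem described above, shown for illustration only; it is not the fix the PR adopted, which instead used MAX_VALUE as the stopped marker. The GuardedSize helper is hypothetical.)

import java.util.concurrent.atomic.AtomicInteger;

// Decrement via a CAS loop that refuses to take the size counter below zero,
// so a lost race cannot be misread as "the pool is stopping".
final class GuardedSize
{
    static int decrementIfPositive(AtomicInteger size)
    {
        while (true)
        {
            int current = size.get();
            if (current <= 0)
                return current;                  // refuse to go negative; caller treats it as a miss
            if (size.compareAndSet(current, current - 1))
                return current - 1;
        }
    }
}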

// We failed to offer the task, so put this thread back on the stack in last position to give it time to arrive.
if (LOG.isDebugEnabled())
    LOG.debug("{} offer missed {}", this, task, ReservedThreadExecutor.this);
_count.add(0, -1);
lorban (Contributor) commented Jul 22, 2021

This can also make size go below 0 and trigger the "stop executing anything on the ReservedThreads" bug.

lorban (Contributor) commented Jul 22, 2021

Here is a test that reproduces the problem I spotted:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.eclipse.jetty.util.thread.ReservedThreadExecutor;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.greaterThan;
import static org.hamcrest.Matchers.is;

public class ReservedThreadExecutorStressTest
{
    private static final int THREADS = 2;
    private static final int EXECUTIONS = 100;

    private ExecutorService _testExecutor;
    private ExecutorService _poolExecutor;
    private ReservedThreadExecutor _reservedExecutor;

    @AfterEach
    public void after() throws Exception
    {
        _reservedExecutor.stop();
        _poolExecutor.shutdownNow();
        _testExecutor.shutdownNow();
    }

    @BeforeEach
    public void before() throws Exception
    {
        _testExecutor = Executors.newFixedThreadPool(THREADS);
        _poolExecutor = Executors.newCachedThreadPool();
        _reservedExecutor = new ReservedThreadExecutor(_poolExecutor, 10);
        _reservedExecutor.start();
    }

    @Test
    public void testTryExecuteDoesNotStopWorking() throws Exception
    {
        for (int i = 0; i < 100; i++)
        {
            long executions = loop(EXECUTIONS);
            assertThat(executions, greaterThan(0L));
        }
    }

    private long loop(int executions) throws InterruptedException
    {
        AtomicLong execCount = new AtomicLong();
        AtomicLong noExecCount = new AtomicLong();
        CountDownLatch latch = new CountDownLatch(THREADS * executions);

        for (int t = 0; t < THREADS; t++)
        {
            _testExecutor.submit(() ->
            {
                for (int i = 0; i < executions; i++)
                {
                    if (!_reservedExecutor.tryExecute(() ->
                    {
                        execCount.incrementAndGet();
                        latch.countDown();
                    }))
                    {
                        noExecCount.incrementAndGet();
                        latch.countDown();
                    }
                }
            });
        }

        assertThat(latch.await(10, TimeUnit.SECONDS), is(true));
        System.err.println("executions: " + execCount + " rejections: " + noExecCount);

        return execCount.get();
    }
}

Use MAX_VALUE rather than -1 as the stopped marker value.
gregw (Contributor Author) commented Jul 23, 2021

I've come up with a significant simplification of this class that avoids the whole blocking problem entirely. See #6535.
I won't close this one until I get some feedback.

gregw (Contributor Author) commented Jul 24, 2021

Closing in favor of #6535.

@gregw gregw closed this Jul 24, 2021
@joakime joakime deleted the jetty-9.4.x-6495-ReservedThreadExecutor branch August 2, 2021 14:51
Successfully merging this pull request may close these issues:
ReservedThreadExecutor blocked in tryExecute