
Make No. of Transport Threads == Available CPUs #56488

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented May 9, 2020

We never do any file IO or other blocking work on the transport threads, so no tangible benefit can be derived from using more threads than CPUs for IO.
There are, however, significant downsides to using more threads than necessary with Netty in particular. Since we use the default setting for `io.netty.allocator.useCacheForAllThreads`, which is `true`, we end up using up to `16MB` of thread-local buffer cache for each transport thread (as Tim points out, this is the upper bound / worst-case scenario, not a fixed number).
Meaning we potentially waste up to 2 * CPUs * 16MB of heap across both the tcp and http transports.
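
For illustration, a minimal sketch of the sizing argued for here, assuming Netty's `NioEventLoopGroup`; the class and variable names are hypothetical and this is not the actual Elasticsearch change:

```java
import io.netty.channel.nio.NioEventLoopGroup;

// Hypothetical sketch: size each transport's worker group to the CPUs visible
// to the JVM instead of a multiple of them.
public class TransportThreadSizing {
    public static void main(String[] args) {
        int workerCount = Runtime.getRuntime().availableProcessors();

        // One worker group per transport (tcp and http). Every worker thread that
        // allocates from Netty's pooled allocator may hold a thread-local buffer
        // cache, so fewer threads means a smaller worst-case cache footprint.
        NioEventLoopGroup tcpWorkers = new NioEventLoopGroup(workerCount);
        NioEventLoopGroup httpWorkers = new NioEventLoopGroup(workerCount);

        // The Netty default discussed above; prints null unless the property was
        // overridden explicitly on the command line.
        System.out.println(System.getProperty("io.netty.allocator.useCacheForAllThreads"));

        tcpWorkers.shutdownGracefully();
        httpWorkers.shutdownGracefully();
    }
}
```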
@original-brownbear original-brownbear added :Distributed/Network Http and internode communication implementations team-discuss labels May 9, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Network)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label May 9, 2020
@Tim-Brooks
Contributor

This is a discuss, so I'm sure we will discuss it. I'm on board with this, and I will also bring back and update the shared event loops PR.

we end up using 16MB of thread local buffer cache for each transport thread.

These chunks are allocated in arenas, and the number of arenas is by default the number of CPUs. The setting `useCacheForAllThreads` impacts caching recently used `ByteBuf`s thread-locally. Fewer threads mean that the allocations (usually 32KB for TLS or 64KB socket reads) from the arena will see more efficient reuse in the cache, but it is only related to the number of chunks allocated through a level of indirection.
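
As a rough sketch of the interaction described above, assuming Netty 4.1's `PooledByteBufAllocator` accessors (the surrounding snippet is illustrative, not Elasticsearch code):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class AllocatorDefaults {
    public static void main(String[] args) {
        // The arena count tracks the CPU count by default, so chunk allocation
        // itself does not grow with extra transport threads.
        System.out.println("direct arenas: " + PooledByteBufAllocator.defaultNumDirectArena());

        // Each allocating thread may additionally keep a thread-local cache of
        // recently used buffers when useCacheForAllThreads is true (the default).
        System.out.println("cache for all threads: " + PooledByteBufAllocator.defaultUseCacheForAllThreads());
        System.out.println("normal cache size: " + PooledByteBufAllocator.defaultNormalCacheSize());

        // A transport-sized allocation (e.g. a 64KB socket read buffer); with
        // fewer threads these buffers are reused more effectively per thread.
        ByteBuf readBuffer = PooledByteBufAllocator.DEFAULT.directBuffer(64 * 1024);
        readBuffer.release();
    }
}
```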

@original-brownbear
Member Author

Thanks Tim, updated the OP with your input. The 16MB/thread is the upper bound (if getting really unlucky), not a set number.
Now that #46346 is merged this change is less impactful, but I think the same arguments still apply.

@original-brownbear
Member Author

We discussed this during our team meeting and since there were no objections to doing this I'm removing the discuss label.

@original-brownbear
Member Author

@elasticmachine update branch

Contributor

@Tim-Brooks Tim-Brooks left a comment


LGTM

@original-brownbear
Member Author

Thanks Tim!

@original-brownbear original-brownbear merged commit c98ceb8 into elastic:master May 14, 2020
@original-brownbear original-brownbear deleted the halve-transport-thread-counts branch May 14, 2020 16:18
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 14, 2020
We never do any file IO or other blocking work on the transport threads
so no tangible benefit can be derived from using more threads than CPUs
for IO.
There are however significant downsides to using more threads than necessary
with Netty in particular. Since we use the default setting for
`io.netty.allocator.useCacheForAllThreads` which is `true` we end up
using up to `16MB` of thread local buffer cache for each transport thread.
Meaning we potentially waste CPUs * 16MB of heap for unnecessary IO threads, in addition to the obvious inefficiency of artificially adding extra context switches.
original-brownbear added a commit that referenced this pull request May 14, 2020
Labels
:Distributed/Network Http and internode communication implementations >enhancement Team:Distributed Meta label for distributed team v7.9.0 v8.0.0-alpha1