POC: MPSC bubble avoidance #260
base: master
Conversation
Have consumers look-ahead while spinning
Great!
I have yet to figure out some corner cases, but from what I have understood we are now losing the total ordering of offers in the multi-producer scenario. https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java has a similar behaviour, for different reasons: I don't think we have strong guarantees about it, but it is nonetheless a change that needs to be documented...
My other concern is that, while we are searching for something "near" to be consumed in place of the proper position, we risk losing the chance to consume the proper element as fast as possible when it is nearly ready to be read...
Will come to better insight after a proper "digestion" of it ;)
Once a new element
My intuition is that this wouldn't be a significant effect... at this point we are already spinning anyway, and if necessary an absolute limit could easily be imposed on the lookahead distance in case the producer index strays too far.
I don't think that relying on element visibility (consumer side) can give meaningful info about the happens-before relationship between offering threads, given that such visibility is enforced by a write-release. The only strongly consistent operation is the producer sequence increment (using a sequentially consistent compare-and-swap), so that's the only order to follow imo.
Thanks @franz1981 I agree best to let numbers do the talking re perf benefits! Re correctness I see what you mean now about weaker writes on producer side :( you're probably right and I think this has much less value if so. Maybe it could still be useful for some applications as you suggest but then it becomes a different kind of queue probably. I had event loop application in mind and strong ordering is likely important for that. Will think about the barriers some more :)
@franz1981 after reading a bit and thinking some more I'm not yet convinced that all hope is lost. If I'm not mistaken (extremely possible), where a happens-before relationship exists between offers from different threads, this will still be preserved by the ordered stores from the consumer visibility pov due to their cumulative causality. And if I'm wrong about the above:
There are several factors that make the ordering very unlikely to be preserved:
The only guarantee of putOrdered is (as enforced by the JMM) on a per-thread basis, and indeed it is not a sequentially consistent operation.
@franz1981 thanks for your patience!
Based on my (mis?)interpretation of http://gee.cs.oswego.edu/dl/html/j9mm.html the release fence should always happen prior to the actual write?
My understanding is that there's no guarantee for when a "subsequent" volatile read by another thread will see the write (in a seq. consistency sense), but once it does the H-B relationship is established. I don't think we're interested in the act of enqueuing to establish a barrier between producers, just want to ensure that if

Doesn't transitivity of H-B relationships ensure in this case that immediately after the consumer thread volatile-reads
This has some advantages:
- poll() will always succeed after the consumer reads !isEmpty()
- Slots are released sooner, allowing offers to succeed for bounded queues in some cases where they would otherwise have failed
- Resulting code is simpler

At the expense of occasionally performing work on the critical path which _might not_ always have happened on the critical path before.
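The first guarantee above is what an event-loop consumer would lean on: once `!isEmpty()` is observed, the next `poll()` cannot return null. A minimal sketch of such a drain loop (using `ConcurrentLinkedQueue` purely as a stand-in for the patched MPSC queue; this is not the PR's code):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public final class DrainLoop {
    // Drain everything currently visible; with the bubble-skipping change,
    // poll() after a !isEmpty() observation is guaranteed non-null for the
    // single consumer, so no null-check/retry dance is needed.
    public static int drain(Queue<Integer> q) {
        int handled = 0;
        while (!q.isEmpty()) {
            Integer e = q.poll(); // cannot be null under the new guarantee
            handled++;
        }
        return handled;
    }
}
```

With the original semantics the `poll()` above could return null despite the emptiness check, forcing the awkward double-poll pattern discussed later in this thread.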
I understand your point, but I'm not so sure users won't implicitly rely on such total ordering between "producers". The cas (which is not just a write-release) and the ordered consume ensured by the original implementation allow the producer sequence progress to be totally ordered with all the synchronization actions (according to what sequential consistency means) expressed in the program, and so the offer/poll pair too.
Tomorrow I hope to have some time to find a good example of implicit user expectations: off the top of my head, right now, I imagine a graph of workers using different queues to coordinate processing through them, but I'm not sure it will lead me to what I am looking for :)
Thanks @franz1981 I really appreciate your thoughts/feedback. I'm still not quite on the same page though... it's likely due to incomplete understanding on my part, but it still seems to me that all of the "meaningful" concurrency/visibility semantics should be preserved. I don't think the producer cas ordering has significance in this context since it's already a race if two producers call

Imagine producer A calls

To put it another way - we can't care about sequential consistency w.r.t. atomic producer index increments, only w.r.t. "full" invocations of the

But of course it would be great to come up with an example that disproves this, i.e. where this change introduces a race condition which wasn't there before (apart from where bounded capacity is used in a precise way for coordination ... that could obviously break down).
I'm missing this point: the cas's purpose is that it cannot succeed with the same values for both producers; it serializes the producer sequence progress. I could be wrong, so help me to understand it, but it seems that your point is: given that the offering threads are not aware of who will win the cas, they don't know about such an order, and so the overall system doesn't care about the order of the consumed elements... makes sense?
Ping to @nitsanw, who I'm sure has already faced and solved the dilemma of consume ordering and probably has some example about it: maybe it is a concern of mine (it often happens) that just doesn't hold...
@njhill I'm learning a lot bud so I'm quite happy about your opinion as well: it is always good to dismantle any assumption, to fully understand them
Hi, @njhill and thanks for the POC code and following reasoning and discussion with @franz1981.
I think this last point is going to be very hard to solve within the API constraints.
Also note that skipping the bubble only solves half the problem. The other being the return of control from
@njhill I have carefully read the previous comments and I see now what you mean: sorry that it took me some time :) There is something yet that by intuition is just not right to me, but I'm sure that is due to a wrong intuition for something that is not intuitive at all... @nitsanw you're right and thanks for coming :)
On the MPSC array queue there's a "detailed relaxed offer":
Which differentiates between full/CAS fail/success. The relaxedPoll API is not giving us a reason for failing. In theory we could return some ERR code constant and consumers would have to do something along the lines of:
I don't think this will be very popular, but let's pretend someone wants this; they can then go down the path of meandering past the bubble via an iterator kind of API. It sounds awful, but I'm just illustrating that this is a bigger problem on the API front than in the implementation of a bubble-skipping consumer.
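A hedged sketch of what such an ERR-returning API might look like for the consumer. Nothing here exists in JCTools; `POLL_ERR`, `relaxedPollOrErr` and `BubbleAwareQueue` are illustrative names only:

```java
public final class ErrSentinelConsumer {
    // Hypothetical sentinel meaning "a producer has claimed a slot but the
    // element store is not yet visible" (a bubble), distinct from null = empty.
    static final Object POLL_ERR = new Object();

    // Hypothetical queue API returning E, null (empty) or POLL_ERR (bubble).
    interface BubbleAwareQueue<E> {
        Object relaxedPollOrErr();
    }

    // Consume until definitely empty, spinning past bubbles.
    @SuppressWarnings("unchecked")
    static <E> int consume(BubbleAwareQueue<E> q, java.util.function.Consumer<E> handler) {
        int consumed = 0;
        for (;;) {
            Object r = q.relaxedPollOrErr();
            if (r == null)
                return consumed;   // truly empty
            if (r == POLL_ERR)
                continue;          // bubble: element offered but not yet visible, retry
            handler.accept((E) r);
            consumed++;
        }
    }
}
```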
With bubble skipping do you mean the full fat optimization of this PR?
We can limit the sight of the range search (32/64 elements) while polling during bubble skipping and use some bits of the consumer index to allow size() to extract from it the number of elements consumed "out of order". It won't fix the issue on the offer side, but size would work fine.
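One way the bit-stealing idea could look, as a sketch (not JCTools code; the bit split and names are illustrative choices): reserve a few high bits of the consumer index to carry the count of elements consumed ahead of it, so size() can account for them from a single read.

```java
public final class StolenBitsIndex {
    static final int OOO_BITS = 6;                    // lookahead limited to < 64 slots
    static final long INDEX_MASK = -1L >>> OOO_BITS;  // low 58 bits: consumer index

    // Pack the out-of-order-consumed count into the high bits of the index.
    static long encode(long cIndex, int consumedAhead) {
        return ((long) consumedAhead << (64 - OOO_BITS)) | (cIndex & INDEX_MASK);
    }

    static long index(long encoded)         { return encoded & INDEX_MASK; }
    static int  consumedAhead(long encoded) { return (int) (encoded >>> (64 - OOO_BITS)); }

    // size() then subtracts the elements already consumed "out of order"
    // beyond the consumer index, fixing the consumer-side accounting.
    static long size(long pIndex, long encodedCIndex) {
        return pIndex - index(encodedCIndex) - consumedAhead(encodedCIndex);
    }
}
```

The cost is a smaller usable index range and the need to keep the lookahead distance under the stolen-bit budget, which lines up with the 32/64-element sight limit suggested above.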
```java
// to the next non-CONSUMED slot, nulling all preceding slots
spElement(buffer, offset, null);
// If no lookahead has happened then pIndex <= cIndex + 1 and we won't enter the loop
while (++cIndex < pIndex && lpElement(buffer, offset = calcElementOffset(cIndex)) == CONSUMED)
```
Try to turn this long-based loop into an int-based one, or use batch nulling (Arrays::fill? probably worse) in 2 steps, checking for array wrapping if needed. The former should be the better option.
Arrays::fill optimization is only going to be a win if this is a very big gap, which is very unlikely.
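For reference, the two-step batch-nulling alternative mentioned above could be sketched like this (illustrative names, not the PR's code), nulling the consumed range of a power-of-two circular buffer with at most two `Arrays.fill` calls, split at the wrap point:

```java
import java.util.Arrays;

public final class BatchNull {
    // Null slots for logical indices [cIndex, nextIndex) of a circular buffer
    // whose capacity is a power of two.
    static void nullRange(Object[] buffer, long cIndex, long nextIndex) {
        int mask = buffer.length - 1;
        int from = (int) (cIndex & mask);
        int to   = (int) (nextIndex & mask);
        if (nextIndex - cIndex == buffer.length) {   // whole buffer consumed
            Arrays.fill(buffer, null);
        } else if (from <= to) {
            Arrays.fill(buffer, from, to, null);     // contiguous range
        } else {                                     // range wraps: two steps
            Arrays.fill(buffer, from, buffer.length, null);
            Arrays.fill(buffer, 0, to, null);
        }
    }
}
```

As noted, for the typical tiny gap a plain loop is likely faster; the fill variant would only pay off for very large gaps.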
```java
}
else
    pIndex = lvProducerIndex();
```
I'm not sure I would query the producer index SPIN_COUNT times, but just once.
It would harm the AtomicXXX/future VarHandle version because bounds checking will be performed each time. Also, I would try to maintain control over the maximum stride/distance of checked elements to save the cache miss when checking back the first element (so I would try to stay within 2 cache lines / oops-size elements).
SPIN_COUNT is the number of iterations between producer index reads; this is to attenuate producer contention without preventing us from discovering more distant produced elements that might become visible sooner.
It would harm the AtomicXXX/future VarHandle version because bounds checking will be performed each time
Please elaborate... which bounds checking? :)
I would try to maintain control over the maximum stride/distance of checked elements
It seems like this could be sensible, but feel like it may be best to determine that empirically.
Uuups I have misread parentheses :P
Each time the producer index changes, the bounds checks injected by the JVM in the tryRange loop cannot be hoisted out to the outer loop (in pollMissPath) even if tryRange is inlined, but will be performed each time.
Although it should be the "slow path", we don't know when any offered element will appear, so making such a loop as fast as possible helps guarantee good latencies.
If tryRange won't be inlined due to its size, I suggest inlining it manually too.
These are just some "premature" optimizations, but they nonetheless seem plausible to me and easy to achieve if they don't break the algorithm
Thanks @franz1981 @nitsanw. I'm unsure how I managed to ramble on about ordering semantics for so long without realizing I was just trying to say "FIFO" :) @nitsanw that's a great summary w.r.t. the possible "gaps" and mostly aligns with my thinking.
This might be a subtlety that I missed, I don't quite follow why consumer limiting to single read of producer index would be required (presumably you mean per-
I can't see why there should be any problem with the
I expect so :) At least after the latest commit emptiness will now be correct. I had in mind to maintain a separate counter but I like @franz1981's idea of stealing some consumer index bits so that we don't need to do two stores whenever the consumer index changes. I guess it comes down to which has the least overhead/complexity.
I did not give much thought to this apart from having the same intuition that it would be quite hairy if even possible. I was hoping/expecting that many folks using bounded queues were doing so to avoid indefinite growth (e.g. form of back pressure) where the precise fullness/capacity behaviour might not be important. But like you say some API extension would be required in that case (maybe even just a constructor flag?). I will give some more thought to whether it's achievable, but expect that there may be some perf penalty if so. For unbounded queues I think this is a non-issue? (as you alluded to) Which may be good news for netty since it uses unbounded by default and I bet there's very few folks who override that.
I thought this change may have general self-contained benefit w.r.t. throughput so would be a good starting point along with similar changes to

```java
while (true) {
    E e = q.relaxedPoll();
    if (e != null) {
        // handle element
    } else if (q.isEmpty()) {
        // empty
    } else if ((e = q.relaxedPoll()) != null) {
        // handle element
    } else {
        // do something else
    }
}
```

Though a nice API addition i.m.o. would be consumer-only
Very happy about the current direction of this PR: just curious to see some numbers (and a proper reproducible benchmark) and I would love to port it to the XADD mpsc q :P
@franz1981 agree, this may yet all be academic (but still interesting!) Ideas on benchmark design to expose slow producer problem welcome. Thinking large number of producers and some small artificial delay on consumer side per consumed element... though consumer needs to keep up otherwise "miss" path will rarely be taken. And what exactly to measure... I guess would be easier to use
@njhill replying to some of the questions:
If the consumer hits a bubble after reading the producerIndex the ordering implies that at that point the producerX has incremented and is somewhere between the increment and the element store. The consumer now starts a scan after the bubble, and is suspended, the producer completes the element store and a further offer, if the consumer is unsuspended at this point and re-reads the producerIndex (without restarting the scan from the first bubble) it can end up loading the next element out of order (with respect to the producer), thus violating FIFO.
The `fill` implementation does 2 things differently:
Thanks @nitsanw. Re the first point, that makes complete sense. My confusion hopefully just stems from a misinterpretation of your original comment. When you said "I think that as long as the consumer is limiting itself to a single read of the producer index...", I took it to imply that my Re the second point around
To me this appears sufficient, but maybe the above is incorrect in some way, or maybe you have another guarantee in mind?
I still don't follow this. Interleave with other producers, sure (fine imo), but I don't see how the multi-pass algorithm could permit out-of-order consumption of a given producer's additions. Say a particular

I'm not asserting the above with utmost confidence, just sharing my reasoning with the hope you can poke a hole in it... I know your time is valuable and it's much appreciated.
I haven't digested the impls fully yet so maybe I'm wrong about this but just realized that maintaining a correct

My original disgust at the idea was based on the prospect of the producers having to wrap past the consumer to "fill in" consumed slots, not sure whether that was also in your mind @nitsanw. But if we allow extra room beyond the queue's capacity for the producer index to advance into then it just becomes a matter of some additional accounting. WDYT?
@njhill hi Nick! OT: if you want to take a look and give some feedback, I've started preparing some wiki for the XADD q at https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md
Hey @franz1981... the xadd queue looks great! I'm sure I'll have more feedback once I've had more time to absorb details of the impl. And what would be the reason to not use it in netty (at least in place of the current unbounded one)? :) My feeling is that ultimately a more specialized impl would make sense for netty with some similarities to the

Re this PR, I'd still be interested in your and @nitsanw's thoughts on the most recent questions. tl;dr:
```java
// This is called after the current cIndex slot has been read, and moves the consumer index
// to the next non-CONSUMED slot, nulling all preceding slots
spElement(buffer, offset, null);
// If no lookahead has happened then pIndex <= cIndex + 1 and we won't enter the loop
```
This is still going to make the inlining budget for normal polls higher. The common case should be in poll. Exceptional cases in separate methods.
```java
incrementConsumerIndex(buffer, offset, cIndex, seenPIndex);
    return e;
}
if (seenPIndex <= cIndex && (seenPIndex = lvProducerIndex()) == cIndex)
```
This reverts the streaming behavior of Fast Flow like algorithm to the cached variant of Lamport's algorithm (introduced by @mjpt777 ), it's a shame to lose out on that one.
@nitsanw I don't follow, hoping you misread the logic... the only thing this changes (I think) is to possibly eliminate some (volatile) reads of the producer index by the consumer, via use of a consumer-local cached value. This simpler optimization could actually be done independently of the other changes here, though it becomes more significant when combined with the look-ahead scanning.
I.e. just changing from:

```java
if (cIndex == lvProducerIndex()) { return null; }
```

to

```java
if (cIndex >= cachedProducerIndex && cIndex == lvProducerIndex()) { return null; }
```
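The consumer-side caching being described is the classic cached variant of Lamport's algorithm. A minimal self-contained sketch (field names are illustrative, not the PR's code): the volatile producer-index read is skipped whenever the previously seen value already proves elements exist.

```java
import java.util.concurrent.atomic.AtomicLong;

public final class CachedIndexConsumer {
    final AtomicLong producerIndex = new AtomicLong();
    long consumerIndex;        // single consumer: plain field
    long cachedProducerIndex;  // consumer-local, possibly stale

    boolean looksEmpty() {
        if (consumerIndex < cachedProducerIndex)
            return false;                              // stale cache already proves non-empty
        cachedProducerIndex = producerIndex.get();     // refresh with a volatile read
        return consumerIndex == cachedProducerIndex;
    }
}
```

The cache can only under-report availability, never invent elements, so correctness is unaffected; the win is fewer volatile reads (and cache-line ping-pong) on the hot consume path.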
@njhill I'll do my best:
Reading through the code I'm very uneasy with how much it complicates what was a very neat method :-(. Also not sure why the tests are disabled (sorry if this is already answered).
Thanks @nitsanw. The changes as they stand were to demonstrate the approach and I had been assuming they will require further profiling/refinement (such as isolating the hot path for inlining per your comment). The two disabled tests fail due to bounded

Thanks for those very useful suggestions re the benchmark; that's the next thing I'll focus on when I get a chance. But I'm still concerned that demonstrating performance with synthetic bubbles won't necessarily imply meaningful real-world benefit.
@njhill thanks for the effort on this, it's an interesting direction at least intellectually :-)
LGTM
@geektcp are you a bot..? Sorry for the maybe stupid question :)
ok
LGTM
I have blocked
@nitsanw @franz1981 would be interested in your thoughts on this experiment to make the array-based queues more progressive... I was playing around with it a while back after @belliottsmith's comments in netty/netty#9105, but got a bit tied up trying to design a benchmark to demonstrate improvement (so the P in "POC" is somewhat premature!) After seeing yesterday's exchange in the issue I thought I would run it by you anyhow to see if it's worth continuing.
The idea is to extend the consumer spin to scan all elements up to the highest-seen producer index, which is re-read periodically. Ordering guarantees are preserved by making a second pass once a candidate element is identified. A sentinel element is used to mark consumed slots beyond the current consumer index.
size() is no longer necessarily correct, but I think that's not so important for message passing / event loop applications, and it could probably be repaired in various ways like the consumer maintaining a separate consumption count. For bounded queues capacity might temporarily appear to be reduced (down to 1 in pathological case).

Other comments:
- relaxedPoll() could have similar treatment where a consumer-local count of poll failures since last success is maintained and used as the basis for (re-)checking the producer index
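The scan-ahead with a second ordering pass described above can be condensed into a single-threaded toy sketch (definitely not the PR's code; it ignores wrapping, memory ordering and the producer side). Pass 1 finds the first visible element at or beyond the consumer index; if that involved skipping slots, pass 2 re-scans the skipped range so an element that became visible in an earlier slot in the meantime is still taken first. Out-of-order consumed slots would be marked with the CONSUMED sentinel.

```java
public final class LookaheadScan {
    static final Object CONSUMED = new Object(); // sentinel for slots consumed out of order

    // Returns the index of the slot to consume next, or -1 if nothing is visible
    // in [cIndex, pIndex).
    static int findNext(Object[] buffer, int cIndex, int pIndex) {
        int candidate = -1;
        for (int i = cIndex; i < pIndex; i++) {       // pass 1: first visible slot
            Object e = buffer[i];
            if (e != null && e != CONSUMED) { candidate = i; break; }
        }
        if (candidate <= cIndex)
            return candidate;                          // nothing skipped: no 2nd pass needed
        for (int i = cIndex; i < candidate; i++) {     // pass 2: did anything appear earlier?
            Object e = buffer[i];
            if (e != null && e != CONSUMED) return i;
        }
        return candidate;
    }
}
```

In the real concurrent setting pass 2 is what preserves the ordering guarantee: between the two passes a producer's store to an earlier (skipped) slot may have become visible, and the re-scan ensures it is consumed before the later candidate.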