
POC: MPSC bubble avoidance #260

Open
njhill wants to merge 4 commits into master

Conversation

@njhill (Contributor) commented Aug 20, 2019

@nitsanw @franz1981 would be interested in your thoughts on this experiment to make the array-based queues more progressive... I was playing around with it a while back after @belliottsmith's comments in netty/netty#9105, but got a bit tied up trying to design a benchmark to demonstrate improvement (so the P in "POC" is somewhat premature!) After seeing yesterday's exchange in the issue I thought I would run it by you anyhow to see if it's worth continuing.

The idea is to extend the consumer spin to scan all elements up to the highest-seen producer index, which is re-read periodically. Ordering guarantees are preserved by making a second pass once a candidate element is identified. A sentinel element is used to mark consumed slots beyond the current consumer index.
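
For concreteness, here is a minimal standalone sketch of the idea as described above. This is an editorial illustration rather than the PR's code: the class and helper names (ScanAheadMpscQueueSketch, skipConsumed) are invented, the offer side is reduced to an unconditional XADD-style claim with no full/wrap handling, and the producer index is read once per poll rather than re-read periodically during the spin.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

final class ScanAheadMpscQueueSketch<E> {
    private static final Object CONSUMED = new Object(); // sentinel marking slots consumed ahead of cIndex
    private final AtomicReferenceArray<Object> buffer;
    private final AtomicLong producerIndex = new AtomicLong();
    private final int mask;
    private long consumerIndex; // single consumer: a plain field is enough

    ScanAheadMpscQueueSketch(int capacityPow2) {
        buffer = new AtomicReferenceArray<>(capacityPow2);
        mask = capacityPow2 - 1;
    }

    // Unconditional claim for brevity; the real MPSC array queue CASes and bounds-checks.
    public void offer(E e) {
        long pIndex = producerIndex.getAndIncrement();
        buffer.lazySet((int) (pIndex & mask), e); // ordered store publishes the element
    }

    @SuppressWarnings("unchecked")
    public E poll() {
        long cIndex = consumerIndex;
        long pIndex = producerIndex.get(); // highest-seen producer index
        for (long i = cIndex; i < pIndex; i++) {
            Object e = buffer.get((int) (i & mask));
            if (e == null || e == CONSUMED) continue; // bubble or already consumed: keep scanning
            // Second pass: re-check all earlier slots; if one became visible in the
            // meantime, restart with the closer element and the now-smaller range.
            for (long j = cIndex; j < i; j++) {
                Object earlier = buffer.get((int) (j & mask));
                if (earlier != null && earlier != CONSUMED) {
                    i = j;
                    e = earlier;
                    j = cIndex - 1; // restart the re-check over the smaller range
                }
            }
            if (i == cIndex) {
                buffer.lazySet((int) (i & mask), null);
                consumerIndex = skipConsumed(cIndex + 1, pIndex);
            } else {
                buffer.lazySet((int) (i & mask), CONSUMED); // consumed beyond cIndex: leave the sentinel
            }
            return (E) e;
        }
        return null; // nothing visible up to the seen producer index
    }

    // Advance past slots already consumed out of order, nulling them as we go.
    private long skipConsumed(long cIndex, long pIndex) {
        while (cIndex < pIndex && buffer.get((int) (cIndex & mask)) == CONSUMED) {
            buffer.lazySet((int) (cIndex & mask), null);
            cIndex++;
        }
        return cIndex;
    }
}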

size() is no longer necessarily correct, but I think that's not so important for message passing / event loop applications, and it could probably be repaired in various ways, such as the consumer maintaining a separate consumption count. For bounded queues, capacity might temporarily appear to be reduced (down to 1 in the pathological case).

Other comments:

  • I didn't look in detail but assume this could be applied just as well to the unbounded / chunked / growable impls
  • relaxedPoll() could have similar treatment where a consumer-local count of poll failures since last success is maintained and used as the basis for (re-)checking the producer index
  • I haven't explored whether something like this might be applicable in the MPMC or SPMC cases, assuming it's n/a for SPSC
  • It fails one of the unit tests but I think that's because the test is based on the bounded capacity assumption which no longer holds

@coveralls commented Aug 20, 2019

Pull Request Test Coverage Report for Build 557

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 47.581%

Totals: change from base Build 551: 0.0%; covered lines: 1672; relevant lines: 3514

@franz1981 (Collaborator) commented:

Great!
I will take a look at this in the next few days (including today). A first comment, to let you have some fair numbers if you produce some bench results on this: the first long-based loop on poll should be "converted" into a counted loop, tldr int-based. It would avoid the additional safepoint poll on the happy path...
I'm very curious to see how you have handled the ordering issue with multiple producers vs a single one
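
For reference, a rough sketch of the transformation franz1981 is suggesting (method names invented). HotSpot's C2 only treats int-indexed loops as counted loops, so the int form below lets it drop the per-iteration safepoint poll that the long-indexed form keeps:

final class CountedLoopSketch {
    // long trip count: the JIT keeps a safepoint poll in the loop body
    static int scanLongIndexed(Object[] buffer, long cIndex, long pIndex, int mask) {
        int seen = 0;
        for (long i = cIndex; i < pIndex; i++) {
            if (buffer[(int) (i & mask)] != null) seen++;
        }
        return seen;
    }

    // int counted loop over the same range: recognised by C2, no per-iteration poll
    static int scanIntCounted(Object[] buffer, long cIndex, long pIndex, int mask) {
        int count = (int) (pIndex - cIndex); // bounded by queue capacity, so the narrowing is safe
        int seen = 0;
        for (int k = 0; k < count; k++) {
            if (buffer[(int) ((cIndex + k) & mask)] != null) seen++;
        }
        return seen;
    }
}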

@franz1981 (Collaborator) left a comment

I have yet to figure out some corner cases, but from what I have understood we are now losing the total ordering of offers in the multi-producer scenario. https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpscCompoundQueue.java has a similar behaviour, due to different reasons: I don't think we have strong guarantees about it, but it is nonetheless a change that needs to be documented...
My other concern is that while we are searching for something "near" to be consumed in place of the proper position, we risk losing the chance to consume the proper element as fast as possible when it is nearly ready to be read...
Will come to better insight after a proper "digestion" of it ;)

@njhill (Contributor, Author) commented Aug 20, 2019

from what I have understood we are now losing the total ordering of offers in the multi-producer scenario

Once a new element e is found in some slot higher than the consumer index (say cIndex + 5), all prior slots (cIndex thru cIndex + 4) are re-read and if any are visible then e is discarded and the cycle repeats with the closer element and now-smaller range. If they are all (still) null then we can be sure that the cIndex + 5 producer published prior to the 0-4 producers despite having "started" its offer later. So it really does happen-before the others from pov of external offer/poll callers.

while we are searching for something "near" to be consumed in place of the proper position, we risk losing the chance to consume it as fast as possible

My intuition is that this wouldn't be a significant effect... at this point we are already spinning anyway, and if necessary an absolute limit could easily be imposed on the lookahead distance in case the producer index strays too far.

@franz1981 (Collaborator) commented Aug 20, 2019

I don't think that relying on element visibility (consumer side) can give meaningful info about the happens-before relationship between offering threads, given that such visibility is enforced by a release write... the only strongly consistent operation is the producer sequence increment (using a sequentially consistent compare-and-swap), so that's the only order to follow imo.
Anyway, maybe it is not a property that we want to preserve, so it's ok.
I'm not yet sure about the effects perf-wise and I would like to see some numbers and a proper benchmark, but this is nice work :)
Latency-wise I see good reasons it could be either worse or better than a spin, so I prefer to wait for some numbers ;)

@njhill (Contributor, Author) commented Aug 20, 2019

Thanks @franz1981, I agree it's best to let numbers do the talking re perf benefits! Re correctness I see what you mean now about the weaker writes on the producer side :( you're probably right, and I think this has much less value if so. Maybe it could still be useful for some applications as you suggest, but then it becomes a different kind of queue probably. I had the event loop application in mind and strong ordering is likely important for that. Will think about the barriers some more :)

@njhill (Contributor, Author) commented Aug 20, 2019

@franz1981 after reading a bit and thinking some more I'm not yet convinced that all hope is lost. If I'm not mistaken (extremely possible), where a happens-before relationship exists between offers from different threads, this will still be preserved by the ordered stores from the consumer visibility pov due to their cumulative causality.

And if I'm wrong about the above:

  • The typical producer/consumer H-B guarantees of concurrent queues still hold
  • I may have too quickly dismissed the remaining utility... I can imagine SC applications where strict ordering isn't required in which case this might still be a throughput win (especially if the consumer thread spends cycles processing between dequeued elements)
  • The second-pass logic could be removed since it doesn't buy us anything, which should further help latency a little

@franz1981 (Collaborator) commented Aug 20, 2019

There are several factors that make the ordering very unlikely to be preserved:

  • the release write of the element can be delayed if an offering thread is descheduled before performing it
  • the actual release barrier should (it is JMM-implementation dependent tbh) be before the actual element write

The only guarantee of putOrdered is (being enforced by the JMM to be) on a per-thread basis, and indeed it is not a sequentially consistent operation.
A happens-before relation effectively exists, but not across threads, unless we synchronize-with (using a volatile load) such an ordered store...
About the effectiveness and usability, yes, I believe it can be useful: my concern is that with a limiting factor like the contended cas (which will marry quite well with thread descheduling too), I'm not sold that saving cycles on the consumer side will be *the* real game changer, but I haven't yet benchmarked enough event loop applications with threads > cores, and probably in that case it could make a lot of difference...

@njhill (Contributor, Author) commented Aug 20, 2019

@franz1981 thanks for your patience!

the actual release barrier should (it is JMM-implementation dependent tbh) be before the actual element write

Based on my (mis?)interpretation of http://gee.cs.oswego.edu/dl/html/j9mm.html the release fence should always happen prior to the actual write?

A happens-before relation effectively exists, but not across threads, unless we synchronize-with (using a volatile load) such a store...

My understanding is that there's no guarantee for when a "subsequent" volatile read by another thread will see the write (in a seq. consistency sense), but once it does the H-B relationship is established.

I don't think we're interested in the act of enqueuing to establish a barrier between producers, just want to ensure that if T1 enqueues E1 and then establishes an independent H-B relationship with T2, subsequent enqueue of E2 by T2 is guaranteed to be returned from poll() after E1 has already been returned from poll(). If no independent H-B was established then the order of consumption is arbitrary/inconsequential anyhow.

Doesn't transitivity of H-B relationships ensure in this case that immediately after the consumer thread volatile-reads E2, E1 must be visible? (which is all we need since we won't return E2 without re-checking E1 first). We aren't relying on the ordered store of E1 for this visibility but rather the causal chain trailing from our successful read of E2.
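
A hedged illustration of the guarantee njhill is arguing for (the flag, class name and values are invented for the example): the only cross-producer order claimed to matter is one created by an external happens-before edge between the two offers.

import java.util.concurrent.atomic.AtomicBoolean;
import org.jctools.queues.MpscArrayQueue;

public class HappensBeforeDemo {
    public static void main(String[] args) throws InterruptedException {
        MpscArrayQueue<Integer> q = new MpscArrayQueue<>(1024);
        AtomicBoolean published = new AtomicBoolean();

        Thread t1 = new Thread(() -> {
            q.offer(1);          // E1
            published.set(true); // release: creates the external h-b edge
        });
        Thread t2 = new Thread(() -> {
            while (!published.get()) { } // acquire: synchronizes-with t1's set(true)
            q.offer(2);          // E2: the claim is that poll() must return E1 before E2
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
        // E2's offer happens-after E1's via the flag, so FIFO here means 1 then 2.
        // Without the flag the two offers simply race and either order is acceptable.
        System.out.println(q.poll() + " then " + q.poll());
    }
}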

This has some advantages:
- poll() will always succeed after consumer reads !isEmpty()
- Slots are released sooner, allowing offers to succeed for bounded queues in some cases where they would otherwise have failed
- Resulting code is simpler

At the expense of occasionally performing work on the critical path which _might not_ have always happened on the critical path before.
@franz1981 (Collaborator) commented Aug 20, 2019

I understand your point, but I'm not so sure users won't implicitly rely on such total ordering between "producers". The cas (which is not just a release write) and the ordered consume ensured by the original implementation allow the producer sequence progress to be totally ordered with all the synchronization actions (in the sense of sequential consistency) expressed in the program, and so the offer/poll pair too.
The proposed policy relies only on the timing of element visibility on the slow path, forcing users that implicitly rely on such order to provide it externally, but at a much higher (synchronization/ordering) cost, because now they have no control over the consume side other than when such operation terminates.
I cannot see an "easy" way to replicate the original ordering guarantee (user side) with this new policy, and that's why I think we need to investigate how much it matters...
I haven't quoted any of your sentences because I'm on the phone now, sorry

@franz1981 (Collaborator) commented:

Tomorrow I hope to have some time to find a good example of implicit user expectations: off the top of my head, I imagine a graph of workers using different queues to coordinate processing through them, but I'm not sure it will lead me to what I am looking for :)

@njhill (Contributor, Author) commented Aug 21, 2019

Thanks @franz1981 I really appreciate your thoughts/feedback. I'm still not quite on the same page though... it's likely due to incomplete understanding on my part but it still seems to me that all of the "meaningful" concurrency/visibility semantics should be preserved.

I don't think the producer cas ordering has significance in this context since it's already a race if two producers call offer(), unless one of those calls happens-before the other (for example because they are made sequentially from the same thread or a second thread is signalled after the first offer() call returns), in which case relative ordering of the corresponding consumption will still be maintained as reasoned in prior comment.

Imagine producer A calls offer() prior to producer B, but then is immediately suspended before performing the CAS. B will overtake and its element will be consumed first. From the outside I don't think the new kind of overtaking would appear to behave differently to this in any way that matters.

To put it another way - we can't care about sequential consistency w.r.t. atomic producer index increments, only w.r.t. "full" invocations of the offer() method.

But of course it would be great to come up with an example that disproves this, i.e. where this change introduces a race condition which wasn't there before (apart from where bounded capacity is used in a precise way for coordination ... that could obviously break down).

@franz1981 (Collaborator) commented Aug 21, 2019

I'm missing this point: the cas's purpose is that it cannot succeed with the same values for both producers; it serializes the producer sequence progress.
It always forces a happens-before relationship between 2 offering threads, by definition.
Considering that the offer operation release-writes the element after such cas, but not atomically with it, it introduces an additional level of concurrency.
The consume operation can leverage this added concurrency, by reading elements in a different order from the cas progress, and an observer can notice that.
If such order is, instead, preserved, the offer operation is really perceived as atomic.
I'm not saying that it is not correct, but that such difference, given that it is observable, could be something that a user relies on.

I could be wrong, so help me to understand it, but it seems your point is: given that the offering threads are not aware of who will win the cas, they don't know about such order, and then the overall system doesn't care which will be the order of consumed elements... makes sense?

@franz1981 (Collaborator) commented Aug 21, 2019

Ping to @nitsanw: I'm sure he has already faced and solved the dilemma of consume ordering and probably has some example of it: maybe it is a concern of mine (it often happens) that just doesn't hold...

@franz1981 (Collaborator) commented:

@njhill I'm learning a lot, bud, so I'm quite happy to hear your opinion as well: it is always good to dismantle any assumption, to fully understand them

@nitsanw (Contributor) commented Aug 21, 2019

Hi @njhill, and thanks for the POC code and the reasoning and discussion with @franz1981 that followed.
I think the principal observation that the consumer can "skip" the bubble and maintain correct behavior is solid. The questions that fall out of leaning on it are:

  • Is the new ordering still FIFO? I tend to agree that the FIFO is on the API level, not the counter level. And so if the "bubble" is something that happens during offer the item is not really "in". I think that as long as the consumer is limiting itself to a single read of the producer index there's no way for a given producer to observe non-FIFO through offer. I'm not sure how carefully worded the javadoc is for the Supplier based methods though.
  • Can we maintain a correct size view? size is supposed to be valid from any thread. This would require some careful consideration as we now have to manage an atomic read of both the consumerIndex and whatever we use to track the bubble-skipping reads.
  • Can we maintain a correct offer? Offer should only fail if the queue is full, but if the consumer is allowed to skip a stalled producer, are producers also allowed to do this?

I think this last point is going to be very hard to solve within the API constraints.

@nitsanw (Contributor) commented Aug 21, 2019

Also note that skipping the bubble only solves half the problem. The other being the return of control from poll in the face of such a bubble, which the API is unable to resolve without further changes to size/isEmpty and a means to indicate no bubbles are in flight.

@franz1981 (Collaborator) commented Aug 21, 2019

@njhill I have carefully read the previous comments and I see now what you mean: sorry that it took me some time :)

There is still something that intuitively seems not right to me, but I'm sure that is due to a wrong intuition about something that is not intuitive at all...
Thanks for the patience in explaining it to me :)

@nitsanw you're right and thanks for coming :)
It would be a shame not to use this optimization if it proves to make a difference - any ideas/opinions on it? Have you tried something similar in your past experiments on the qs?

@nitsanw (Contributor) commented Aug 21, 2019

On the MPSC array queue there's a "detailed relaxed offer":

public final int failFastOffer(final E e)

Which differentiates between full / CAS fail / success. The relaxedPoll API does not give us a reason for failing. In theory we could return some ERR code constant and the consumers would have to do something along the lines of:

E e = q.relaxedPoll();
if (e == null) {
    // empty
} else if (e == BUBBLE) {
    // bubble
} else {
    // handle element
}

I don't think this will be very popular, but let's pretend someone wants this; they can then go down the path of meandering past the bubble via an iterator kind of API. It sounds awful, but I'm just illustrating that this is a bigger problem on the API front than the implementation of a bubble-skipping consumer.
I think at this point I'm happy to consider bubble skipping if we can settle on a good API for it. We cannot use poll/relaxedPoll as they have a fixed relationship with offer and size, unless we can manage to keep the full package working. Maybe this is more doable for unbounded queues?

@franz1981 (Collaborator) commented Aug 21, 2019

@nitsanw

unless we can manage to keep the full package working

With bubble skipping do you mean the full-fat optimization of this PR?
Or just an API to communicate on the consumer side that a bubble has been hit (which could be implemented right now too)?
If the former, I cannot see (yet) how to handle the case where offer can fail due to a full queue that is not really full, but contains a bubble...
Or are you planning to implement producer bubble skipping too?

@njhill @nitsanw
Re

Can we maintain a correct size view? size is supposed to be valid from any thread. This would require some careful consideration as we now have to manage an atomic read of both the consumerIndex and whatever we use to track the bubble-skipping reads.

We can limit the sight of the range search (32/64 elements) while polling during bubble skipping and use some bits from the consumer index to allow size() to extract from it the number of elements consumed "out of order". It won't fix the issue on the offer side, but size would work fine.
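
A sketch of what that bit-stealing could look like: a hypothetical encoding, assuming the lookahead is capped so the out-of-order count fits in the stolen bits (all names invented).

final class ConsumerStateSketch {
    static final int SKIP_BITS = 6;                  // lookahead capped at 63 slots
    static final long SKIP_MASK = (1L << SKIP_BITS) - 1;

    // One volatile long would publish both values, so size() reads them atomically.
    static long encode(long cIndex, int consumedAhead) {
        return (cIndex << SKIP_BITS) | consumedAhead;
    }

    static long size(long consumerState, long pIndex) {
        long cIndex = consumerState >>> SKIP_BITS;
        long consumedAhead = consumerState & SKIP_MASK;
        return (pIndex - cIndex) - consumedAhead;    // in-flight elements minus out-of-order consumes
    }
}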

// This is called after the current cIndex slot has been read, and moves the consumer index
// to the next non-CONSUMED slot, nulling all preceding slots
spElement(buffer, offset, null);
// If no lookahead has happened then pIndex <= cIndex + 1 and we won't enter the loop
while (++cIndex < pIndex && lpElement(buffer, offset = calcElementOffset(cIndex)) == CONSUMED)
@franz1981 (Collaborator) commented Aug 21, 2019

Try to turn this long-based loop into an int-based one, or use batch nulling (Arrays::fill? probably worse) in 2 steps, checking array wrapping if needed. The former should be the better option

@nitsanw (Contributor) commented:

Arrays::fill optimization is only going to be a win if this is a very big gap, which is very unlikely.

}
else
pIndex = lvProducerIndex();
@franz1981 (Collaborator) commented:

I'm not sure I would query the producer index SPIN_COUNT times; I would do it just once.
It would harm the AtomicXXX/future VarHandle version because bounds checking will be performed each time. I would also try to maintain control over the maximum stride/distance of checked elements to save the cache miss when checking back the first element (so it would try to stay within 2 cache lines' worth of oop-sized elements).

@njhill (Contributor, Author) commented:

SPIN_COUNT is the number of iterations between producer index reads; this is to attenuate producer contention without preventing us from discovering more distant produced elements that might become visible sooner.

It would harm the AtomicXXX/future VarHandle version because bounds checking will be performed each time

Please elaborate... which bounds checking? :)

I would try to maintain control over the maximum stride/distance of checked elements

It seems like this could be sensible, but feel like it may be best to determine that empirically.

@franz1981 (Collaborator) commented Aug 21, 2019

Uuups I have misread the parentheses :P
Each time the producer index changes, the bounds checks injected by the JVM on the tryRange loop cannot be hoisted out to the outer loop (on pollMissPath) even if tryRange gets inlined; they will be performed each time.
Although it should be the "slow path", we don't know when any offered element will appear, so making such a loop as fast as possible helps guarantee good latencies.
If tryRange won't be inlined due to its size I suggest inlining it manually too.
These are just some "premature" optimizations, but they nonetheless seem plausible and easy to achieve if they don't break the algorithm

@njhill (Contributor, Author) commented Aug 21, 2019

Thanks @franz1981 @nitsanw. I'm unsure how I managed to ramble on about ordering semantics for so long without realizing I was just trying to say "FIFO" :)

@nitsanw that's a great summary w.r.t. the possible "gaps" and mostly aligns with my thinking.

I think that as long as the consumer is limiting itself to a single read of the producer index there's no way for a given producer to observe non-FIFO through offer.

This might be a subtlety that I missed; I don't quite follow why limiting the consumer to a single read of the producer index would be required (presumably you mean per poll). As currently implemented a given poll might read it 0 or more times (including > 1). Isn't the producer index (and its barriers) irrelevant to the consumer w.r.t. FIFO correctness? Don't we get everything we need from the partial order of slot writes?

I'm not sure how carefully worded the javadoc is for the Supplier based methods though.

I can't see why there should be any problem with the fill methods. For similar reasons to above I think successive Supplier.get() calls are still guaranteed to be consumed in order. The difference is that the sequence supplied by a given fill might no longer be consumed contiguously when there are concurrent producers. In fact this optimization seems like it might be especially beneficial in multi-producer scenarios involving fill ... currently there's HOL blocking where a slow Supplier will block consumption from any concurrent producers.

Can we maintain a correct size view?

I expect so :) At least after the latest commit emptiness will now be correct. I had in mind to maintain a separate counter but I like @franz1981's idea of stealing some consumer index bits so that we don't need to do two stores whenever the consumer index changes. I guess it comes down to which has the least overhead/complexity.

Can we maintain a correct offer?

I did not give much thought to this apart from having the same intuition that it would be quite hairy if even possible. I was hoping/expecting that many folks using bounded queues were doing so to avoid indefinite growth (e.g. form of back pressure) where the precise fullness/capacity behaviour might not be important. But like you say some API extension would be required in that case (maybe even just a constructor flag?). I will give some more thought to whether it's achievable, but expect that there may be some perf penalty if so.

For unbounded queues I think this is a non-issue? (as you alluded to) Which may be good news for netty since it uses unbounded by default and I bet there's very few folks who override that.

Also note that skipping the bubble only solves half the problem. The other being the return of control from poll in the face of such a bubble, which the API is unable to resolve without further changes to size/isEmpty and a means to indicate no bubbles are in flight.

I thought this change may have general self-contained benefit w.r.t. throughput so would be a good starting point along with similar changes to relaxedPoll and drain. But actually I don't think anything more is needed to circumvent the other half of the problem too: though size needs repair, isEmpty is already correct. Once the internal bubble-dodging is added to relaxedPoll, a non-blocking approach could be:

while (true) {
    E e = q.relaxedPoll();
    if (e != null) {
        // handle element
    } else if (q.isEmpty()) {
        // empty
    } else if ((e = q.relaxedPoll()) != null) {
        // handle element
    } else {
        // do something else
    }
}

Though a nice API addition i.m.o. would be consumer-only isEmpty/size methods.

@franz1981 (Collaborator) commented:

Very happy about the current direction of this PR: just curious to get some numbers (and a proper reproducible benchmark), and I would love to port it to the XADD mpsc q :P

@njhill (Contributor, Author) commented Aug 21, 2019

@franz1981 agreed, this may yet all be academic (but still interesting!) Ideas on benchmark design to expose the slow-producer problem are welcome. Thinking a large number of producers and some small artificial delay on the consumer side per consumed element... though the consumer needs to keep up, otherwise the "miss" path will rarely be taken. And what exactly to measure... I guess it would be easier to use fills with a slow Supplier, but that may be too contrived/special-case.

@nitsanw (Contributor) commented Aug 22, 2019

@njhill replying to some of the questions:

why consumer limiting to single read of producer index would be required?

If the consumer hits a bubble after reading the producerIndex, the ordering implies that at that point producerX has incremented and is somewhere between the increment and the element store. The consumer now starts a scan after the bubble, and is suspended; the producer completes the element store and a further offer. If the consumer is unsuspended at this point and re-reads the producerIndex (without restarting the scan from the first bubble), it can end up loading the next element out of order (with respect to the producer), thus violating FIFO.
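
An attempted reconstruction of that interleaving as a trace (slot numbers invented):

// producerX: CAS pIndex 5 -> 6                  (slot 5 claimed, element not yet stored: bubble)
// consumer : reads pIndex == 6, sees slot 5 empty, scans past it... and is suspended
// producerX: stores A into slot 5; offers again: CAS pIndex 6 -> 7, stores B into slot 6
// consumer : resumes, re-reads pIndex == 7 and continues the scan from slot 6
//            without restarting from slot 5 -> consumes B before A, violating FIFO
//            for producerX's own offers. Restarting the scan from the first bubble
//            after any re-read of pIndex avoids this.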

I can't see why there should be any problem with the fill

The `fill` implementation does 2 things differently:

  1. Supplier::get is after the CAS, what ordering guarantees do we have here? This is different to the element being "materialized" before the offer method.
  2. It allows claiming a range in the queue, so a bubble skipping consumer can interleave with the producer and be non-FIFO.

@njhill (Contributor, Author) commented Aug 22, 2019

Thanks @nitsanw. Re the first point, that makes complete sense. My confusion hopefully just stems from a misinterpretation of your original comment. When you said "I think that as long as the consumer is limiting itself to a single read of the producer index...", I took it to imply that my poll impl was incorrect as it currently stands, given that it may read the prod index an unlimited number of times. But I guess it was more of a clarifying observation and hypothetical, do I have that right? The crucial part I think is the scope and your parenthetical qualification above "... without restarting the scan from the first bubble". Could you confirm that it's ok for a single poll to read the producer index multiple times provided that this is obeyed, and that therefore the impl as it stands is solid in this regard? Sorry if I'm still off base.

Re the second point around fill, maybe it is just some differing intuition of what constitutes FIFO correctness. I'm proposing that the following guarantees hold:

  1. The queue additions from a particular fill invocation will all be consumed prior to those from any producer activity that happens-after it (or to be more explicit, happens-after it returns).
  2. The elements supplied by sequential invocations of Supplier::get within a given fill invocation will be consumed in order (relative to each other)
  3. If a Supplier::get method impl itself causes some other producer activity, then the additions from that other activity will be consumed after all elements previously supplied by that supplier within the corresponding fill invocation

To me this appears sufficient, but maybe the above is incorrect in some way, or maybe you have another guarantee in mind?

a bubble skipping consumer can interleave with the producer and be non-FIFO

I still don't follow this. Interleave with other producers sure (fine imo), but I don't see how the multi-pass algorithm could permit out-of-order consumption of a given producer's additions.

Say a particular fill invocation's supplier supplies elements 1,2,3,4,5. Their stores form a chain of h-b relationships. Then say the consumer's index points to slot 1 but it first discovers element 3 (volatile read). It will back track to slot 1 and will necessarily now observe element 1, since its store happened-before 3's, and the prior load of 3 must have happened-after its own corresponding store.

I'm not asserting the above with utmost confidence, just sharing my reasoning with the hope you can poke a hole in it... I know your time is valuable and it's much appreciated.

@njhill (Contributor, Author) commented Aug 23, 2019

I haven't digested the impls fully yet so maybe I'm wrong about this, but I just realized that maintaining a correct offer/capacity might not be so bad after all, even in the bounded case, as long as some additional memory overhead is accepted. How much is allowed could be chosen up front (corresponding to the max lookahead distance)... and for the linked array variants it would be a max, not used unless needed. For the vanilla MpscArrayQueue the additional space would be a fixed overhead.

My original disgust at the idea was based on the prospect of the producers having to wrap past the consumer to "fill in" consumed slots, not sure whether that was also in your mind @nitsanw. But if we allow extra room beyond the queue's capacity for the producer index to advance into then it just becomes a matter of some additional accounting. WDYT?

@franz1981 (Collaborator) commented:

@njhill hi Nick! OT: if you want to take a look and give some feedback, I've started preparing some wiki for the XADD q at https://github.com/franz1981/JCTools-wiki/blob/xadd_docs/MpscUnboundedXaddArrayQueue.md
I would like to try it on Netty too :)
That q is a perfect candidate for the change in this pr because, differently from the mpsc array q, the XADD q can spend some more time reaching the array to write the element into, making this optimization even more effective :)

@njhill (Contributor, Author) commented Sep 5, 2019

Hey @franz1981... the xadd queue looks great! I'm sure I'll have more feedback once I've had more time to absorb details of the impl. And what would be the reason to not use it in netty (at least in place of the current unbounded one)? :) My feeling is that ultimately a more specialized impl would make sense for netty with some similarities to the BlockingConsumer variants, so that we don't need a separate/external CAS guard for the wakeup. Maybe that could still be xadd-based somehow though.

Re this PR, I'd still be interested in your and @nitsanw's thoughts on the most recent questions. tl;dr:

  • Are there any remaining concerns around multiple reads of producer index, or fill FIFO-ness w.r.t. the POC logic as it currently stands?
  • Might it be acceptable to trade-off increased space for these better progress guarantees, such that the backing array(s) could grow larger than the configured bounded capacity? (e.g. 2x)
  • If the answer to prior question is yes, I think we could also preserve bounded API semantics and correct size(), for both the static and linked array variants.
  • Do you have any advice on how to stimulate "stalled producer" behaviour in a benchmark such that it's actually visible in the result? Hoping it's just my lack of knowledge/imagination in this area but if we can't even reproduce the "deficiency" synthetically, what does the change buy us practically apart from a "non-blocking" claim? Especially considering increased code complexity and aforementioned space trade-off.

// This is called after the current cIndex slot has been read, and moves the consumer index
// to the next non-CONSUMED slot, nulling all preceding slots
spElement(buffer, offset, null);
// If no lookahead has happened then pIndex <= cIndex + 1 and we won't enter the loop
@nitsanw (Contributor) commented:

This is still going to make the inlining budget for normal polls higher. The common case should be in poll. Exceptional cases in separate methods.

incrementConsumerIndex(buffer, offset, cIndex, seenPIndex);
return e;
}
if (seenPIndex <= cIndex && (seenPIndex = lvProducerIndex()) == cIndex)
@nitsanw (Contributor) commented:

This reverts the streaming behavior of the FastFlow-like algorithm to the cached variant of Lamport's algorithm (introduced by @mjpt777); it's a shame to lose out on that one.

@njhill (Contributor, Author) commented:

@nitsanw I don't follow, hoping you misread the logic... the only thing this changes (I think) is to possibly eliminate some (volatile) reads of the producer index by the consumer, via use of a consumer-local cached value. This simpler optimization could actually be done independently of the other changes here, though it becomes more significant when combined with the look-ahead scanning.

I.e. just changing from:

if (cIndex == lvProducerIndex()) { return null; }

to

if (cIndex >= cachedProducerIndex && cIndex == lvProducerIndex()) { return null; }

@nitsanw (Contributor) commented Sep 5, 2019

@njhill I'll do my best:

Are there any remaining concerns around multiple reads of producer index, or fill FIFO-ness w.r.t. the POC logic as it currently stands?
I re-read the tryRange. Yes that would work I think.

Might it be acceptable to trade-off increased space for these better progress guarantees, such that the backing array(s) could grow larger than the configured bounded capacity? (e.g. 2x)
I don't think that's appropriate for MpscArrayQueue

Do you have any advice on how to stimulate "stalled producer" behaviour in a benchmark such that it's actually visible in the result?
Manually, in a debugger, is one way. For automated testing you need to create a workload that is likely to hit this issue, which at a guess would be a consumer thread at high priority and more producers than cores at a lower priority, so that they get in each other's way and trigger bubbles. Use a small queue to further limit the amount of progress they can make. A further alternative is using bytecode generation to install a delay, or having a testing version where you inject bubbles just for development purposes.
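
One possible seam for the development-only bubble injection: a test-only two-phase producer built on plain atomics (names invented), not a change to JCTools itself.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.LockSupport;

final class BubbleInjector {
    // Claim the slot, stall, then publish, so a consumer reliably observes a
    // bubble of roughly the requested duration.
    static <E> void offerWithBubble(AtomicLong producerIndex,
                                    AtomicReferenceArray<Object> buffer,
                                    int mask, E e, long stallNanos) {
        long pIndex = producerIndex.getAndIncrement(); // bubble opens here
        LockSupport.parkNanos(stallNanos);             // simulate a descheduled producer
        buffer.lazySet((int) (pIndex & mask), e);      // bubble closes
    }
}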

Reading through the code I'm very uneasy with how much it complicates what was a very neat method :-(. Also not sure why the tests are disabled (sorry if this is already answered).

@njhill (Contributor, Author) commented Sep 5, 2019

Thanks @nitsanw. The changes as they stand were to demonstrate the approach, and I had been assuming they will require further profiling/refinement (such as isolating the hot path for inlining per your comment). The two disabled tests fail due to bounded offer not yet working (as discussed earlier); that's something else to be fixed, but I at least now have a clear idea of how it could be done, albeit at the cost of some extra space. So it seems this would better suit the linked array variants rather than this one - I only picked MpscArrayQueue initially since it was simplest.

Thanks for those very useful suggestions re the benchmark, that's the next thing I'll focus on when I get a chance. But I'm still concerned that demonstrating performance with synthetic bubbles won't necessarily imply meaningful real-world benefit.

@nitsanw (Contributor) commented Sep 6, 2019

@njhill thanks for the effort on this, it's an interesting direction at least intellectually :-)

@geektcp left a comment

LGTM

@franz1981 (Collaborator) commented:

@geektcp are you a bot..? Sorry for the maybe stupid question :)

@geektcp left a comment

ok

@geektcp left a comment

LGTM

@nitsanw (Contributor) commented May 9, 2020

I have blocked geektcp from the JCTools org
