aya: Implement RingBuf #294
Doing async with tokio is a little bit more verbose, but I'm doing it like this:

    loop {
        let mut guard = ring.readable_mut().await?;
        guard.get_inner_mut().process_ring(&mut |e| {
            // Do things
            Ok(())
        })?;
        guard.clear_ready();
    }
Looks like this can get stuck with epoll (miss the notification somehow). I'll investigate.
Throwing SeqCst at the consumer pos store fixes this:

    unsafe { (*self.consumer_pos_ptr).store(consumer_pos, Ordering::SeqCst) };

I'm not very good at reasoning about concurrency, but it's possible that the libbpf implementation has bogus synchronization. Cilium (Go) defaults to SeqCst, so it won't see the same issue.
EDIT: it might also be due to tokio migrating the read function across threads, while libbpf essentially assumes single-threaded polling.
Looking good so far.
I think that in the async version of this code we should, instead of using a callback, use the tx side of an mpsc channel, i.e.:
- https://doc.rust-lang.org/std/sync/mpsc/fn.channel.html
- https://docs.rs/tokio/latest/tokio/sync/mpsc/fn.channel.html
So Sender<T>.
This way:
- Ring processing speed is not bounded by the efficiency of the callback
- Ring processing is not stopped on error
- We don't need to wrap user error types to return them to the caller, so no anyhow dependency
Depending on the implementation we may want to use try_send() to report errors when the channel is closed or the buffer is full.
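A minimal sketch of the channel idea above, using only the standard library. The `events` slice stands in for the ring buffer, and `forward_events` and `ForwardStats` are hypothetical names for illustration, not part of aya. Each event is copied into a bounded channel with `try_send`, and failures are counted instead of stopping the pass:

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Outcome counters for one pass over the (mock) ring.
#[derive(Debug, Default, PartialEq)]
struct ForwardStats {
    sent: usize,
    dropped_full: usize,
    dropped_closed: usize,
}

/// Copies every event into the channel without blocking.
///
/// Hypothetical helper: `events` plays the role of the ring contents.
fn forward_events(events: &[&[u8]], tx: &SyncSender<Vec<u8>>) -> ForwardStats {
    let mut stats = ForwardStats::default();
    for event in events {
        // The event has to be copied because the channel owns its items.
        match tx.try_send(event.to_vec()) {
            Ok(()) => stats.sent += 1,
            // The receiver is too slow: the bounded buffer is full.
            Err(TrySendError::Full(_)) => stats.dropped_full += 1,
            // The receiver was dropped: nobody will read this event.
            Err(TrySendError::Disconnected(_)) => stats.dropped_closed += 1,
        }
    }
    stats
}
```

Note that `event.to_vec()` makes the per-event copy explicit, which is the cost this design would have to accept.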
aya/Cargo.toml
Outdated
@@ -23,6 +23,7 @@ tokio = { version = "1.2.0", features = ["macros", "rt", "rt-multi-thread", "net
async-std = { version = "1.9.0", optional = true }
async-io = { version = "1.3", optional = true }
log = "0.4"
anyhow = "1.0.41"
I'm not sure how I feel about this...
I think the callback error handling was indeed pretty crude; I'll evaluate options that don't require boxing.
aya/src/maps/ringbuf.rs
Outdated
/// Returns when either the callback returns an Err or there's no more events.
pub fn process_ring(
    &mut self,
    callback: &mut dyn FnMut(&[u8]) -> Result<(), CallbackError>,
I would make callback just a FnMut(&[u8]). I don't think we ought to stop ring processing on error...
The only way to stop would be to panic in callback, which given this isn't async is probably the best we could hope for.
Thanks for your review.
Using mpsc necessitates a copy — ring processing speed isn't really the concern here, as the application has the freedom to size the ring buffer. (Also mpsc means double ring buffering — not sure that's on point.) The only other zero-copy option would be returning a guard object that updates consumer pointer on Drop — but that's not much better I'd say. |
Thanks for taking this over! |
Re: the "stuck" polling issue, it's an actual upstream flaw. The ringbuf claims to be designed around acquire-release ordering, but in kernel space it implicitly relies on sequential consistency, while the userspace counterpart is simply busted. Consider the following simplified example, where the "reserve" and "commit" operations are fused and simply increment the producer counter. A notification is sent if and only if the incremented producer counter == consumer counter + 1.
With sequential consistency it's guaranteed that once p is updated from 2 to 3, c == 1 cannot be observed, since the write of c == 2 precedes the read of p == 2, which must happen before the write of p = 3. However, with acq-rel there's no meaningful causal relation here at all: the relationship is only established on reads, and the reads of c == 1 and p == 2 only establish a causal relationship with operations that happened before the first operation in the above diagram. What this means is that we need a SeqCst fence after the write on both sides. For the userspace, we use a SeqCst store; but the kernel space might seem doomed. Luckily, it actually does a
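The argument above has the shape of the classic store-buffering test: each side writes its own counter and then reads the other's. A minimal sketch, with two plain `AtomicU64` values standing in for the real producer and consumer positions (the names and the `run_once` helper are illustrative assumptions, not aya or kernel code). With `SeqCst` on all four operations, at least one side must observe the other's write; with only release stores and acquire loads, the outcome where both sides read the stale value 0 would be permitted by the memory model.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one store-then-load exchange between a mock producer and consumer.
///
/// Returns `(consumer_pos as seen by the producer, producer_pos as seen by
/// the consumer)`. Both counters start at 0 and each side stores 1.
fn run_once() -> (u64, u64) {
    let producer_pos = Arc::new(AtomicU64::new(0));
    let consumer_pos = Arc::new(AtomicU64::new(0));

    let (p, c) = (Arc::clone(&producer_pos), Arc::clone(&consumer_pos));
    let producer = thread::spawn(move || {
        // Publish a new record, then check whether the consumer caught up.
        p.store(1, Ordering::SeqCst);
        c.load(Ordering::SeqCst)
    });

    let (p, c) = (Arc::clone(&producer_pos), Arc::clone(&consumer_pos));
    let consumer = thread::spawn(move || {
        // Publish the consumed position, then check for new records.
        c.store(1, Ordering::SeqCst);
        p.load(Ordering::SeqCst)
    });

    (producer.join().unwrap(), consumer.join().unwrap())
}
```

In ring buffer terms, the forbidden outcome `(0, 0)` corresponds to the producer deciding no wakeup is needed while the consumer decides there is nothing left to read.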
A bunch of updates — documentation for |
Updated |
Suppressed a clippy lint, see code for details. |
@ishitatsuyuki codegen changes are in |
Having thought on this more, I agree - you're right. Different applications are going to have different requirements, some of which can be met by sizing the ringbuf, data notification control, etc. Either way, the API as it stands seems flexible enough that I could do whatever I want.
40969bc
to
7656849
Compare
Rebased now that the &self PR is merged. |
This looks amazing, see comments
aya/src/maps/ringbuf.rs
Outdated
/// Retrieve an event from the ring, pass it to the callback, mark it as consumed, then repeat.
/// Returns when there's no more events.
pub fn process_ring(&mut self, callback: &mut impl FnMut(&[u8])) {
Have we considered returning an iterator that consumes events instead of having a callback?
We might be able to do that, but not with an Iterator, because then we get into the streaming iterator and self reference hell.
Returning a non-Iterator handle struct that mutably borrows the RingBuf will probably work, and that can be used easily with a while loop like while let Some(entry) = ringbuf.next(). The reader pointer is incremented on Drop. What do you think?
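A rough sketch of what such a handle could look like, assuming an in-memory `MockRing` in place of the real memory-mapped ring (all type and function names here are hypothetical). The `Entry` guard mutably borrows the ring, dereferences to the current event bytes, and advances the consumer position when it is dropped:

```rust
use std::ops::Deref;

/// Stand-in for the real ring buffer: a list of events and a read cursor.
struct MockRing {
    events: Vec<Vec<u8>>,
    consumer_pos: usize,
}

/// Borrowed view of the current event; consuming happens on drop.
struct Entry<'a> {
    ring: &'a mut MockRing,
}

impl MockRing {
    /// Returns a guard for the next unread event, if there is one.
    fn next(&mut self) -> Option<Entry<'_>> {
        if self.consumer_pos < self.events.len() {
            Some(Entry { ring: self })
        } else {
            None
        }
    }
}

impl Deref for Entry<'_> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        &self.ring.events[self.ring.consumer_pos]
    }
}

impl Drop for Entry<'_> {
    fn drop(&mut self) {
        // Mark the event as consumed once the caller is done with it.
        self.ring.consumer_pos += 1;
    }
}

/// Drains the ring with the `while let` loop shape from the comment above.
fn drain(ring: &mut MockRing) -> Vec<Vec<u8>> {
    let mut seen = Vec::new();
    while let Some(entry) = ring.next() {
        seen.push(entry.to_vec());
    }
    seen
}
```

Because `Entry` holds a mutable borrow, the compiler rejects holding two entries at once, which is what keeps the borrowed bytes valid without an `Iterator` implementation.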
It's not immediately clear to me why it would be self referential?
Oops, it's not a self reference, but since we can't use lifetimes without GATs, the entry returned from the iterator will need something like Rc that references the main RingBuf to properly work.
aya/src/maps/ringbuf.rs
Outdated
    })
}

/// Retrieve an event from the ring, pass it to the callback, mark it as consumed, then repeat.
Meta comment about docs: for consistency the format we use is the same as rust's libstd. The format is roughly:
First line describes briefly and in 3rd person what the function does.
Second paragraph expands on details, mentioning in discursive form the arguments and the return value.
More paragraphs might be needed to explain caveats and edge cases.
# Example
...
Sentences always end with periods. Paragraphs are separated by an empty line.
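As an illustration of that layout, here is how a doc comment for a small helper could look. `count_events` and its behaviour are made up for this example and are not part of aya; the example section uses an indented block only to keep the snippet short.

```rust
/// Counts the non-empty events in `events`.
///
/// Each element of `events` is treated as one raw event payload. The return
/// value is the number of payloads whose length is greater than zero.
///
/// Empty payloads are skipped rather than reported as errors.
///
/// # Example
///
///     let events: [&[u8]; 3] = [b"a", b"", b"bc"];
///     assert_eq!(count_events(&events), 2);
fn count_events(events: &[&[u8]]) -> usize {
    events.iter().filter(|e| !e.is_empty()).count()
}
```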
I added some newlines. I don't write explanations for most of the arguments because most of the time they are the only argument and the meaning is obvious, but if you have any feedback on how the docs are written in this PR, let me know.
Also forgot to say in the review: I think we should implement the async bits of this, like we do for perf buffers |
f32ee79
to
c16591a
Compare
On async impl: for the callback model there's not much we can do inside aya to support an async model (see #294 (comment)); though, if we move to an iterator-like model then it probably makes sense to do that.
@ishitatsuyuki Hi! Do you have time to address the remaining comments? If not, I would be happy to take over. |
I'll try to get a revision on this later today. |
I put up #629 as a rebase of this change and an integration test. It does not do anything towards an async API, but I agree that we should leave that for a subsequent PR. |
@ishitatsuyuki did you ever have an example of a test program that demonstrated the epoll hang? I couldn't exactly follow your arguments enough to justify the fences in code, so I went with memory accesses like in libbpf for #629. I'd love to understand when that fails and how. |
Make sure that the epoll fd is configured with edge trigger. Level trigger is why libbpf "accidentally" works (after a memory fence, it calls into a kernel function that tests the ring counter again). |
I am using |
Just realized I previously created a repro as described here. Please try it and let me know if this one gets stuck. As I described at the time of investigation a SeqCst write is definitely necessary for correctness, when:
|
As for threads, you don't need more than one producer thread to reproduce this; this "race" can happen between just one consumer and one producer. |
Your repro program was helpful, thanks for it. I was being saved on my side by using I don't have a sense of the cost of these various approaches, but in general my sense is that a complete fence before sleeping is probably more expensive. What do you think? |
I suppose that the reason the |
In general on x86 I consider any instruction with the Aside, relying on atomic release RMW to provide SeqCst semantics is technically an ill assumption (iirc AcqRel RMW is actually equivalent to SeqCst even in the spec). |
This patch enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header.

The problem before this change was that nothing enforced a happens-before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible.

All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it.

The raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll.

Credit for the discovery of the bug[1] and guidance on how to fix it[2] belongs to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>.
New:
```
Single-producer, parallel producer
==================================
rb-libbpf            43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s)
output               68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s)
rb-libbpf nr_prod 2  23.722 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s)
rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s)
rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s)
rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s)
rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s)
rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s)
rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s)
rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s)
rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s)
```
Old:
```
Single-producer, parallel producer
==================================
rb-libbpf            20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s)
output               67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s)
rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s)
rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s)
rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s)
rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s)
rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s)
rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s)
```
[1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r
[2]: aya-rs/aya#294 (comment)
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New: ``` Single-producer, parallel producer ================================== rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s) output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s) output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s) rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 
0.000M/s) rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s) rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s) rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s) rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s) rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s) rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s) rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s) rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s) rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s) ``` Old: ``` Single-producer, parallel producer ================================== rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 
83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s) output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s) output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s) rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s) rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s) rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s) rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s) rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s) rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s) ``` [1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r [2]: aya-rs/aya#294 (comment)
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New: ``` Single-producer, parallel producer ================================== rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s) output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s) output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s) rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 
0.000M/s) rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s) rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s) rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s) rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s) rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s) rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s) rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s) rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s) rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s) ``` Old: ``` Single-producer, parallel producer ================================== rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 
83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s) output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s) output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s) rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s) rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s) rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s) rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s) rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s) rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s) ``` [1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r [2]: aya-rs/aya#294 (comment)
This patch strengthens the synchronization between libbpf and the producer in the kernel so that a notification cannot be lost when the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens-before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or ordered after the producer has cleared the busy bit, in which case the new message will be visible to the consumer.

All of this is in service of using EPOLLET, which performs fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it.

The raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. In summary, the benchmarks are substantially improved in almost all cases. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time spent interacting with epoll.

Credit for the discovery of the bug[1] and guidance on how to fix it[2] belongs to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>.
New:
```
Single-producer, parallel producer
==================================
rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s)
output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s)
output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s)
rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s)
rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s)
rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s)
rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s)
rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s)
rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s)
rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s)
rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s)
rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s)
```

Old:
```
Single-producer, parallel producer
==================================
rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000 83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s)
output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s)
output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s)
rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s)
rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s)
rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s)
rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s)
rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s)
rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s)
```

[1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r
[2]: aya-rs/aya#294 (comment)
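The ordering argument in the commit message can be sketched as a small store-then-load handshake. This is a simplified model and not the libbpf or kernel code: `one_round`, `consumer_pos`, and `record_committed` are hypothetical names, a boolean stands in for the header's busy bit, and the value `1` stands in for "the consumer has caught up". With both sides using sequentially consistent operations, at least one side must observe the other's write, so a round in which both read stale values (the lost-notification case) should not occur.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Runs one producer/consumer handshake.
/// Returns (producer_saw_consumer_caught_up, consumer_saw_new_record).
fn one_round() -> (bool, bool) {
    let consumer_pos = Arc::new(AtomicU64::new(0));
    let record_committed = Arc::new(AtomicBool::new(false));

    let producer = {
        let pos = Arc::clone(&consumer_pos);
        let committed = Arc::clone(&record_committed);
        thread::spawn(move || {
            // Producer: publish the record (stand-in for clearing the busy bit)...
            committed.store(true, Ordering::SeqCst);
            // ...then check whether the consumer caught up, to decide on a wakeup.
            pos.load(Ordering::SeqCst) == 1
        })
    };

    let consumer = {
        let pos = Arc::clone(&consumer_pos);
        let committed = Arc::clone(&record_committed);
        thread::spawn(move || {
            // Consumer: publish the new consumer position...
            pos.store(1, Ordering::SeqCst);
            // ...then re-check for a newly committed record before going to sleep.
            committed.load(Ordering::SeqCst)
        })
    };

    (producer.join().unwrap(), consumer.join().unwrap())
}

fn main() {
    for _ in 0..1000 {
        let (producer_saw, consumer_saw) = one_round();
        // Both sides reading stale values would mean a lost notification.
        assert!(producer_saw || consumer_saw, "both sides read stale values");
    }
    println!("no lost notification observed");
}
```

With weaker orderings on the store and the following load (for example `Release` and `Acquire`), the memory model would permit both loads to return the old values, which corresponds to the stale-view scenario the commit message describes.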
This patch strengthens the synchronization between libbpf and the producer in the kernel so that a notification cannot be lost when the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens-before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or ordered after the producer has cleared the busy bit, in which case the new message will be visible to the consumer.

All of this is in service of using EPOLLET, which performs fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it.

The raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. In summary, the benchmarks are substantially improved in almost all cases. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time spent interacting with epoll.

Credit for the discovery of the bug[1] and guidance on how to fix it[2] belongs to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>.
New:

```
Single-producer, parallel producer
==================================
rb-libbpf            43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s)
output               68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s)
rb-libbpf nr_prod 2  23.722 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s)
rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s)
rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s)
rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s)
rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s)
rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s)
rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s)
rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s)
rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s)
```

Old:

```
Single-producer, parallel producer
==================================
rb-libbpf            20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s)
output               67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s)
rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s)
rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s)
rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s)
rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s)
rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s)
rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s)
```

[1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r
[2]: aya-rs/aya#294 (comment)
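To put the raw numbers in perspective, the new-to-old throughput ratio can be computed for a few of the rows quoted in the commit message. This is a rough sketch: the `speedup` helper is a hypothetical name, the chosen rows are just a sample, and the ± error bars are ignored.

```rust
/// Ratio of new to old throughput (both in M/s); values above 1.0 mean faster.
fn speedup(new_mps: f64, old_mps: f64) -> f64 {
    new_mps / old_mps
}

fn main() {
    // (benchmark row, new M/s, old M/s), copied from the "New" and "Old" tables.
    let rows = [
        ("Single-producer, parallel producer", 43.366, 20.755),
        ("Single-producer, parallel producer, sampled notification", 41.163, 29.347),
        ("Ringbuf, multi-producer contention, nr_prod 1", 44.359, 18.486),
        ("Ringbuf sampled, reserve-sampled", 30.577, 33.925),
    ];
    for &(name, new_mps, old_mps) in rows.iter() {
        println!("{}: {:.2}x", name, speedup(new_mps, old_mps));
    }
}
```

The first three rows come out above 1.0 and the `reserve-sampled` row below it, which matches the commit message's note that this is the one case that seems worse.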
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New: ``` Single-producer, parallel producer ================================== rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s) output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s) output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s) rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 
0.000M/s) rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s) rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s) rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s) rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s) rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s) rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s) rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s) rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s) rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s) ``` Old: ``` Single-producer, parallel producer ================================== rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 
83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s) output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s) output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s) rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s) rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s) rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s) rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s) rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s) rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s) ``` [1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r [2]: aya-rs/aya#294 (comment)
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New: ``` Single-producer, parallel producer ================================== rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s) output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s) output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s) rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 
0.000M/s) rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s) rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s) rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s) rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s) rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s) rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s) rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s) rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s) rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s) ``` Old: ``` Single-producer, parallel producer ================================== rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 
83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s) output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s) output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s) rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s) rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s) rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s) rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s) rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s) rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s) ``` [1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r [2]: aya-rs/aya#294 (comment)
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New: ``` Single-producer, parallel producer ================================== rb-libbpf 43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s) output 68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s) output-sampled 30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s) rb-libbpf nr_prod 2 23.722 ± 0.024M/s (drops 0.000 ± 
0.000M/s) rb-libbpf nr_prod 3 14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s) rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s) rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s) rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s) rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s) rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s) rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s) rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s) rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s) ``` Old: ``` Single-producer, parallel producer ================================== rb-libbpf 20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s) Single-producer, parallel producer, sampled notification ======================================================== rb-libbpf 29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s) Single-producer, back-to-back mode ================================== rb-libbpf 60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s) rb-libbpf-sampled 60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, effect of sample rate =========================================== rb-sampled-1 1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s) rb-sampled-5 7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s) rb-sampled-10 13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s) rb-sampled-25 26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s) rb-sampled-50 40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s) rb-sampled-100 54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s) rb-sampled-250 66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s) rb-sampled-500 76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s) rb-sampled-1000 80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s) rb-sampled-2000 
83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s) rb-sampled-3000 83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s) Ringbuf back-to-back, reserve+commit vs output ============================================== reserve 77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s) output 67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s) Ringbuf sampled, reserve+commit vs output ========================================= reserve-sampled 33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s) output-sampled 30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s) Single-producer, consumer/producer competing on the same CPU, low batch count ============================================================================= rb-libbpf 0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s) Ringbuf, multi-producer contention ================================== rb-libbpf nr_prod 1 18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 2 22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 3 11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 4 11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 8 5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s) rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s) rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s) rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s) rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s) rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s) rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s) rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s) ``` [1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r [2]: aya-rs/aya#294 (comment)
This patch fixes enhances the synchronization between libbpf and the producer in the kernel so that notifications cannot be lost because the producer reads a stale view of the consumer position while the consumer also reads a stale view of either the producer position or the header. The problem before this change was that nothing enforced a happens before relationship between either of the writes and the subsequent reads. The use of a sequentially consistent write ensures that the write to the consumer position is either ordered before the producer clears the busy bit, in which case the producer will see that the consumer caught up, or the write will occur after the producer has cleared the busy bit, in which case the new message will be visible. All of this is in service of using EPOLLET, which will perform fewer wakeups and generally less work. This is borne out in the benchmark data below. Note that without the atomics change, the use of EPOLLET does not work, and the benchmarks and tests show it. The below raw benchmarks are below (I've omitted the irrelevant ones for brevity). The benchmarks were run on a 32-thread AMD Ryzen 9 7950X 16-Core Processor. The summary of the results is that the producer is that in almost all cases, the benchmarks are substantially improved. The only case which seems worse is "Ringbuf sampled, reserve+commit vs output", for the "reserve" case. I guess this makes sense because the consumer piece is more expensive, and the sampled notifications mean there's not a lot of time interacting with epoll. Credit for the discovery of the bug[1] and guidance on how to fix it[2] belong to Tatsuyuki Ishi <ishitatsuyuki@gmail.com>. 
New:
```
Single-producer, parallel producer
==================================
rb-libbpf            43.366 ± 0.277M/s (drops 0.848 ± 0.027M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            41.163 ± 0.031M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.671 ± 0.274M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    59.229 ± 0.422M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.507 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.095 ± 0.016M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.091 ± 0.046M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.259 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        39.831 ± 0.122M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       51.536 ± 2.984M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       67.850 ± 1.267M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       75.257 ± 0.438M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      74.939 ± 0.295M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      81.481 ± 0.769M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      82.637 ± 0.448M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              78.142 ± 0.104M/s (drops 0.000 ± 0.000M/s)
output               68.418 ± 0.032M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      30.577 ± 2.122M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.075 ± 1.089M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.570 ± 0.004M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  44.359 ± 0.319M/s (drops 0.091 ± 0.027M/s)
rb-libbpf nr_prod 2  23.722 ± 0.024M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  14.128 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  14.896 ± 0.020M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  6.056 ± 0.061M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.612 ± 0.042M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.684 ± 0.040M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 5.007 ± 0.046M/s (drops 0.001 ± 0.004M/s)
rb-libbpf nr_prod 24 5.207 ± 0.093M/s (drops 0.006 ± 0.013M/s)
rb-libbpf nr_prod 28 4.951 ± 0.073M/s (drops 0.030 ± 0.069M/s)
rb-libbpf nr_prod 32 4.509 ± 0.069M/s (drops 0.582 ± 0.057M/s)
rb-libbpf nr_prod 36 4.361 ± 0.064M/s (drops 0.733 ± 0.126M/s)
rb-libbpf nr_prod 40 4.261 ± 0.049M/s (drops 0.713 ± 0.116M/s)
rb-libbpf nr_prod 44 4.150 ± 0.207M/s (drops 0.841 ± 0.191M/s)
rb-libbpf nr_prod 48 4.033 ± 0.064M/s (drops 1.009 ± 0.082M/s)
rb-libbpf nr_prod 52 4.025 ± 0.049M/s (drops 1.012 ± 0.069M/s)
```

Old:
```
Single-producer, parallel producer
==================================
rb-libbpf            20.755 ± 0.396M/s (drops 0.000 ± 0.000M/s)

Single-producer, parallel producer, sampled notification
========================================================
rb-libbpf            29.347 ± 0.087M/s (drops 0.000 ± 0.000M/s)

Single-producer, back-to-back mode
==================================
rb-libbpf            60.791 ± 0.188M/s (drops 0.000 ± 0.000M/s)
rb-libbpf-sampled    60.125 ± 0.207M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, effect of sample rate
===========================================
rb-sampled-1         1.534 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-sampled-5         7.062 ± 0.029M/s (drops 0.000 ± 0.000M/s)
rb-sampled-10        13.093 ± 0.107M/s (drops 0.000 ± 0.000M/s)
rb-sampled-25        26.292 ± 0.118M/s (drops 0.000 ± 0.000M/s)
rb-sampled-50        40.230 ± 0.030M/s (drops 0.000 ± 0.000M/s)
rb-sampled-100       54.123 ± 0.334M/s (drops 0.000 ± 0.000M/s)
rb-sampled-250       66.054 ± 0.282M/s (drops 0.000 ± 0.000M/s)
rb-sampled-500       76.130 ± 0.648M/s (drops 0.000 ± 0.000M/s)
rb-sampled-1000      80.531 ± 0.169M/s (drops 0.000 ± 0.000M/s)
rb-sampled-2000      83.170 ± 0.376M/s (drops 0.000 ± 0.000M/s)
rb-sampled-3000      83.702 ± 0.046M/s (drops 0.000 ± 0.000M/s)

Ringbuf back-to-back, reserve+commit vs output
==============================================
reserve              77.829 ± 0.178M/s (drops 0.000 ± 0.000M/s)
output               67.974 ± 0.153M/s (drops 0.000 ± 0.000M/s)

Ringbuf sampled, reserve+commit vs output
=========================================
reserve-sampled      33.925 ± 0.101M/s (drops 0.000 ± 0.000M/s)
output-sampled       30.610 ± 0.070M/s (drops 0.000 ± 0.000M/s)

Single-producer, consumer/producer competing on the same CPU, low batch count
=============================================================================
rb-libbpf            0.565 ± 0.002M/s (drops 0.000 ± 0.000M/s)

Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1  18.486 ± 0.067M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2  22.009 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3  11.908 ± 0.023M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4  11.302 ± 0.031M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8  5.799 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 4.296 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 4.248 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 4.530 ± 0.032M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 4.607 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 4.470 ± 0.017M/s (drops 0.002 ± 0.007M/s)
rb-libbpf nr_prod 32 4.348 ± 0.051M/s (drops 0.703 ± 0.072M/s)
rb-libbpf nr_prod 36 4.248 ± 0.062M/s (drops 0.603 ± 0.102M/s)
rb-libbpf nr_prod 40 4.227 ± 0.051M/s (drops 0.805 ± 0.053M/s)
rb-libbpf nr_prod 44 4.100 ± 0.049M/s (drops 0.828 ± 0.063M/s)
rb-libbpf nr_prod 48 4.056 ± 0.065M/s (drops 0.922 ± 0.083M/s)
rb-libbpf nr_prod 52 4.051 ± 0.053M/s (drops 0.935 ± 0.093M/s)
```

[1]: https://lore.kernel.org/bpf/CANqewP1RFzD9TWgyyZy00ZVQhQr8QjmjUgyaaNK0g0_GJse=KA@mail.gmail.com/#r
[2]: aya-rs/aya#294 (comment)
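The ordering argument in the commit message can be sketched as a toy model in Rust. This is only an illustration under simplifying assumptions, not the libbpf or aya code: `Positions`, `consumer_advance`, `producer_commit`, and `race_once` are hypothetical names, and the real kernel producer publishes a record by clearing the busy bit in the record header rather than by bumping a position word. What the sketch does show is the property the fix relies on: when both sides do a sequentially consistent write followed by a read of the other side's word, at least one side observes the other's write, so either the producer decides to wake the consumer or the consumer sees the new record itself.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Toy stand-in for the two shared position words (hypothetical names).
struct Positions {
    consumer_pos: AtomicUsize,
    producer_pos: AtomicUsize,
}

impl Positions {
    /// Consumer side: publish the new position, then re-check the producer.
    /// Returns true if unread data is visible, so the consumer keeps draining.
    fn consumer_advance(&self, new_pos: usize) -> bool {
        self.consumer_pos.store(new_pos, Ordering::SeqCst);
        self.producer_pos.load(Ordering::SeqCst) != new_pos
    }

    /// Producer side: publish a record of `len` bytes, then decide on a wakeup.
    /// Returns true if the consumer had caught up to this record and must be woken.
    fn producer_commit(&self, len: usize) -> bool {
        let record_pos = self.producer_pos.fetch_add(len, Ordering::SeqCst);
        self.consumer_pos.load(Ordering::SeqCst) == record_pos
    }
}

/// Races one consumer advance against one producer commit.
/// Returns (producer_would_wake, consumer_saw_more_data).
fn race_once() -> (bool, bool) {
    // One 8-byte record was already produced; the consumer has read it but
    // has not yet published its new position.
    let ring = Positions {
        consumer_pos: AtomicUsize::new(0),
        producer_pos: AtomicUsize::new(8),
    };
    thread::scope(|s| {
        let producer = s.spawn(|| ring.producer_commit(8));
        let consumer = s.spawn(|| ring.consumer_advance(8));
        (producer.join().unwrap(), consumer.join().unwrap())
    })
}
```

Calling `race_once()` repeatedly should never return `(false, false)`, which is the lost-notification case. If the consumer's store were weakened to `Ordering::Release` and its load to `Ordering::Acquire`, the memory model would no longer rule that outcome out, which matches the stuck-epoll symptom reported earlier in this thread.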
@ishitatsuyuki, this pull request is now in conflict and requires a rebase. |
Seems like this PR is obsolete now that #629 got merged, no? |
Based on @willfindlay's branch, fixed the bugs I found and it appears to be working fine for my setup.
Branch based on #290.
Closes: #12
Closes: #201