Optimize mutexes #257
Conversation
I get slightly worried about the longer times that the mutexes are locked, and that it's no longer possible to read in parallel. There are a few internal readers that need reasonably low latency, such as the Volume and Loudness filters. With this change, polling statistics via the websocket has a larger chance of disturbing the processing. It may be ok, but it needs thorough testing.
Did you measure? I have just assumed that the unlocking/locking is so cheap that the total cost at the rate I'm unlocking/locking is negligible.
src/bin.rs (Outdated)

```rust
        err
    );
    {
        let mut next_cfg = next_config.lock().unwrap();
```
Unless I'm missing something, then this won't work. This lock must be kept free so that it's possible to update the config via the websocket.
You are right, pointy hat to me.
src/alsadevice.rs (Outdated)

```rust
            .unwrap_or(());
        break;
    let mut chunk = {
        let mut capture_status = params.capture_status.lock().unwrap();
```
This will keep the lock all the way until line 803. There are a few operations along the way that aren't guaranteed to be quick. Especially problematic is pushing messages to channels, since that may block.
Hmm, the special case of those channels blocking I had indeed not considered yet. I did trigger some end-of-streams and such, but I can't vouch for the coverage of such tests.
Of course we could send off those sends to their own thread, but that's getting silly.
The lines of code don't worry me so much; in fact, grouping all mutex operations together is the purpose here, and a fair number of arms are executed only for specific code paths. But indeed, whatever the case, the lock must be dropped quickly.
W.r.t. my main comment, I will dig in to diagnose which part of my changes caused the performance hit. It seems counter-intuitive, but here I am 😄
I agree that it needs thorough testing. I am positive that there are gains to be had here, but don't worry: if this turns out to be a dud then I have no problems shelving it. The fundamental idea is that every lock acquisition has a cost. Taking the Alsa playback thread as an example, it takes at least three locking operations before sending audio through the channel (in the case of a normally running playback device), more depending on the averages. For a single mutex lock with a latency cost of "1" in the uncontended case, the latency will be at least 1 * 2 (twice as expensive) * 3 (number of locks) = 6. You are right that in the contended case the latency is not "1" but "1 + queue_time", as we are waiting for the lock. So for the change to be a win, the average queue time of the combined lock must stay below the savings from doing fewer lock operations. I hope you don't perceive this as mansplaining, which is not my intention at all. Just trying to convey my thoughts.
It's not negligible, as for example this slide deck shows: https://www.slideshare.net/mitsunorikomatsu/performance-comparison-of-mutex-rwlock-and-atomic-types-in-rust. But at the same time this is also where I am starting to doubt myself. I have a Radxa Zero (think a Raspberry Pi Zero) as a lightweight SBC. It's useful to gauge performance exactly because it's not fast. As I am typing this I retook my measurements and guess what... the current state of this PR has higher cpu usage. Time to rethink this.
Yes, I can imagine. In the slides above you can see how much faster an atomic operation is than a mutex lock.
If and when we feel this is ripe enough we could consider soliciting assistance from the diyAudio community.
To improve parallelism, I'm slowly working on a change that's finer-grained again. If it works out, I will also look into transforming other parts the same way. Merry Christmas everyone!
Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of the number of used channels.
That's a mistake! It's supposed to be implemented in all backends (and I thought it already was).
I can add it in. Please ignore these broken builds as I'm working my way to a second iteration of this PR. I'll signal when I've got something presentable.
This looks to be shaping up well. On my aarch64 SBC (Alsa) it's 4% (relative) lower cpu load and on my x86 iMac (Core Audio) about 2%. Other backends I still need to test. Also I've yet to update the cpal backend to work with the other channels.
```rust
self.processing_status.write().current_volume[self.fader] = self.current_volume as f32;
}
self.processing_params
    .set_current_volume(self.fader, self.current_volume as f32);
```
I'm still thinking about this. It now always writes the current volume, when I think it is only necessary when in a ramp. That is not the default case, so in most cases this may be a wasted atomic operation. But so is storing a variable to see whether the "new" current volume differs from the "current" current volume... maybe we can make it smarter with a bit of refactoring.
The current volume just needs to be written while ramping, and at startup and config reload. It should not be overly difficult to handle those cases nicely. Then again, it might not be a worthwhile improvement now with the new much faster atomic writes :)
Indeed it might not. When you do an atomic operation with relaxed ordering, the assembly generated is just a store or load. So "guarding" the operation by doing any sort of "ifs" will only add cpu ops.
Another reason why I would like to refactor it, is that the code is now duplicated. Such a refactoring may still lead to the operation being skipped, but for reasons of maintainability and DRY-ness, not performance.
Agreed, getting rid of repetition is worth spending time on. But it's not unlikely that there will be more changes in these parts. I think it's fine for now.
I just finished reading through the latest version. Looks great!
Great, thanks. I have been on a business trip and so haven't had much time lately, but will pick up on it later again for any last wrinkles, if any.
As far as I can tell this is working perfectly. Thanks for all the good work!
Sure thing! Bit busy now so will pick up the other pieces later.
This PR is a first pass at optimising cpu load and latency under concurrency. I have marked it as draft because some questions and testing remain.
Previously, most mutexes were (a) `RwLock`s and (b) frequently locked-and-unlocked. This PR does the following:

Replace the `RwLock`s with `Mutex`es. `RwLock`s bring more complexity and overhead that is warranted only in the case of few writers and many (!) readers. Here, readers are not so many and we can do with a standard `Mutex` and much less overhead.

I have considered `parking_lot` as a faster-than-std mutex, but now think of staying with `std::sync::Mutex`. Since Rust 1.62 there is a new `Mutex` implementation on Linux that is faster than `parking_lot` when contended, and in one benchmark only 89-249 µs slower than `parking_lot` when slightly or uncontended. Other platforms have likewise had their `std::Mutex` improved: Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} rust-lang/rust#93740

Instead of locking-and-unlocking all the time, acquire a single lock to do all reads and writes, then release the lock as soon as possible. I think this strikes a nice balance, offering lower cpu usage at slightly higher contention.
It also ensures that statistics updates (which is what many of the mutexes are actually about) are consistent: average and peak values, and clipped sample counts, are updated atomically, where before a read could have returned a partial update. This is probably academic, but nice to have anyway. In some cases with two mutexes, I made sure to lock both mutexes to start a "transaction" and ensure a consistent state between them.
I have also added CI actions to test all backends and optional features, and fixed a couple of issues along the way. This is otherwise unrelated to this PR; let me know if you want to take it out.
The improvement is not earth-shattering, but ticks off a few percentage points of cpu usage and may improve latency a little.
I see opportunities for some further optimisations, like doing fewer buffer allocations while holding a lock, and can do those as a second pass in separate PRs as I start going file by file.
Questions:
I can work on replacing the `std::sync::mpsc` channels with `crossbeam-channel`, but coincidence has it that `crossbeam-channel` will become the default `std::sync::mpsc` implementation as of Rust 1.67, which is targeted for release end of January 2023. I propose that we either move everything to `crossbeam-channel` (and not depend on recent Rust compilers) or plan to remove it altogether. I would recommend the latter. What do you think?

Is it conceivable that `used_channels` changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.

In `ProcessingParameters`, there is a separate field `mute`. In `make_ramp` this is made equivalent to `-100.0`. Do you want to keep it as a separate field, or would you consider removing it in favour of `is_mute = x <= -100.0` instead? While less clean, it would pave the way to change this `Mutex` into an atomic read/write and gain some speed in a filter that's "always on".

What would happen without the barriers? Is it only to synchronise start-up / reload? I have not toyed with it yet, but I get the feeling that it would be possible to keep the supervisor loop running and check for required states with less of a hammer 😄 I know the barrier is not used in a loop so won't harm much, so just something minor.
Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.