
Optimize mutexes #257

Merged
merged 9 commits into HEnquist:next20 from optimize-mutexes on Jan 21, 2023
Conversation

@roderickvd
Contributor

commented Dec 19, 2022

This PR is a first pass at optimising CPU load and latency under concurrency. I have marked it as a draft because some questions and testing remain.

Previously, most mutexes were (a) RwLocks and (b) frequently locked-and-unlocked. This PR does the following:

  1. Replace the RwLocks with Mutexes. RwLocks bring extra complexity and overhead that is warranted only with few writers and many (!) readers. Here the readers are not that many, so we can do with a standard Mutex and much less overhead. (A minimal sketch of points 1 and 2 follows this list.)

    I have considered parking_lot as a faster-than-std mutex, but now think we should stay with std::sync::Mutex. Since Rust 1.62 there is a new Mutex implementation on Linux that is faster than parking_lot when contended, and in one benchmark only 89–249 µs slower than parking_lot when lightly contended or uncontended. Other platforms have likewise had their std::Mutex improved: Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} rust-lang/rust#93740

  2. Instead of locking-and-unlocking all the time, acquire a single lock to do all reads and writes, then release the lock as soon as possible. I think this strikes a nice balance, offering lower CPU usage at slightly higher contention.

    It also ensures that the update of statistics (which is what many of the mutexes are actually about) is consistent: average and peak values, and the clipped-sample count, are updated atomically, where before a read could have returned a partial update. This is probably academic, but nice to have anyway. In some cases with two mutexes, I made sure to lock both mutexes to start a "transaction" and ensure a consistent state between them.

  3. I have also added CI actions to test all backends and optional features, and fixed a couple of issues along the way. This is otherwise unrelated to this PR; let me know if you want to take it out.
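
A minimal sketch of points 1 and 2, with a hypothetical stats struct (the real field names differ):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stats struct, for illustration only.
struct CaptureStatus {
    avg_level: f32,
    peak_level: f32,
    clipped_samples: usize,
}

fn update_stats(status: &Arc<Mutex<CaptureStatus>>, avg: f32, peak: f32, clipped: usize) {
    // One plain Mutex instead of an RwLock, and a single lock for all
    // reads and writes. The guard is dropped at the end of the function,
    // so a concurrent reader can never observe a partial update.
    let mut s = status.lock().unwrap();
    s.avg_level = avg;
    s.peak_level = peak;
    s.clipped_samples += clipped;
}
```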

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

I see opportunities for further optimisations, like doing fewer buffer allocations while holding a lock; I can do those as a second pass in separate PRs, going file by file.

Questions:

  • I can work on replacing the std::sync::mpsc channels with crossbeam-channel, but coincidence has it that crossbeam-channel will become the default std::sync::mpsc implementation as of Rust 1.67, which is targeted for release end of January 2023.

    I propose that we either move everything to crossbeam-channel (and not depend on recent Rust compilers) or plan to remove it altogether. I would recommend the latter. What do you think?

  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.

  • In ProcessingParameters, there is a separate field mute. In make_ramp this is made equivalent to -100.0. Do you want to keep it as a separate field, or would you consider removing it in favour of is_mute = x <= -100.0 instead? While less clean, it would pave the way to change this Mutex into an atomic read/write and gain some speed in a filter that's "always on".

  • What would happen without the barriers? Is it only to synchronise start-up / reload? I have not toyed with it yet, but I get the feeling that it would be possible to keep the supervisor loop running and check for the required states with less of a hammer 😄 I know the barrier is not used in a loop so it won't hurt much; just something minor.

  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.

@HEnquist
Owner

  2. Instead of locking-and-unlocking all the time, acquire a single lock to do all reads and writes, then release the lock as soon as possible. I think this strikes a nice balance, offering lower CPU usage at slightly higher contention.
    It also ensures that the update of statistics (which is what many of the mutexes are actually about) is consistent: average and peak values, and the clipped-sample count, are updated atomically, where before a read could have returned a partial update. This is probably academic, but nice to have anyway. In some cases with two mutexes, I made sure to lock both mutexes to start a "transaction" and ensure a consistent state between them.

I get slightly worried about the longer times that the mutexes are locked, and that it's no longer possible to read in parallel. There are a few internal readers that need reasonably low latency, such as the Volume and Loudness filters. With this change, polling statistics via the websocket has a larger chance of disturbing the processing. It may be ok, but it needs thorough testing.

  3. I have also added CI actions to test all backends and optional features, and fixed a couple of issues along the way. This is otherwise unrelated to this PR; let me know if you want to take it out.
    Good, let's keep them.

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

Did you measure? I have just assumed that the unlocking/locking is so cheap that the total cost at the rate I'm unlocking/locking is negligible.

Questions:

  • I can work on replacing the std::sync::mpsc channels with crossbeam-channel, but coincidence has it that crossbeam-channel will become the default std::sync::mpsc implementation as of Rust 1.67, which is targeted for release end of January 2023.
    I propose that we either move everything to crossbeam-channel (and not depend on recent Rust compilers) or plan to remove it altogether. I would recommend the latter. What do you think?
    I didn't know that, then it definitely makes sense to wait.
  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.
    Yes, it can change at any time, if a channel is muted/unmuted in a mixer for example.
  • In ProcessingParameters, there is a separate field mute. In make_ramp this is made equivalent to -100.0. Do you want to keep it as a separate field, or would you consider removing it in favour of is_mute = x <= -100.0 instead? While less clean, it would pave the way to change this Mutex into an atomic read/write and gain some speed in a filter that's "always on".
    Both mute and volume are needed, otherwise it can't remember the volume setting while muted. The volume control has been quite tricky to get bug-free and it's pretty important that it behaves properly, please be careful in there :)
  • What would happen without the barriers? Is it only to synchronise start-up / reload? I have not toyed with it yet, but I get the feeling that it would be possible to keep the supervisor loop running and check for the required states with less of a hammer 😄 I know the barrier is not used in a loop so it won't hurt much; just something minor.
    The barriers are used to synchronise the start of the playback and capture devices. Without some kind of sync, one of them tends to run straight into a buffer over- or underrun at startup. I started with syncing by sending messages over channels, but that got very clunky. It was much easier to use barriers (see the sketch after this list).
  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.
    I can do that, but might take a while before I actually manage to do it.
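
A minimal sketch of the barrier hand-shake (illustrative, not the actual device code):

```rust
use std::sync::{Arc, Barrier};
use std::thread;

fn main() {
    // One party per device thread: capture and playback.
    let barrier = Arc::new(Barrier::new(2));
    let b = Arc::clone(&barrier);

    let playback = thread::spawn(move || {
        // ... open and prepare the playback device ...
        b.wait(); // block until the capture side is ready too
        // ... enter the playback loop ...
    });

    // ... open and prepare the capture device ...
    barrier.wait(); // both threads are released at the same moment
    // ... enter the capture loop ...

    playback.join().unwrap();
}
```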

src/bin.rs Outdated
err
);
{
let mut next_cfg = next_config.lock().unwrap();
Owner

Unless I'm missing something, then this won't work. This lock must be kept free so that it's possible to update the config via the websocket.

Contributor Author

You are right, pointy hat to me.

.unwrap_or(());
break;
let mut chunk = {
let mut capture_status = params.capture_status.lock().unwrap();
Owner

This will keep the lock all the way until line 803. There are a few operations along the way that aren't guaranteed to be quick. Especially problematic is pushing messages to channels, since that may block.

Contributor Author

Hmm, I had indeed not yet considered the special case of those channels blocking. I did trigger some end-of-stream conditions and such, but I can't vouch for the coverage of those tests.

Of course we could hand those sends off to their own thread, but that's getting silly.

The number of lines doesn't worry me so much; grouping all mutex operations together is in fact the purpose, and a fair number of arms are executed only on specific code paths. But whatever the case, you're right that the lock must be dropped quickly. Something like the scoping sketched below should do it.
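
A minimal sketch of the fix, with hypothetical names (snapshot whatever is needed inside a small scope, then send after the guard is dropped):

```rust
use std::sync::mpsc::SyncSender;
use std::sync::{Arc, Mutex};

// Hypothetical status type and channel payload, for illustration only.
fn forward_status(status: &Arc<Mutex<u64>>, tx: &SyncSender<u64>) {
    let snapshot = {
        let guard = status.lock().unwrap();
        *guard
    }; // guard dropped here, before the potentially blocking send

    tx.send(snapshot).unwrap(); // may block, but no lock is held
}
```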

W.r.t. my main comment, I will dig in to understand which part of my changes caused the performance hit. It seems counter-intuitive, but here I am 😄

@roderickvd
Contributor Author

I get slightly worried about the longer times that the mutexes are locked, and that it's no longer possible to read in parallel. There are a few internal readers that need reasonably low latency, such as the Volume and Loudness filters. With this change, polling statistics via the websocket has a larger chance of disturbing the processing. It may be ok, but it needs thorough testing.

I agree that it needs thorough testing. I am positive that there are gains to be had here, but don't worry -- if this turns out to be a dud then I have no problems shelving it.

The fundamental idea here is that acquiring an RwLock (even a read lock) is more expensive than acquiring a Mutex. By how much depends on the operating system. For the sake of argument, let's take a factor of two. It may well be more, but let's take the pessimistic case.

Taking the Alsa playback thread as an example, it takes at least three locking operations before sending audio through the channel (in the case of a normally running playback device), more depending on the averages. With a single Mutex lock costing "1" in the uncontended case, the RwLock path costs at least 1 * 2 (twice as expensive) * 3 (number of locks) = 6, versus 3 for plain Mutexes.

You are right that in the contended case the latency is not "1" but "1 + queue_time", as we are waiting for the lock. So for the change to be a net win, the average queue time must stay below the penalty incurred by the more complex RwLock (here: 3 * (1 + q) < 6, i.e. q < 1). I believe that this should be the case, with the socket server as the primary and almost only contender, but (a) it runs at a relatively low update rate so contention is low, and (b) it releases locks quickly so wait time is minimal.

I hope you don't perceive this as mansplaining, which is not my intention at all. Just trying to convey my thoughts.

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

Did you measure? I have just assumed that the unlocking/locking is so cheap that the total cost at the rate I'm unlocking/locking is negligible.

It's not negligible, as for example this slide deck shows: https://www.slideshare.net/mitsunorikomatsu/performance-comparison-of-mutex-rwlock-and-atomic-types-in-rust.

But at the same time this is also where I am starting to doubt myself.

I have a Radxa Zero (think a Raspberry Pi Zero) as a lightweight SBC. It's useful for gauging performance exactly because it's not fast -- in top, CPU values stand out much more than on my Core i5 iMac, for example. It's not scientific profiling by any means, which I think would be quite hard to do for CamillaDSP.

As I am typing this I retook my measurements and guess what... the current state of this PR has higher CPU usage. Time to rethink this.

Both mute and volume are needed, otherwise it can't remember the volume setting while muted. The volume control has been quite tricky to get bug-free and it's pretty important that it behaves properly, please be careful in there :)

Yes, I can imagine. In the slides above you can see how much faster an atomic operation is than a mutex lock. For ProcessingParameters we could still do this by not locking the entire struct, but by making all its fields atomics.

(There is no such thing as an AtomicF32, but as a nice little trick you can use f32::from_bits and f32::to_bits to convert from and into a u32.)
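
A minimal sketch of that trick (a hypothetical AtomicF32 wrapper, not an existing std type):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// An f32 stored bit-for-bit in an AtomicU32.
pub struct AtomicF32(AtomicU32);

impl AtomicF32 {
    pub fn new(v: f32) -> Self {
        AtomicF32(AtomicU32::new(v.to_bits()))
    }

    pub fn store(&self, v: f32, order: Ordering) {
        self.0.store(v.to_bits(), order);
    }

    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.0.load(order))
    }
}
```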

  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.
    I can do that, but might take a while before I actually manage to do it.

If and when we feel this is ripe enough we could consider soliciting assistance from the diyAudio community.

@roderickvd
Contributor Author

To improve parallelism, I'm slowly working on a change that's finer-grained again and based on parking_lot::RwLock. That RwLock has some nice features, like read locks that are upgradeable to write locks, and support for hardware lock elision for a nice performance boost on x86.
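
For example, an upgradeable read lets you check a value under a shared lock and only take the exclusive lock when a write is actually needed. A sketch, assuming parking_lot as a dependency:

```rust
use parking_lot::{RwLock, RwLockUpgradableReadGuard};

fn bump_if_below(value: &RwLock<u32>, limit: u32) {
    // Shared with plain readers, but upgradeable without unlocking first.
    let guard = value.upgradable_read();
    if *guard < limit {
        let mut w = RwLockUpgradableReadGuard::upgrade(guard);
        *w += 1;
    }
}
```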

If it works out, I will also look into transforming the ProcessingParameters into atomic operations.

I think that camillagui and socketserver could use an interface for bulk updates, reducing the number of calls and locks, but I can look at that in a later PR.

Merry Christmas everyone!

@roderickvd
Contributor Author

  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.

Yes, it can change at any time, if a channel is muted/unmuted in a mixer for example.

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

@HEnquist
Owner

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

That's a mistake! It's supposed to be implemented in all backends (and I thought it already was).

@roderickvd
Contributor Author

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

That's a mistake! It's supposed to be implemented in all backends (and I thought it already was).

I can add it in.

Please ignore these broken builds as I'm working my way to a second iteration of this PR. I'll signal when I've got something presentable.

@roderickvd
Contributor Author

This looks to be shaping up well. On my aarch64 SBC (Alsa) it's 4% (relative) lower CPU load, and on my x86 iMac (Core Audio) about 2%. Other backends I still need to test.

As part of testing I'm working on updating camillagui to work with the new resampler configuration.

Also, I've yet to update the cpal backend to work with the other channels.
Comments and thoughts welcome in the meantime.

self.processing_status.write().current_volume[self.fader] = self.current_volume as f32;
}
self.processing_params
.set_current_volume(self.fader, self.current_volume as f32);
Contributor Author

I'm still thinking about this. It now always writes the current volume, when I think it is only necessary during a ramp. That is not the default case, so this may be a wasted atomic operation most of the time. But so is storing a variable to check whether the "new" current volume differs from the "current" current volume... maybe we can make it smarter with a bit of refactoring.

Owner

The current volume just needs to be written while ramping, and at startup and config reload. It should not be overly difficult to handle those cases nicely. Then again, it might not be a worthwhile improvement now with the new much faster atomic writes :)

Contributor Author

Indeed it might not. When you do an atomic operation with relaxed ordering, the generated assembly is just a plain store or load. So "guarding" the operation with any sort of "ifs" will only add CPU ops.

Another reason I would like to refactor it is that the code is now duplicated. Such a refactoring may still lead to the operation being skipped, but for reasons of maintainability and DRY-ness, not performance.
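
A minimal sketch of the point about relaxed stores (hypothetical names; with Ordering::Relaxed the store compiles down to a single plain store, so branching to avoid it just adds work):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Write the volume unconditionally; the relaxed atomic store is a single
// plain store instruction on x86 and aarch64.
fn set_volume(volume_bits: &AtomicU32, new_volume: f32) {
    volume_bits.store(new_volume.to_bits(), Ordering::Relaxed);
}
```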

Owner

Agreed, getting rid of repetition is worth spending time on. But it's not unlikely that there will be more changes in these parts. I think it's fine for now.

@HEnquist
Owner

I just finished reading through the latest version. Looks great!
I'll try running it and switching configs etc, but I don't expect to find any trouble.

@roderickvd
Contributor Author

commented Jan 14, 2023

Great, thanks. I have been on a business trip and so haven't had much time lately, but I will pick it up again later to iron out any last wrinkles, if any.

@HEnquist
Owner

As far as I can tell this is working perfectly. Thanks for all the good work!

@HEnquist HEnquist marked this pull request as ready for review January 21, 2023 11:06
@HEnquist HEnquist merged commit 636d97f into HEnquist:next20 Jan 21, 2023
@roderickvd
Contributor Author

Sure thing! Bit busy now so will pick up the other pieces later.

@roderickvd roderickvd deleted the optimize-mutexes branch January 21, 2023 11:11