
Optimize mutexes #257

Merged
merged 9 commits into HEnquist:next20 from optimize-mutexes on Jan 21, 2023
Conversation

@roderickvd
Contributor

commented Dec 19, 2022

This PR is a first pass at optimising CPU load and latency under concurrency. I have marked it as a draft because some questions and testing remain.

Previously, most mutexes were (a) RwLocks and (b) frequently locked-and-unlocked. This PR does the following:

  1. Replace the RwLocks with Mutexes. RwLocks bring extra complexity and overhead that is warranted only with few writers and many (!) readers. Here the readers are not that many, so we can do with a standard Mutex and much less overhead. (A minimal sketch of points 1 and 2 follows this list.)

    I have considered parking_lot as a faster-than-std mutex, but now think we should stay with std::sync::Mutex. Since Rust 1.62 there is a new Mutex implementation on Linux that is faster than parking_lot when contended, and in one benchmark only 89–249 µs slower than parking_lot when lightly contended or uncontended. Other platforms have likewise had their std::Mutex improved: Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} rust-lang/rust#93740

  2. Instead of locking-and-unlocking all the time, acquire a single lock to do all reads and writes, then release the lock as soon as possible. I think this strikes a nice balance, offering lower CPU usage at slightly higher contention.

    It also ensures that the update of statistics (which is what many of the mutexes are actually about) is consistent: average and peak values, and the clipped-sample count, are updated atomically, where before a read could have returned a partial update. This is probably academic, but nice to have anyway. In some cases with two mutexes, I made sure to lock both mutexes to start a "transaction" and ensure a consistent state between them.

  3. I have also added CI actions to test all backends and optional features, and fixed a couple of issues along the way. This is otherwise unrelated to this PR; let me know if you want to take it out.
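
A minimal sketch of points 1 and 2, with a hypothetical stats struct (the real field names differ):

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stats struct, for illustration only.
struct CaptureStatus {
    avg_level: f32,
    peak_level: f32,
    clipped_samples: usize,
}

fn update_stats(status: &Arc<Mutex<CaptureStatus>>, avg: f32, peak: f32, clipped: usize) {
    // One plain Mutex instead of an RwLock, and a single lock for all
    // reads and writes. The guard is dropped at the end of the function,
    // so a concurrent reader can never observe a partial update.
    let mut s = status.lock().unwrap();
    s.avg_level = avg;
    s.peak_level = peak;
    s.clipped_samples += clipped;
}
```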

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

I see opportunities for further optimisations, like doing fewer buffer allocations while holding a lock; I can do those as a second pass in separate PRs, going file by file.

Questions:

  • I can work on replacing the std::sync::mpsc channels with crossbeam-channel, but coincidence has it that crossbeam-channel will become the default std::sync::mpsc implementation as of Rust 1.67, which is targeted for release end of January 2023.

    I propose that we either move everything to crossbeam-channel (and not depend on recent Rust compilers) or plan to remove it altogether. I would recommend the latter. What do you think?

  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.

  • In ProcessingParameters, there is a separate field mute. In make_ramp this is made equivalent to -100.0. Do you want to keep it as a separate field, or would you consider removing it in favour of is_mute = x <= -100.0 instead? While less clean, it would pave the way to change this Mutex into an atomic read/write and gain some speed in a filter that's "always on".

  • What would happen without the barriers? Is it only to synchronise start-up / reload? I have not toyed with it yet, but I get the feeling that it would be possible to keep the supervisor loop running and check for the required states with less of a hammer 😄 I know the barrier is not used in a loop so it won't hurt much; just something minor.

  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.

@HEnquist
Owner

  2. Instead of locking-and-unlocking all the time, acquire a single lock to do all reads and writes, then release the lock as soon as possible. I think this strikes a nice balance, offering lower CPU usage at slightly higher contention.
    It also ensures that the update of statistics (which is what many of the mutexes are actually about) is consistent: average and peak values, and the clipped-sample count, are updated atomically, where before a read could have returned a partial update. This is probably academic, but nice to have anyway. In some cases with two mutexes, I made sure to lock both mutexes to start a "transaction" and ensure a consistent state between them.

I get slightly worried about the longer times that the mutexes are locked, and that it's no longer possible to read in parallel. There are a few internal readers that need reasonably low latency, such as the Volume and Loudness filters. With this change, polling statistics via the websocket has a larger chance of disturbing the processing. It may be ok, but it needs thorough testing.

  3. I have also added CI actions to test all backends and optional features, and fixed a couple of issues along the way. This is otherwise unrelated to this PR; let me know if you want to take it out.
    Good, let's keep them.

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

Did you measure? I have just assumed that the unlocking/locking is so cheap that the total cost at the rate I'm unlocking/locking is negligible.

Questions:

  • I can work on replacing the std::sync::mpsc channels with crossbeam-channel, but coincidence has it that crossbeam-channel will become the default std::sync::mpsc implementation as of Rust 1.67, which is targeted for release end of January 2023.
    I propose that we either move everything to crossbeam-channel (and not depend on recent Rust compilers) or plan to remove it altogether. I would recommend the latter. What do you think?
    I didn't know that, then it definitely makes sense to wait.
  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.
    Yes, it can change at any time, if a channel is muted/unmuted in a mixer for example.
  • In ProcessingParameters, there is a separate field mute. In make_ramp this is made equivalent to -100.0. Do you want to keep it as a separate field, or would you consider removing it in favour of is_mute = x <= -100.0 instead? While less clean, it would pave the way to change this Mutex into an atomic read/write and gain some speed in a filter that's "always on".
    Both mute and volume are needed, otherwise it can't remember the volume setting while muted. The volume control has been quite tricky to get bug-free and it's pretty important that it behaves properly, please be careful in there :)
  • What would happen without the barriers? Is it only to synchronise start-up / reload? I have not toyed with it yet, but I get the feeling that it would be possible to keep the supervisor loop running and check for the required states with less of a hammer 😄 I know the barrier is not used in a loop so it won't hurt much; just something minor.
    The barriers are used to synchronise the start of the playback and capture devices. Without some kind of sync, one of them tends to run straight into a buffer over- or underrun at startup. I started with syncing by sending messages over channels, but that got very clunky. It was much easier to use barriers (see the sketch after this list).
  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.
    I can do that, but might take a while before I actually manage to do it.
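
A minimal sketch of the barrier hand-shake (illustrative, not the actual device code):

```rust
use std::sync::{Arc, Barrier};
use std::thread;

fn main() {
    // One party per device thread: capture and playback.
    let barrier = Arc::new(Barrier::new(2));
    let b = Arc::clone(&barrier);

    let playback = thread::spawn(move || {
        // ... open and prepare the playback device ...
        b.wait(); // block until the capture side is ready too
        // ... enter the playback loop ...
    });

    // ... open and prepare the capture device ...
    barrier.wait(); // both threads are released at the same moment
    // ... enter the capture loop ...

    playback.join().unwrap();
}
```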

src/bin.rs Outdated
err
);
{
let mut next_cfg = next_config.lock().unwrap();
Owner

Unless I'm missing something, then this won't work. This lock must be kept free so that it's possible to update the config via the websocket.

Contributor Author

You are right, pointy hat to me.

.unwrap_or(());
break;
let mut chunk = {
let mut capture_status = params.capture_status.lock().unwrap();
Owner

This will keep the lock all the way until line 803. There are a few operations along the way that aren't guaranteed to be quick. Especially problematic is pushing messages to channels, since that may block.

Contributor Author

Hmm, I had indeed not yet considered the special case of those channels blocking. I did trigger some end-of-stream conditions and such, but I can't vouch for the coverage of those tests.

Of course we could hand those sends off to their own thread, but that's getting silly.

The number of lines doesn't worry me so much; grouping all mutex operations together is in fact the purpose, and a fair number of arms are executed only on specific code paths. But whatever the case, you're right that the lock must be dropped quickly. Something like the scoping sketched below should do it.
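
A minimal sketch of the fix, with hypothetical names (snapshot whatever is needed inside a small scope, then send after the guard is dropped):

```rust
use std::sync::mpsc::SyncSender;
use std::sync::{Arc, Mutex};

// Hypothetical status type and channel payload, for illustration only.
fn forward_status(status: &Arc<Mutex<u64>>, tx: &SyncSender<u64>) {
    let snapshot = {
        let guard = status.lock().unwrap();
        *guard
    }; // guard dropped here, before the potentially blocking send

    tx.send(snapshot).unwrap(); // may block, but no lock is held
}
```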

W.r.t. my main comment, I will dig in to understand which part of my changes caused the performance hit. It seems counter-intuitive, but here I am 😄

@roderickvd
Contributor Author

I get slightly worried about the longer times that the mutexes are locked, and that it's no longer possible to read in parallel. There are a few internal readers that need reasonably low latency, such as the Volume and Loudness filters. With this change, polling statistics via the websocket has a larger chance of disturbing the processing. It may be ok, but it needs thorough testing.

I agree that it needs thorough testing. I am positive that there are gains to be had here, but don't worry -- if this turns out to be a dud then I have no problems shelving it.

The fundamental idea here is that acquiring an RwLock (even a read lock) is more expensive than acquiring a Mutex. By how much depends on the operating system. For the sake of argument, let's take a factor of two. It may well be more, but let's take the pessimistic case.

Taking the Alsa playback thread as an example, it takes at least three locking operations before sending audio through the channel (in the case of a normally running playback device), more depending on the averages. With a single Mutex lock costing "1" in the uncontended case, the RwLock path costs at least 1 * 2 (twice as expensive) * 3 (number of locks) = 6, versus 3 for plain Mutexes.

You are right that in the contended case the latency is not "1" but "1 + queue_time", as we are waiting for the lock. So for the change to be a net win, the average queue time must stay below the penalty incurred by the more complex RwLock (here: 3 * (1 + q) < 6, i.e. q < 1). I believe that this should be the case, with the socket server as the primary and almost only contender, but (a) it runs at a relatively low update rate so contention is low, and (b) it releases locks quickly so wait time is minimal.

I hope you don't perceive this as mansplaining, which is not my intention at all. Just trying to convey my thoughts.

The improvement is not earth-shattering, but it shaves off a few percentage points of CPU usage and may improve latency a little.

Did you measure? I have just assumed that the unlocking/locking is so cheap that the total cost at the rate I'm unlocking/locking is negligible.

It's not negligible, as for example this slide deck shows: https://www.slideshare.net/mitsunorikomatsu/performance-comparison-of-mutex-rwlock-and-atomic-types-in-rust.

But at the same time this is also where I am starting to doubt myself.

I have a Radxa Zero (think a Raspberry Pi Zero) as a lightweight SBC. It's useful for gauging performance exactly because it's not fast -- in top, CPU values stand out much more than on my Core i5 iMac, for example. It's not scientific profiling by any means, which I think would be quite hard to do for CamillaDSP.

As I am typing this I retook my measurements and guess what... the current state of this PR has higher CPU usage. Time to rethink this.

Both mute and volume are needed, otherwise it can't remember the volume setting while muted. The volume control has been quite tricky to get bug-free and it's pretty important that it behaves properly, please be careful in there :)

Yes, I can imagine. In the slides above you can see how much faster an atomic operation is than a mutex lock. For ProcessingParameters we could still do this by not locking the entire struct, but by making all its fields atomics.

(There is no such thing as an AtomicF32, but as a nice little trick you can use f32::from_bits and f32::to_bits to convert from and into a u32.)
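
A minimal sketch of that trick (a hypothetical AtomicF32 wrapper, not an existing std type):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// An f32 stored bit-for-bit in an AtomicU32.
pub struct AtomicF32(AtomicU32);

impl AtomicF32 {
    pub fn new(v: f32) -> Self {
        AtomicF32(AtomicU32::new(v.to_bits()))
    }

    pub fn store(&self, v: f32, order: Ordering) {
        self.0.store(v.to_bits(), order);
    }

    pub fn load(&self, order: Ordering) -> f32 {
        f32::from_bits(self.0.load(order))
    }
}
```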

  • Could someone verify this works well on Windows? I only have Linux and macOS devices I can compile and test on.
    I can do that, but might take a while before I actually manage to do it.

If and when we feel this is ripe enough we could consider soliciting assistance from the diyAudio community.

@roderickvd
Contributor Author

To improve parallelism, I'm slowly working on a change that's finer-grained again and based on parking_lot::RwLock. That RwLock has some nice features, like read locks that are upgradeable to write locks, and support for hardware lock elision for a nice performance boost on x86.
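
For example, an upgradeable read lets you check a value under a shared lock and only take the exclusive lock when a write is actually needed. A sketch, assuming parking_lot as a dependency:

```rust
use parking_lot::{RwLock, RwLockUpgradableReadGuard};

fn bump_if_below(value: &RwLock<u32>, limit: u32) {
    // Shared with plain readers, but upgradeable without unlocking first.
    let guard = value.upgradable_read();
    if *guard < limit {
        let mut w = RwLockUpgradableReadGuard::upgrade(guard);
        *w += 1;
    }
}
```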

If it works out, I will also look into transforming the ProcessingParameters into atomic operations.

I think that camillagui and socketserver could use an interface for bulk updates, reducing the number of calls and locks, but I can look at that in a later PR.

Merry Christmas everyone!

@roderickvd
Contributor Author

  • Is it conceivable that used_channels changes during runtime? If not, I could refactor a bit and make sure that buffer conversions are lock-free.

Yes, it can change at any time, if a channel is muted/unmuted in a mixer for example.

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

@HEnquist
Owner

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

That's a mistake! It's supposed to be implemented in all backends (and I thought it already was).

@roderickvd
Contributor Author

Working on this I notice that the cpal backend always uses the number of channels that it was opened with, instead of CaptureStatus::used_channels. Is this by design or should it be added?

That's a mistake! It's supposed to be implemented in all backends (and I thought it already was).

I can add it in.

Please ignore these broken builds as I'm working my way to a second iteration of this PR. I'll signal when I've got something presentable.

@roderickvd
Contributor Author

This looks to be shaping up well. On my aarch64 SBC (Alsa) it's 4% (relative) lower CPU load, and on my x86 iMac (Core Audio) about 2%. Other backends I still need to test.

As part of testing I'm working on updating camillagui to work with the new resampler configuration.

Also, I've yet to update the cpal backend to work with the other channels.
Comments and thoughts welcome in the meantime.

self.processing_status.write().current_volume[self.fader] = self.current_volume as f32;
}
self.processing_params
.set_current_volume(self.fader, self.current_volume as f32);
Contributor Author

I'm still thinking about this. It now always writes the current volume, when I think it is only necessary during a ramp. That is not the default case, so this may be a wasted atomic operation most of the time. But so is storing a variable to check whether the "new" current volume differs from the "current" current volume... maybe we can make it smarter with a bit of refactoring.

Owner

The current volume just needs to be written while ramping, and at startup and config reload. It should not be overly difficult to handle those cases nicely. Then again, it might not be a worthwhile improvement now with the new much faster atomic writes :)

Contributor Author

Indeed it might not. When you do an atomic operation with relaxed ordering, the generated assembly is just a plain store or load. So "guarding" the operation with any sort of "ifs" will only add CPU ops.

Another reason I would like to refactor it is that the code is now duplicated. Such a refactoring may still lead to the operation being skipped, but for reasons of maintainability and DRY-ness, not performance.
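
A minimal sketch of the point about relaxed stores (hypothetical names; with Ordering::Relaxed the store compiles down to a single plain store, so branching to avoid it just adds work):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Write the volume unconditionally; the relaxed atomic store is a single
// plain store instruction on x86 and aarch64.
fn set_volume(volume_bits: &AtomicU32, new_volume: f32) {
    volume_bits.store(new_volume.to_bits(), Ordering::Relaxed);
}
```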

Owner

Agreed, getting rid of repetition is worth spending time on. But it's not unlikely that there will be more changes in these parts. I think it's fine for now.

@HEnquist
Owner

I just finished reading through the latest version. Looks great!
I'll try running it and switching configs etc, but I don't expect to find any trouble.

@roderickvd
Contributor Author

commented Jan 14, 2023

Great, thanks. I have been on a business trip and so haven't had much time lately, but I will pick it up again later to iron out any last wrinkles, if any.

@HEnquist
Owner

As far as I can tell this is working perfectly. Thanks for all the good work!

@HEnquist HEnquist marked this pull request as ready for review January 21, 2023 11:06
@HEnquist HEnquist merged commit 636d97f into HEnquist:next20 Jan 21, 2023
@roderickvd
Contributor Author

Sure thing! Bit busy now so will pick up the other pieces later.

@roderickvd roderickvd deleted the optimize-mutexes branch January 21, 2023 11:11