Mutex performance drop in case of cache contention #418

Open
pingzhaozz opened this issue Nov 6, 2023 · 0 comments

pingzhaozz commented Nov 6, 2023

parking_lot's Mutex suffers a large performance drop under cache-line contention. The drop can also be reproduced with lock-bench:

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.030539ms min 105.760154ms max 131.87258ms
parking_lot::Mutex   avg 403.756547ms min 326.026509ms max 533.260014ms
spin::Mutex          avg 161.125953ms min 151.708034ms max 177.132377ms
AmdSpinlock          avg 158.042233ms min 148.723058ms max 171.265994ms

This was observed on a 120-core Intel CPU.

Debugging shows that the current spin() strategy enters the parking state as soon as spin-waiting fails. Judging from the lock-bench data, parking adds an overhead of roughly 100 ms. Under cache contention in multi-core, multi-threaded scenarios, spin() is much more likely to fail, which leads to longer lock durations (the lock protecting the list of parked threads can suffer cache contention as well). Given the roughly 100 ms cost of parking versus spin durations of about a millisecond or less, an intermediate buffer stage is needed between the two to avoid the performance loss caused by entering the parking state too often.

spinwait currently uses yield. When several threads run on the same core, yield effectively relieves contention. However, when the scheduler's ready queue contains only the current thread, yield has little effect and may not help at all. One possible fix is to sleep briefly after spin() fails and before parking. In testing this works well, better than using _mm_pause or yield: it substantially improves the lock-bench results and shows better CPU utilization. I'll submit a PR later.
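
To make the proposed ordering concrete, here is a minimal sketch of a lock that spins, then sleeps, and only then falls back to a slow path. It is not parking_lot's code: `SketchLock`, `Stage`, and the iteration counts are hypothetical names and values chosen for illustration, the spin phase is only loosely modelled on a relax-then-yield loop, and a plain sleep loop stands in for real parking.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

/// Which stage of the back-off acquired the lock (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Fast,
    Spin,
    Sleep,
    Fallback,
}

/// A toy lock used only to show the spin -> sleep -> fallback ordering.
pub struct SketchLock {
    locked: AtomicBool,
}

impl SketchLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn lock(&self) -> Stage {
        if self.try_lock() {
            return Stage::Fast;
        }
        // Stage 1: bounded spinning. The first rounds busy-wait with a CPU
        // relax hint, later rounds yield to the scheduler (assumed counts).
        for round in 0..10u32 {
            if round < 3 {
                for _ in 0..(1u32 << round) {
                    std::hint::spin_loop();
                }
            } else {
                thread::yield_now();
            }
            if self.try_lock() {
                return Stage::Spin;
            }
        }
        // Stage 2: the buffer proposed in this issue, sleeping 1 ms up to
        // 4 times before giving up on the cheap path.
        for _ in 0..4 {
            thread::sleep(Duration::from_millis(1));
            if self.try_lock() {
                return Stage::Sleep;
            }
        }
        // Stage 3: stand-in for parking. A real implementation would enqueue
        // the thread and block until it is woken by unlock().
        while !self.try_lock() {
            thread::sleep(Duration::from_millis(1));
        }
        Stage::Fallback
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

The returned `Stage` is there only so that a caller can count how often each stage is reached under contention; a real mutex would return a guard instead.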

Sleep 1ms before parking:

$ cargo run --release 32 2 10000 100
Options {
    n_threads: 32,
    n_locks: 2,
    n_ops: 10000,
    n_rounds: 100,
}

std::sync::Mutex     avg 113.276158ms min 103.870893ms max 131.823024ms
parking_lot::Mutex   avg 81.669426ms  min 72.584055ms  max 88.3535ms
spin::Mutex          avg 161.586476ms min 152.302867ms max 184.132674ms
AmdSpinlock          avg 157.446488ms min 147.091038ms max 180.832205ms

pingzhaozz added a commit to pingzhaozz/parking_lot that referenced this issue Nov 6, 2023
To avoid entering Parking too frequently in case of cache contention,
adding sleep 1ms, 4 times before parking and after old 'spin()'.

Signed-off-by: Ping Zhao <ping.zhao@intel.com>