
Optimizations in notify-one #12545

Open · wants to merge 1 commit into main from thread-opt
Conversation

casualwind

We tested on an Ice Lake server (vcpu = 160). The default configuration is allow_concurrent_memtable_write=1 with the thread count equal to the number of active cores. With our optimizations, the improvement reaches up to 184% in the fillseq case. Using op/s from db_bench as the performance indicator, the following are the performance improvements in several db_bench cases.

| case name         | optimized/original |
|-------------------|--------------------|
| fillrandom        | 182%               |
| fillseq           | 184%               |
| fillsync          | 136%               |
| overwrite         | 179%               |
| randomreplacekeys | 180%               |
| randomtransaction | 161%               |
| updaterandom      | 163%               |
| xorupdaterandom   | 165%               |

After analysis, we found that there are two concentrated notify-one hot spots in the write process. One is in LaunchParallelMemTableWriters, which wakes up the other writers to write their memtables, and the other is in ExitAsBatchGroupLeader, which sets all writer states to STATE_COMPLETED. Although writing the memtable itself is done in parallel, waking up the writers is not: a single writer is responsible for sequentially waking up all the others. We found that for writers in STATE_LOCKED_WAITING, the notify-one function must be called, and this call is very expensive, especially when many writers need to be awakened. So we try to optimize the cost of waking up writers. The following are our methods.
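For context, both hot spots share the same baseline shape: one thread walks the whole write group and issues notify-one calls back to back. A simplified sketch of that shape (WriterNode, kParallelWriter, and LaunchSequential are illustrative stand-ins, not RocksDB's actual types):

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

struct WriterNode {
  std::mutex mu;
  std::condition_variable cv;
  int state = 0;

  void SetState(int new_state) {
    {
      std::lock_guard<std::mutex> lk(mu);
      state = new_state;
    }
    cv.notify_one();  // expensive when the target writer is blocked
  }
};

constexpr int kParallelWriter = 1;  // stand-in for a real writer state

// The leader alone wakes the whole group: n-1 sequential SetState
// calls, each potentially paying a mutex lock plus a notify_one.
void LaunchSequential(std::vector<WriterNode*>& group) {
  for (WriterNode* w : group) {
    w->SetState(kParallelWriter);
  }
}
```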

Assume that there are currently n threads in total:

1. Parallelize SetState in LaunchParallelMemTableWriters

To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) "caller" writers; then those callers and the leader each wake up roughly n/(n^0.5) = n^0.5 writers, which go on to write the memtable. This reduces the number of writers the leader has to SetState in turn from n - 1 to about 2*(n^0.5). A runnable simulation of this two-level wake-up follows.
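Below is a minimal, self-contained simulation of the idea (not the actual RocksDB code; WriterSim, the slice width, and the caller selection are illustrative). With n = 16, the leader issues 6 SetState calls instead of 15:

```cpp
#include <algorithm>
#include <cmath>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct WriterSim {
  std::mutex mu;
  std::condition_variable cv;
  bool go = false;

  void Await() {
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [this] { return go; });
  }
  void SetState() {  // the notify-one whose cost the PR spreads out
    {
      std::lock_guard<std::mutex> lk(mu);
      go = true;
    }
    cv.notify_one();
  }
};

int main() {
  const size_t n = 16;  // total writers; index 0 plays the leader
  const size_t stride = static_cast<size_t>(std::sqrt(static_cast<double>(n)));
  std::vector<WriterSim> writers(n);

  std::vector<std::thread> threads;
  for (size_t i = 1; i < n; ++i) {
    threads.emplace_back([&writers, i, stride, n] {
      writers[i].Await();
      if (i % stride == 0) {
        // This writer is a "caller": before writing its own memtable,
        // it wakes the remaining writers of its slice.
        for (size_t j = i + 1; j < std::min(i + stride, n); ++j) {
          writers[j].SetState();
        }
      }
      std::printf("writer %zu writes its memtable\n", i);
    });
  }

  // The leader wakes one caller per slice (n^0.5 - 1 of them), then the
  // writers in its own slice: about 2*(n^0.5) SetState calls, not n-1.
  for (size_t i = stride; i < n; i += stride) {
    writers[i].SetState();
  }
  for (size_t j = 1; j < stride; ++j) {
    writers[j].SetState();
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```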

2. Wake up non-STATE_LOCKED_WAITING writers with priority

The last writer first sets the state of the writers that are not in STATE_LOCKED_WAITING to STATE_COMPLETED (a cheap state store with no notification), and only then calls notify-one sequentially for the writers that are in STATE_LOCKED_WAITING. This way, writers that are still in the non-blocking wait are released quickly and do not fall into the blocking wait at this stage because of a long waiting time. A sketch of this two-pass completion follows.
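A hypothetical two-pass sketch of this priority (not RocksDB's actual ExitAsBatchGroupLeader; Writer, kSpinning, kLockedWaiting, and kCompleted are illustrative stand-ins):

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vector>

enum : uint8_t { kSpinning = 0, kLockedWaiting = 1, kCompleted = 2 };

struct Writer {
  std::atomic<uint8_t> state{kSpinning};
  std::mutex mu;
  std::condition_variable cv;

  void NotifyCompleted() {  // the expensive notify-one path
    std::lock_guard<std::mutex> lk(mu);
    cv.notify_one();
  }
};

void CompleteGroup(std::vector<Writer*>& group) {
  // Pass 1: writers still spinning only need an atomic store; they see
  // kCompleted on their next spin iteration and never pay notify-one.
  for (Writer* w : group) {
    uint8_t expected = kSpinning;
    w->state.compare_exchange_strong(expected, kCompleted);
  }
  // Pass 2: writers that already fell into the blocking wait
  // (STATE_LOCKED_WAITING in RocksDB) need mutex + notify-one.
  for (Writer* w : group) {
    if (w->state.load(std::memory_order_acquire) == kLockedWaiting) {
      w->state.store(kCompleted, std::memory_order_release);
      w->NotifyCompleted();
    }
  }
}
```

The design point is the cost asymmetry: releasing a spinning writer is one store, while releasing a blocked one costs a lock and a condition-variable signal, so handling the cheap cases first shortens everyone's wait.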

For example, in the fillrandom case in db_bench, after reaching a certain number of cores, the score of the original version begins to decline, while with our optimization the score improves significantly over the original version. The following shows the one-pager of the core-scaling score (op/s).

A reproduction script:
./db_bench --benchmarks="fillrandom" --threads ${number of all active vcpus} --seed 1708494134896523 --duration 60

[Figure: core-scaling score (op/s), original vs. optimized]

@facebook-github-bot
Contributor

Hi @casualwind!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@casualwind changed the title from "Thread opt" to "Optimizations in notify-one to improve the performance" on Apr 16, 2024
@casualwind changed the title from "Optimizations in notify-one to improve the performance" to "Optimizations in notify-one to" on Apr 16, 2024
@casualwind changed the title from "Optimizations in notify-one to" to "Optimizations in notify-one" on Apr 16, 2024
@casualwind
Author

Hello, I have signed the CLA, and this is my first PR to RocksDB. Thank you for any review!

@casualwind
Author

@ajkr Could you or your colleagues help review my PR? I think this kind of modification may help performance on servers with high core counts. If you have any questions, please let me know. Thank you.

@cbi42
Member

cbi42 commented May 10, 2024

Hi, thanks for the PR. The change makes sense to me. I wonder if you have results for the performance improvement of each step (1. parallelize SetState vs. 2. wake up non-STATE_LOCKED_WAITING first). I did some benchmarking with fillrandom, and I saw that 1. improves performance but 2. does not show much improvement.

Command:
./db_bench --benchmarks=fillrandom[-X5] --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

main:
fillrandom [AVG    5 runs] : 130401 (± 535) ops/sec;   14.4 (± 0.1) MB/sec
fillrandom [MEDIAN 5 runs] : 130207 ops/sec;   14.4 MB/sec

this PR: 
fillrandom [AVG    5 runs] : 282247 (± 70009) ops/sec;   31.2 (± 7.7) MB/sec
fillrandom [MEDIAN 5 runs] : 317884 ops/sec;   35.2 MB/sec

this PR without 2.
fillrandom [AVG    5 runs] : 281672 (± 70145) ops/sec;   31.2 (± 7.8) MB/sec
fillrandom [MEDIAN 5 runs] : 317164 ops/sec;   35.1 MB/sec

@casualwind
Author

casualwind commented May 13, 2024

@cbi42 Thank you for your review!
For each step (1. parallelize SetState, 2. wake up non-STATE_LOCKED_WAITING first):

We had tested the performance before (with the default --enable_pipelined_write=1), and step 2's impact on performance is indeed close to none.

We put them together because in our previous tests (on a main branch that was not the newest), with --enable_pipelined_write=0, step 1 + step 2 performed better than step 1 alone. Now that we have tested on the latest main branch, step 2 has no impact on performance in either case.

So currently we only need step 1 rather than step 1 + step 2.

@casualwind
Author

@cbi42
We also have a question about the following two commands:

Command 1:
./db_bench --benchmarks=fillrandom[-X5] --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

Command 2:
./db_bench --benchmarks=fillrandom --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

When we run command 1: stdev/average is nearly 25%.
When we run command 2 five times: stdev/average is under 1%.

We wonder why the gap between them is so large.

@cbi42
Member

cbi42 commented May 14, 2024

> We wonder why the gap between them is so large.

I just realized that fillrandom[-X5] will reuse the same DB, so it's probably better to just run fillrandom without the [-X5].

@casualwind
Author

@cbi42 Thank you for clearing up our doubts.


// The minimum group size to allow the group to use parallel caller mode.
// The number must be no lower than 3.
const size_t MinParallelSize = 5;
Member

Consider setting a higher MinParallelSize to avoid performance regression. From the graph in the summary, it seems the optimization helps when there are close to 40 threads. With ./db_bench --benchmarks=fillrandom[-X1] --threads=40 --seed=1708494134896523 --duration=10 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=1 --disable_wal=0 --enable_pipelined_write=1 --num=100000000 --batch_size=1 and logging write group size in statistics, I got the following write group size distribution:
P50 : 17.658324 P95 : 29.367381 P99 : 35.859741 P100 : 40.000000 COUNT : 650853 SUM : 10974960
Maybe we can set the threshold to around 20?

Author

The threshold has been changed to 20. I will re-check the one-pager of the core-scaling score with the newest version.

@casualwind force-pushed the thread-opt branch 2 times, most recently from 55e2bf6 to af1fac6 on May 15, 2024 08:39
@cbi42
Member

cbi42 commented May 16, 2024

LGTM. Could you update the PR summary with only step 1 and its benchmark results, and add a change log entry under https://github.com/facebook/rocksdb/tree/main/unreleased_history/performance_improvements?

@facebook-github-bot
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@casualwind has updated the pull request. You must reimport the pull request before landing.

@casualwind
Author

@cbi42 We have updated the PR summary and added a change log.

We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we parallelize this process.

To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the number of writers the leader has to SetState in turn from n - 1 to about 2*(n^0.5).

vcpu=160, benchmark=db_bench
The score is normalized:
| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |