Optimizations in notify-one #12545
base: main
Conversation
Hi @casualwind! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations; afterwards, the pull request will be tagged accordingly.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Hello, I have signed the CLA, and this is my first PR to RocksDB. Thank you for any review!
@ajkr Could you or your colleagues help review my PR? I think this kind of modification may help performance on high-core-count servers. If you have any questions, please let me know. Thank you.
Hi, thanks for the PR. The change makes sense to me. I wonder if you have results for the performance improvement of each step (1. parallelize SetState vs. 2. wake up non-STATE_LOCKED_WAITING writers first). I did some benchmarking with fillrandom and saw that step 1 improves performance but step 2 does not show much improvement.
@cbi42 Thank you for your review! We had tested the performance before (with the default --enable_pipelined_write=1), and step 2's impact on performance is indeed close to none. We put them together because in our previous tests (on a main branch that was not the newest), with --enable_pipelined_write=0, step 2 + step 1 performed better than step 1 alone. Now that we have tested on the latest main branch, step 2 has no impact on performance in either configuration. Currently we only need step 1 rather than step 1 + step 2.
@cbi42 There are 2 commands:
We ran command 1: stdev/average is nearly 25%. We wonder why the gap is so large?
I just realized that fillrandom[-X5] will reuse the same DB, so it's probably better to just run fillrandom without the [-X5].
@cbi42 Thank you for clearing up our doubts.
db/write_thread.cc (Outdated)

// The minimum number to allow the group to use parallel caller mode.
// The number must not be lower than 3.
const size_t MinParallelSize = 5;
Consider setting a higher MinParallelSize to avoid performance regression. From the graph in the summary, it seems the optimization helps when there are close to 40 threads. With

./db_bench --benchmarks=fillrandom[-X1] --threads=40 --seed=1708494134896523 --duration=10 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=1 --disable_wal=0 --enable_pipelined_write=1 --num=100000000 --batch_size=1

and logging write group size in statistics, I got the following write group size distribution:

P50 : 17.658324 P95 : 29.367381 P99 : 35.859741 P100 : 40.000000 COUNT : 650853 SUM : 10974960

Maybe we can set the threshold to around 20?
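For illustration, the threshold discussed here would gate the parallel wake-up path roughly as in the sketch below. This is only a hedged sketch under assumed names: `WriteGroup`, `LaunchSequential`, `LaunchParallel`, and `kMinParallelSize` are stand-ins, not the exact identifiers in the patch.

```cpp
#include <cstddef>

// Illustrative stand-ins only; the actual type, constant, and function names
// in the patch may differ.
struct WriteGroup { size_t size = 0; };
void LaunchSequential(WriteGroup* /*group*/) {}  // plain leader-only SetState loop
void LaunchParallel(WriteGroup* /*group*/) {}    // sqrt(n) caller fan-out (step 1)

// Roughly the P50 of the write-group-size histogram quoted above.
constexpr size_t kMinParallelSize = 20;

void LaunchMemTableWriters(WriteGroup* group) {
  if (group->size < kMinParallelSize) {
    // Small groups: the fan-out's extra coordination costs more than the
    // handful of SetState calls it saves, so keep the sequential path.
    LaunchSequential(group);
  } else {
    LaunchParallel(group);
  }
}
```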
The threshold has been changed to 20. I will check the one-pager of the core-scaling score with the newest version.
55e2bf6 to af1fac6 (Compare)
LGTM, could you update the PR summary with only step 1 and the benchmark, and add a change log under https://github.com/facebook/rocksdb/tree/main/unreleased_history/performance_improvements?
@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@casualwind has updated the pull request. You must reimport the pull request before landing.
@cbi42 We have updated the PR summary and added a change log.
We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we parallelize this process. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n / n^0.5 = n^0.5 of the remaining writers separately to write to the memtable. This reduces the leader's work from calling SetState on n-1 writers in turn to about 2*(n^0.5) writers.

vcpu=160, benchmark=db_bench. The scores are normalized:

| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |
We tested on an Icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, with the thread count equal to the number of active cores. With our optimizations, the improvement can reach up to 184% in the fillseq case. op/s is the performance indicator in db_bench, and the table above shows the performance improvements in some db_bench cases.
After analysis, we found that there are two concentrated notify-one calls in the write process. One is in LaunchParallelMemTableWriters, which wakes up other writers to write the memtable, and the other is in ExitAsBatchGroupLeader, which sets all writer states to STATE_COMPLETED. Although writing the memtable is done in parallel, waking up the writers is not, which means that only one writer is responsible for sequentially waking up the other writers. We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we try to optimize the cost of waking up writers. The following are our methods.
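For context, the sequential wake-up described above behaves roughly like the loop below. This is a minimal sketch, not the actual RocksDB code: `Writer`, `SetState`, and the state constant are simplified stand-ins for the corresponding entities in db/write_thread.h.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for the db/write_thread.h entities (illustrative only).
struct Writer { int state = 0; };
enum : int { STATE_PARALLEL_MEMTABLE_WRITER = 1 };
void SetState(Writer* w, int s) { w->state = s; }  // the real call may notify-one a blocked writer

// Baseline behavior described above: the leader alone walks the whole write
// group and wakes every follower one by one, so the wake-up cost grows
// linearly with the group size and is paid entirely by a single thread.
void LaunchParallelMemTableWritersBaseline(std::vector<Writer*>& group) {
  for (size_t i = 1; i < group.size(); ++i) {  // index 0 is the leader itself
    SetState(group[i], STATE_PARALLEL_MEMTABLE_WRITER);
  }
}
```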
Assume that there are currently n threads in total:
1. Parallelize SetState in LaunchParallelMemTableWriters
To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n / n^0.5 = n^0.5 of the remaining writers separately to write to the memtable. This reduces the leader's work from calling SetState on n-1 writers in turn to about 2*(n^0.5) writers, as sketched below.
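A minimal sketch of this fan-out, under assumed simplified types (`Writer`, `SetState`, and the caller state are stand-ins; in the actual patch the slice bookkeeping is carried on the write group rather than stored on each writer):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified stand-ins for the db/write_thread.h entities (illustrative only).
struct Writer {
  int state = 0;
  size_t slice_begin = 0;
  size_t slice_end = 0;  // followers this caller must wake before writing
};
enum : int { STATE_PARALLEL_MEMTABLE_WRITER = 1, STATE_PARALLEL_MEMTABLE_CALLER = 2 };
void SetState(Writer* w, int s) { w->state = s; }  // the real call may notify-one a blocked writer

// Step 1 sketch: the leader (group[0]) promotes ~sqrt(n)-1 followers to
// "callers", each responsible for waking a slice of ~sqrt(n) followers, and
// then wakes only its own slice. Leader work drops from n-1 SetState calls
// to roughly 2*sqrt(n).
void LaunchParallelMemTableWritersFanOut(std::vector<Writer*>& group) {
  const size_t n = group.size();
  const size_t callers = std::max<size_t>(1, static_cast<size_t>(std::sqrt(n)));
  const size_t slice = (n + callers - 1) / callers;  // followers handled per waker
  // Wake the first writer of every slice after the leader's own; each such
  // caller wakes the rest of its slice once it starts running.
  for (size_t begin = slice; begin < n; begin += slice) {
    Writer* caller = group[begin];
    caller->slice_begin = begin + 1;
    caller->slice_end = std::min(begin + slice, n);
    SetState(caller, STATE_PARALLEL_MEMTABLE_CALLER);
  }
  // The leader wakes the remainder of its own slice directly.
  for (size_t i = 1; i < std::min(slice, n); ++i) {
    SetState(group[i], STATE_PARALLEL_MEMTABLE_WRITER);
  }
}
```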
2. Wake up non-STATE_LOCKED_WAITING writers with priority
The last writer first sets the state of the writers that are not in STATE_LOCKED_WAITING to STATE_COMPLETED, and only then calls notify-one for the writers in STATE_LOCKED_WAITING sequentially, so that writers still spinning in the non-blocking AwaitState do not fall into its blocking counterpart at this stage due to the long waiting time. A sketch of this ordering follows.
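A minimal sketch of this ordering change, again with simplified stand-in types; the real ExitAsBatchGroupLeader walks the writer list and must handle writers that are concurrently transitioning into STATE_LOCKED_WAITING:

```cpp
#include <vector>

// Simplified stand-ins for the db/write_thread.h entities (illustrative only).
struct Writer { int state = 0; };
enum : int { STATE_LOCKED_WAITING = 1, STATE_COMPLETED = 2 };
void SetState(Writer* w, int s) { w->state = s; }  // notify-one is only needed for blocked writers

// Step 2 sketch: complete the writers that are still spinning first (cheap,
// no condition-variable notify required), and only then notify the writers
// already blocked in STATE_LOCKED_WAITING, so that fewer spinning writers
// time out of the spin loop and fall into a blocking wait.
void CompleteFollowers(std::vector<Writer*>& followers) {
  std::vector<Writer*> blocked;
  for (Writer* w : followers) {
    if (w->state == STATE_LOCKED_WAITING) {
      blocked.push_back(w);            // defer the expensive notify-one calls
    } else {
      SetState(w, STATE_COMPLETED);    // spinning writers observe this immediately
    }
  }
  for (Writer* w : blocked) {
    SetState(w, STATE_COMPLETED);      // each of these implies a notify-one
  }
}
```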
For example, in the fillrandom case in db_bench, after reaching a certain number of cores, we can see that the score of the original version begins to decline, but with our optimization the score improves significantly compared to the original version. The following shows the one-pager of the core-scaling score (op/s).
A reproduction script:
./db_bench --benchmarks="fillrandom" --threads ${number of active vCPUs} --seed 1708494134896523 --duration 60