
Optimizations in notify-one #12545

Open · wants to merge 1 commit into main from thread-opt
Conversation

casualwind

We tested on an Ice Lake server (vcpu = 160). The default configuration is allow_concurrent_memtable_write=1 with the thread count equal to the number of active cores. With our optimizations, the improvement reaches up to 184% in the fillseq case. Using op/s from db_bench as the performance indicator, the following are the performance improvements in several db_bench cases.

| case name         | optimized/original |
|-------------------|--------------------|
| fillrandom        | 182%               |
| fillseq           | 184%               |
| fillsync          | 136%               |
| overwrite         | 179%               |
| randomreplacekeys | 180%               |
| randomtransaction | 161%               |
| updaterandom      | 163%               |
| xorupdaterandom   | 165%               |

After analysis, we found that there are two concentrated notify-one hot spots in the write process. One is in LaunchParallelMemTableWriters, which wakes up the other writers to write their memtables, and the other is in ExitAsBatchGroupLeader, which sets all writer states to STATE_COMPLETED. Although writing the memtable itself is done in parallel, waking up the writers is not: a single writer is responsible for sequentially waking up all the others. We found that for writers in STATE_LOCKED_WAITING, the notify-one function must be called, and this call is very expensive, especially when many writers need to be awakened. So we try to optimize the cost of waking up writers. The following are our methods.
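For context, both hot spots share the same baseline shape: one thread walks the whole write group and issues notify-one calls back to back. A simplified sketch of that shape (WriterNode, kParallelWriter, and LaunchSequential are illustrative stand-ins, not RocksDB's actual types):

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>

struct WriterNode {
  std::mutex mu;
  std::condition_variable cv;
  int state = 0;

  void SetState(int new_state) {
    {
      std::lock_guard<std::mutex> lk(mu);
      state = new_state;
    }
    cv.notify_one();  // expensive when the target writer is blocked
  }
};

constexpr int kParallelWriter = 1;  // stand-in for a real writer state

// The leader alone wakes the whole group: n-1 sequential SetState
// calls, each potentially paying a mutex lock plus a notify_one.
void LaunchSequential(std::vector<WriterNode*>& group) {
  for (WriterNode* w : group) {
    w->SetState(kParallelWriter);
  }
}
```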

Assume that there are currently n threads in total:

1. Parallelize SetState in LaunchParallelMemTableWriters

To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) "caller" writers; then those callers and the leader each wake up roughly n/(n^0.5) = n^0.5 writers, which go on to write the memtable. This reduces the number of writers the leader has to SetState in turn from n - 1 to about 2*(n^0.5). A runnable simulation of this two-level wake-up follows.
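Below is a minimal, self-contained simulation of the idea (not the actual RocksDB code; WriterSim, the slice width, and the caller selection are illustrative). With n = 16, the leader issues 6 SetState calls instead of 15:

```cpp
#include <algorithm>
#include <cmath>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct WriterSim {
  std::mutex mu;
  std::condition_variable cv;
  bool go = false;

  void Await() {
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [this] { return go; });
  }
  void SetState() {  // the notify-one whose cost the PR spreads out
    {
      std::lock_guard<std::mutex> lk(mu);
      go = true;
    }
    cv.notify_one();
  }
};

int main() {
  const size_t n = 16;  // total writers; index 0 plays the leader
  const size_t stride = static_cast<size_t>(std::sqrt(static_cast<double>(n)));
  std::vector<WriterSim> writers(n);

  std::vector<std::thread> threads;
  for (size_t i = 1; i < n; ++i) {
    threads.emplace_back([&writers, i, stride, n] {
      writers[i].Await();
      if (i % stride == 0) {
        // This writer is a "caller": before writing its own memtable,
        // it wakes the remaining writers of its slice.
        for (size_t j = i + 1; j < std::min(i + stride, n); ++j) {
          writers[j].SetState();
        }
      }
      std::printf("writer %zu writes its memtable\n", i);
    });
  }

  // The leader wakes one caller per slice (n^0.5 - 1 of them), then the
  // writers in its own slice: about 2*(n^0.5) SetState calls, not n-1.
  for (size_t i = stride; i < n; i += stride) {
    writers[i].SetState();
  }
  for (size_t j = 1; j < stride; ++j) {
    writers[j].SetState();
  }

  for (auto& t : threads) {
    t.join();
  }
  return 0;
}
```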

2. Wake up non-STATE_LOCKED_WAITING writers with priority

The last writer first sets the state of the writers that are not in STATE_LOCKED_WAITING to STATE_COMPLETED (a cheap state store with no notification), and only then calls notify-one sequentially for the writers that are in STATE_LOCKED_WAITING. This way, writers that are still in the non-blocking wait are released quickly and do not fall into the blocking wait at this stage because of a long waiting time. A sketch of this two-pass completion follows.
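A hypothetical two-pass sketch of this priority (not RocksDB's actual ExitAsBatchGroupLeader; Writer, kSpinning, kLockedWaiting, and kCompleted are illustrative stand-ins):

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vector>

enum : uint8_t { kSpinning = 0, kLockedWaiting = 1, kCompleted = 2 };

struct Writer {
  std::atomic<uint8_t> state{kSpinning};
  std::mutex mu;
  std::condition_variable cv;

  void NotifyCompleted() {  // the expensive notify-one path
    std::lock_guard<std::mutex> lk(mu);
    cv.notify_one();
  }
};

void CompleteGroup(std::vector<Writer*>& group) {
  // Pass 1: writers still spinning only need an atomic store; they see
  // kCompleted on their next spin iteration and never pay notify-one.
  for (Writer* w : group) {
    uint8_t expected = kSpinning;
    w->state.compare_exchange_strong(expected, kCompleted);
  }
  // Pass 2: writers that already fell into the blocking wait
  // (STATE_LOCKED_WAITING in RocksDB) need mutex + notify-one.
  for (Writer* w : group) {
    if (w->state.load(std::memory_order_acquire) == kLockedWaiting) {
      w->state.store(kCompleted, std::memory_order_release);
      w->NotifyCompleted();
    }
  }
}
```

The design point is the cost asymmetry: releasing a spinning writer is one store, while releasing a blocked one costs a lock and a condition-variable signal, so handling the cheap cases first shortens everyone's wait.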

For example, in the fillrandom case in db_bench, after reaching a certain number of cores, the score of the original version begins to decline, while with our optimization the score improves significantly over the original version. The following shows the one-pager of the core-scaling score (op/s).

A reproduction script:
./db_bench --benchmarks="fillrandom" --threads ${number of all active vcpus} --seed 1708494134896523 --duration 60

[Figure: core-scaling score (op/s), original vs. optimized]

@facebook-github-bot
Contributor

Hi @casualwind!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@casualwind changed the title from "Thread opt" to "Optimizations in notify-one to improve the performance" on Apr 16, 2024
@casualwind changed the title from "Optimizations in notify-one to improve the performance" to "Optimizations in notify-one to" on Apr 16, 2024
@casualwind changed the title from "Optimizations in notify-one to" to "Optimizations in notify-one" on Apr 16, 2024
@casualwind
Author

Hello, I have signed the CLA, and this is my first PR to RocksDB. Thank you for any review!

@casualwind
Author

@ajkr Could you or your colleagues help review my PR? I think this kind of modification may help performance on servers with high core counts. If you have any questions, please let me know. Thank you.

@cbi42
Member

cbi42 commented May 10, 2024

Hi, thanks for the PR. The change makes sense to me. I wonder if you have results for the performance improvement of each step (1. parallelize SetState vs. 2. wake up non-STATE_LOCKED_WAITING first). I did some benchmarking with fillrandom, and I saw that 1. improves performance but 2. does not show much improvement.

Command:
./db_bench --benchmarks=fillrandom[-X5] --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

main:
fillrandom [AVG    5 runs] : 130401 (± 535) ops/sec;   14.4 (± 0.1) MB/sec
fillrandom [MEDIAN 5 runs] : 130207 ops/sec;   14.4 MB/sec

this PR: 
fillrandom [AVG    5 runs] : 282247 (± 70009) ops/sec;   31.2 (± 7.7) MB/sec
fillrandom [MEDIAN 5 runs] : 317884 ops/sec;   35.2 MB/sec

this PR without 2.
fillrandom [AVG    5 runs] : 281672 (± 70145) ops/sec;   31.2 (± 7.8) MB/sec
fillrandom [MEDIAN 5 runs] : 317164 ops/sec;   35.1 MB/sec

@casualwind
Author

casualwind commented May 13, 2024

@cbi42 Thank you for your review!
For each step (1. parallelize SetState, 2. wake up non-STATE_LOCKED_WAITING first):

We had tested the performance before (with the default --enable_pipelined_write=1), and step 2's impact on performance is indeed close to none.

We put them together because in our previous tests (on a main branch that was not the newest), with --enable_pipelined_write=0, step 1 + step 2 performed better than step 1 alone. Now that we have tested on the latest main branch, step 2 has no impact on performance in either case.

So currently we only need step 1 rather than step 1 + step 2.

@casualwind
Author

@cbi42
We also have a question about the following two commands:

Command 1:
./db_bench --benchmarks=fillrandom[-X5] --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

Command 2:
./db_bench --benchmarks=fillrandom --threads=160 --seed=1708494134896523 --duration=30 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=0

When we run command 1: stdev/average is nearly 25%.
When we run command 2 five times: stdev/average is under 1%.

We wonder why the gap between them is so large.

@cbi42
Member

cbi42 commented May 14, 2024

> We wonder why the gap between them is so large.

I just realized that fillrandom[-X5] will reuse the same DB, so it's probably better to just run fillrandom without the [-X5].

@casualwind
Author

@cbi42 Thank you for clearing up our doubts.


// The minimum group size to allow the group to use parallel caller mode.
// The number must be no lower than 3.
const size_t MinParallelSize = 5;
Member

Consider setting a higher MinParallelSize to avoid performance regression. From the graph in the summary, it seems the optimization helps when there are close to 40 threads. With ./db_bench --benchmarks=fillrandom[-X1] --threads=40 --seed=1708494134896523 --duration=10 --disable_auto_compactions=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --statistics=1 --disable_wal=0 --enable_pipelined_write=1 --num=100000000 --batch_size=1 and logging write group size in statistics, I got the following write group size distribution:
P50 : 17.658324 P95 : 29.367381 P99 : 35.859741 P100 : 40.000000 COUNT : 650853 SUM : 10974960
Maybe we can set the threshold to around 20?

Author

The threshold has been changed to 20. I will re-check the one-pager of the core-scaling score with the newest version.

@casualwind force-pushed the thread-opt branch 2 times, most recently from 55e2bf6 to af1fac6 on May 15, 2024 08:39
@cbi42
Member

cbi42 commented May 16, 2024

LGTM. Could you update the PR summary with only step 1 and its benchmark results, and add a change log entry under https://github.com/facebook/rocksdb/tree/main/unreleased_history/performance_improvements?

@facebook-github-bot
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@casualwind has updated the pull request. You must reimport the pull request before landing.

@casualwind
Author

@cbi42 We have updated the PR summary and added a change log.

We found that for writers in STATE_LOCKED_WAITING, the notify-one function needs to be called, and the cost of calling this function is very high, especially when there are many writers that need to be awakened. So we parallelize this process.

To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the number of writers the leader has to SetState in turn from n - 1 to about 2*(n^0.5).

vcpu=160, benchmark=db_bench
The score is normalized:
| case name         | optimized/base |
|-------------------|----------------|
| fillrandom        | 182%           |
| fillseq           | 184%           |
| fillsync          | 136%           |
| overwrite         | 179%           |
| randomreplacekeys | 180%           |
| randomtransaction | 161%           |
| updaterandom      | 163%           |
| xorupdaterandom   | 165%           |