feat(meta): split by table according write throughput #15547

Merged
merged 38 commits into main from wallace/split-default-group
May 27, 2024

Conversation

Little-Wallace
Contributor

@Little-Wallace Little-Wallace commented Mar 8, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

about: #15973

When a large MV is being created on a base table (or another MV), the new data may pile up in L0 and cannot be compacted to the base level in time. #13075 proposed a solution: check the average throughput and partition the tables with large throughput and size.

But there are some problems that prevent that PR from working in some cases:

  1. When we create an MV on an existing table, the barrier latency can be high, so it takes a long time to collect a large enough throughput window. By the time a table is chosen for partitioning, a large amount of data may already be waiting in level 0 to be compacted.
  2. feat(storage): optimize data alignment for default compaction group #13075 also cuts a table into a separate SST file if it has enough data in the bottom level, but the same table may only have a small amount of data in level 0. That solution would therefore generate a lot of small files, each containing the data of only one table.
  3. When the throughput decreases, that solution stops partitioning the table, which may slow down compaction.

We have discussed this problem offline and agreed that the tables belonging to a creating MV must be split into an independent group. But that also means we shall merge back the groups that do not receive much write traffic after the MV is created successfully. Until group merge is implemented, this PR can increase the compaction speed for the default group. And although we can split out every state table with large write throughput, it is still better to partition such tables in advance, because a group split only affects data flushed after the split.
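To make the intended heuristic concrete, here is a minimal, self-contained Rust sketch of the per-table partition decision. All names, thresholds, and the config struct are illustrative stand-ins, not this PR's actual code (the real logic lives in the meta node's compaction scheduling):

```rust
use std::collections::HashMap;

/// Illustrative thresholds; the real values come from the compaction config
/// and meta options, not from this struct.
#[derive(Clone, Copy)]
struct PartitionConfig {
    /// Per-task table size (bytes) above which the table gets the full partition count.
    large_table_size: u64,
    /// Per-task table size (bytes) above which a small hybrid partition count is used.
    small_table_size: u64,
    /// Recent write throughput (bytes per checkpoint) above which a table counts as hot.
    throughput_threshold: u64,
    default_partition_count: u32,
    hybrid_partition_count: u32,
}

/// Decide how many vnode partitions each table should be cut into when building
/// a compaction task for the hybrid (default) group.
fn pick_vnode_partitions(
    table_sizes: &HashMap<u32, u64>,
    table_throughput: &HashMap<u32, u64>,
    cfg: PartitionConfig,
) -> HashMap<u32, u32> {
    let mut table_vnode_partition = HashMap::new();
    for (&table_id, &size) in table_sizes {
        let throughput = table_throughput.get(&table_id).copied().unwrap_or(0);
        if size > cfg.large_table_size {
            // Very large table in this task: partition it as heavily as a split group would be.
            table_vnode_partition.insert(table_id, cfg.default_partition_count);
        } else if size > cfg.small_table_size || throughput > cfg.throughput_threshold {
            // Moderately large or write-hot table: a few partitions are enough.
            table_vnode_partition.insert(table_id, cfg.hybrid_partition_count);
        }
        // Small, cold tables stay unpartitioned (they can still get their own SST).
    }
    table_vnode_partition
}

fn main() {
    let sizes = HashMap::from([(1, 600 << 20), (2, 80 << 20), (3, 1 << 20)]);
    let throughput = HashMap::from([(1, 0), (2, 64 << 20), (3, 1 << 10)]);
    let cfg = PartitionConfig {
        large_table_size: 512 << 20,
        small_table_size: 128 << 20,
        throughput_threshold: 32 << 20,
        default_partition_count: 8,
        hybrid_partition_count: 4,
    };
    // Table 1 gets 8 partitions, table 2 gets 4, table 3 is left alone.
    println!("{:?}", pick_vnode_partitions(&sizes, &throughput, cfg));
}
```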

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@Little-Wallace Little-Wallace changed the title split by table according write throughput feat(meta): split by table according write throughput Mar 8, 2024
@Little-Wallace Little-Wallace marked this pull request as ready for review March 12, 2024 15:20
src/meta/src/hummock/manager/mod.rs (outdated review thread, resolved)
1,
params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Contributor

Since we already use size to approximate throughput, I think there is no need to check history_table_throughput here?

If the size of the table in the input is small but history_table_throughput is high, do we need to split it? Is this expected?

Contributor Author

Yes. I think we should split it so that each table has at least one file of its own.
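For context, a small self-contained sketch of how a throughput window like the one in the snippet above could be derived and used; `throughput_window` and `average_throughput` are hypothetical helpers, not this PR's code:

```rust
use std::collections::VecDeque;

/// Number of recent throughput samples to look at: roughly one sample per
/// checkpoint, and a checkpoint happens about every
/// `checkpoint_frequency * barrier_interval_ms` milliseconds.
fn throughput_window(checkpoint_frequency: u64, barrier_interval_ms: u64) -> usize {
    std::cmp::max(1, checkpoint_frequency * barrier_interval_ms / 1000) as usize
}

/// Average write throughput (bytes per sample) over the last `window` samples.
fn average_throughput(history: &VecDeque<u64>, window: usize) -> u64 {
    let n = history.len().min(window);
    if n == 0 {
        return 0;
    }
    history.iter().rev().take(n).sum::<u64>() / n as u64
}

fn main() {
    // e.g. a checkpoint every 10 barriers with a 1s barrier interval => 10 samples.
    let window = throughput_window(10, 1000);
    let history: VecDeque<u64> = (0u64..20).map(|i| (i + 1) << 20).collect();
    println!("window = {window}, avg = {} bytes", average_throughput(&history, window));
}
```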

src/meta/src/hummock/compaction/mod.rs (outdated review thread, resolved)
let total_size = level.total_file_size
+ handlers[upper_level].get_pending_output_file_size(level.level_idx)
- handlers[level_idx].get_pending_output_file_size(level.level_idx + 1);
let output_file_size =
Contributor

Note: this will change the priority of current-level compaction; let us discuss it offline.

Contributor

What is the purpose of this?

Collaborator

Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?

Li0k
Li0k previously requested changes Apr 2, 2024
src/storage/src/hummock/compactor/shared_buffer_compact.rs (outdated review thread, resolved)

gitguardian bot commented Apr 29, 2024

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 9425213 | Triggered | Generic Password | e5b4a02 | e2e_test/source/cdc/cdc.validate.postgres.slt |
| 9425213 | Triggered | Generic Password | 7509db8 | e2e_test/source/cdc/cdc.validate.postgres.slt |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

for sst in &input_ssts.table_infos {
existing_table_ids.extend(sst.table_ids.iter());
if !sst.table_ids.is_empty() {
*table_size_info.entry(sst.table_ids[0]).or_default() +=
Contributor

Why is only table_ids[0] counted? Is this an estimation algorithm?

Contributor Author

Emmm, no. I will fix it. In fact, I split all SSTs by table precisely so that the calculation here is accurate.
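As an illustration of the point made in this reply, a self-contained sketch of per-table size accounting; `SstInfo` and `collect_table_sizes` are hypothetical stand-ins for the types in the snippet. When every SST holds a single table, attributing the whole file size to `table_ids[0]` is exact; the even-split branch below is only a fallback estimate for mixed SSTs:

```rust
use std::collections::{HashMap, HashSet};

struct SstInfo {
    table_ids: Vec<u32>,
    file_size: u64,
}

fn collect_table_sizes(input_ssts: &[SstInfo]) -> (HashSet<u32>, HashMap<u32, u64>) {
    let mut existing_table_ids = HashSet::new();
    let mut table_size_info: HashMap<u32, u64> = HashMap::new();
    for sst in input_ssts {
        existing_table_ids.extend(sst.table_ids.iter().copied());
        if sst.table_ids.len() == 1 {
            // Exact: the SST contains data of exactly one table.
            *table_size_info.entry(sst.table_ids[0]).or_default() += sst.file_size;
        } else if !sst.table_ids.is_empty() {
            // Fallback estimate for mixed SSTs: spread the file size evenly.
            let share = sst.file_size / sst.table_ids.len() as u64;
            for &table_id in &sst.table_ids {
                *table_size_info.entry(table_id).or_default() += share;
            }
        }
    }
    (existing_table_ids, table_size_info)
}

fn main() {
    let ssts = vec![
        SstInfo { table_ids: vec![1], file_size: 64 << 20 },
        SstInfo { table_ids: vec![2, 3], file_size: 32 << 20 },
    ];
    let (ids, sizes) = collect_table_sizes(&ssts);
    println!("{ids:?} {sizes:?}");
}
```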

1,
params.checkpoint_frequency() * barrier_interval_ms / 1000,
);
let history_table_throughput = self.history_table_throughput.read();
Contributor

Please add some documentation.

src/meta/src/hummock/manager/compaction.rs (review thread, resolved)
src/meta/src/hummock/manager/compaction.rs (review thread, resolved)
self.last_table_id = user_key.table_id.table_id;
self.split_weight_by_vnode = 0;
self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
if let Some(builder) = self.current_builder.as_ref()
Contributor

If we introduce this logic, it may produce more 4MB SSTs when the table_id switches. This seems to be contrary to #16495 and will bring more SSTs at high levels.

PTAL @hzxa21 @zwang28

Contributor Author

But it only splits by table, not by vnode partition. I think that is acceptable.

Comment on lines 313 to 322
if compact_table_size > compaction_config.max_compaction_bytes / 2 {
compact_task
.table_vnode_partition
.insert(table_id, default_partition_count);
Collaborator

It took me a while to understand why we assign default_partition_count for a table with size > max_compaction_bytes / 2. It is because we assign default_partition_count for a split compaction group, and we want to treat a table with a large size in the hybrid group's task the same way. This can easily confuse others. Can we add some comments?

Contributor Author

It is just a magic number; I need some value to decide whether to partition a large task.
But how do we decide whether a task is a large task?
I could use another value instead, such as default_partition_count * compaction_config.target_file_size_base.
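For a rough sense of scale only (these defaults are illustrative, not the project's actual values): with default_partition_count = 8 and target_file_size_base = 32 MiB, the alternative threshold would be 8 × 32 MiB = 256 MiB, whereas max_compaction_bytes / 2 would be 1 GiB if max_compaction_bytes were 2 GiB.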

Comment on lines 317 to 337
} else if compact_table_size > compaction_config.sub_level_max_compaction_bytes
|| (compact_table_size > compaction_config.target_file_size_base
&& write_throughput > self.env.opts.table_write_throughput_threshold)
{
// partition for large write throughput table.
compact_task
.table_vnode_partition
.insert(table_id, hybrid_vnode_count);
}
}
Collaborator

Same here. Please add some comments here. Also, why do we use sub_level_max_compaction_bytes and target_file_size_base here?

Contributor Author

OK, I will add some comments.

Comment on lines 313 to 337
if compact_table_size > compaction_config.max_compaction_bytes / 2 {
compact_task
.table_vnode_partition
.insert(table_id, default_partition_count);
} else if compact_table_size > compaction_config.sub_level_max_compaction_bytes
|| (compact_table_size > compaction_config.target_file_size_base
&& write_throughput > self.env.opts.table_write_throughput_threshold)
{
// partition for large write throughput table.
compact_task
.table_vnode_partition
.insert(table_id, hybrid_vnode_count);
}
}
Collaborator

Should we make sure this strategy only affects L0 compaction tasks but not other compaction tasks? Otherwise, I think we can easily create small files in bottom levels where compact_table_size is generally large.

Contributor Author

For L0 and the base level.
Not other levels, because we clear table_vnode_partition later in the code.
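To illustrate this reply, a rough sketch (with a hypothetical `CompactTask` type, not the real proto) of keeping the per-table vnode partition map only for tasks that write to L0 or the base level and clearing it otherwise:

```rust
use std::collections::HashMap;

struct CompactTask {
    target_level: u32,
    base_level: u32,
    table_vnode_partition: HashMap<u32, u32>,
}

fn finalize_vnode_partition(task: &mut CompactTask) {
    let writes_to_l0_or_base = task.target_level == 0 || task.target_level == task.base_level;
    if !writes_to_l0_or_base {
        // For deeper levels the partition hints would only fragment large data,
        // so drop them before the task is dispatched to the compactor.
        task.table_vnode_partition.clear();
    }
}

fn main() {
    let mut task = CompactTask {
        target_level: 4,
        base_level: 3,
        table_vnode_partition: HashMap::from([(42, 8)]),
    };
    finalize_vnode_partition(&mut task);
    assert!(task.table_vnode_partition.is_empty());
    println!("partition hints cleared for a non-base-level task");
}
```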

src/storage/src/hummock/sstable/multi_builder.rs (outdated review thread, resolved)
@Little-Wallace
Contributor Author

Same question here. Why do we ignore handlers[upper_level].get_pending_output_file_size(level.level_idx) here?

I refactored this code before, although RocksDB only subtracts the pending input file size from the current level. But I found that it is not a good idea, because it would compact data earlier, before the pending tasks have really changed the shape of the LSM tree.
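For readers following this thread, a toy comparison (illustrative names only, not the PR's code) of the two ways a level's effective size can be estimated when scoring its compaction priority:

```rust
struct LevelState {
    total_file_size: u64,
    /// Bytes that pending upper-level tasks will write into this level.
    pending_input_from_upper: u64,
    /// Bytes of this level's files already picked by pending tasks targeting the next level.
    pending_output_to_lower: u64,
}

/// Only subtract what is about to leave the level (RocksDB-style).
fn effective_size_conservative(l: &LevelState) -> u64 {
    l.total_file_size.saturating_sub(l.pending_output_to_lower)
}

/// Also add what pending upper-level tasks will deliver; this raises the score
/// earlier, before those tasks have actually changed the LSM shape.
fn effective_size_eager(l: &LevelState) -> u64 {
    (l.total_file_size + l.pending_input_from_upper).saturating_sub(l.pending_output_to_lower)
}

fn main() {
    let l = LevelState {
        total_file_size: 10 << 30,
        pending_input_from_upper: 2 << 30,
        pending_output_to_lower: 1 << 30,
    };
    println!(
        "conservative = {} GiB, eager = {} GiB",
        effective_size_conservative(&l) >> 30,
        effective_size_eager(&l) >> 30
    );
}
```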

src/storage/src/hummock/sstable/multi_builder.rs (outdated review thread, resolved)
src/storage/src/hummock/sstable/builder.rs (outdated review thread, resolved)
if let Some(table_size) = table_size_infos.get(table_id)
&& *table_size > min_sstable_size
{
table_vnode_partition.insert(*table_id, 1);
Collaborator

Is it beneficial to change shared_buffer_compact to enable table split? Given that tier compaction is a must and it takes the whole overlapping level as input, splitting the CN SSTs seems unnecessary?

Contributor Author

It is split so that the size of each state table in an SST can be calculated more accurately.

Contributor

Why do we need to introduce a new config min_sstable_size instead of target_file_size_base or sstable_size?

Contributor Author

Because the compute node cannot know target_file_size_base.
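A minimal sketch of the compute-node-side decision this thread discusses, assuming a hypothetical `plan_flush_partitions` helper: during the shared-buffer flush, a table is given its own partition only if it contributed more than `min_sstable_size` bytes, since the compute node has no access to the meta-side `target_file_size_base`:

```rust
use std::collections::HashMap;

fn plan_flush_partitions(
    table_size_infos: &HashMap<u32, u64>,
    min_sstable_size: u64,
) -> HashMap<u32, u32> {
    let mut table_vnode_partition = HashMap::new();
    for (&table_id, &table_size) in table_size_infos {
        if table_size > min_sstable_size {
            // One partition per table: the table's data goes to its own SST,
            // which keeps per-table size stats in SST metadata exact.
            table_vnode_partition.insert(table_id, 1);
        }
    }
    table_vnode_partition
}

fn main() {
    let sizes = HashMap::from([(1, 48 << 20), (2, 1 << 20)]);
    // With a 32 MiB threshold, only table 1 is split into its own SST.
    println!("{:?}", plan_flush_partitions(&sizes, 32 << 20));
}
```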

src/storage/src/hummock/compactor/shared_buffer_compact.rs (outdated review thread, resolved)
src/meta/src/hummock/manager/compaction.rs (outdated review thread, resolved)
@Little-Wallace Little-Wallace dismissed Li0k’s stale review May 20, 2024 06:02

We have addressed this request elsewhere.

Collaborator

@hzxa21 hzxa21 left a comment

Rest LGTM

Comment on lines 40 to 41
| hybrid_few_partition_threshold | | 134217728 |
| hybrid_more_partition_threshold | | 536870912 |
Collaborator

I believe no one will understand what these two configs mean just by looking at the names.
How about:
compact_task_table_size_split_threshold_low
compact_task_table_size_split_threshold_high

Also, let's fill in the description column here in docs.md to explain what these two configs mean.

Contributor Author

How about compact_task_table_size_partition_threshold_low? Because we do not exactly split these tables.
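For example, the renamed rows in docs.md could read as follows; the descriptions are only a suggestion, not this PR's final wording:

| compact_task_table_size_partition_threshold_low | Per-table size in a compaction task above which the table is cut into a few vnode partitions. | 134217728 |
| compact_task_table_size_partition_threshold_high | Per-table size in a compaction task above which the table is cut into the default (larger) partition count. | 536870912 |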

src/meta/src/hummock/manager/compaction.rs (outdated review thread, resolved)
src/storage/src/opts.rs (review thread, resolved)
src/common/src/config.rs (outdated review thread, resolved)
@Little-Wallace Little-Wallace added this pull request to the merge queue May 27, 2024
Merged via the queue into main with commit ac93e24 May 27, 2024
27 of 29 checks passed
@Little-Wallace Little-Wallace deleted the wallace/split-default-group branch May 27, 2024 15:00