Multi tree column option #232

MattHalpinParity · 2023-10-26T10:41:27Z

Issue #199
This introduces a multitree option for columns with related tree commit operations as well as querying functions.


cheme · 2023-10-30T11:01:55Z

I started looking at the PR, but it is a big one (I have not looked at the bench part yet), so I am summing up my initial questions here.

TreeViewer and defer mechanism. I am not too sure it is strictly needed, but I would probably prefer it to be in a follow-up PR (even in a different version).
Reference count cache. Same here: having it in a different PR could be good. Otherwise, is there a risk that it takes too much memory (I mean, it should be limited in size)? Also, would it be simpler to just use mmap over its file?
Reference count table. I am not sure why it is kept at the Index table level and not at the Value table level. Having it at the Value table level would mean multiple files, but no reindexing.
Do all nodes need to be hash indexed? (On paper I think only the root needs to be, but I am not too sure of the implementation in front of me, and it would mean duplicating a few branches.)
The FreeEntries struct looks like an optimization (I did not understand the use of the ordered tree, though), so it may be possible to extract it to a follow-up PR (just to reduce the size of this PR). It could also be optionally configured at the column level.
Not a question, but generally there are many "assert" calls in the code; I would suggest switching them to "debug_assert" or returning an error instead.

MattHalpinParity · 2023-10-31T11:41:20Z

Thanks for the comment. Will try and answer.

> I started looking at the PR, but it is a big one (I have not looked at the bench part yet), so I am summing up my initial questions here.

> TreeViewer and defer mechanism. I am not too sure it is strictly needed, but I would probably prefer it to be in a follow-up PR (even in a different version).

The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.
The exception is if you’re in append only mode (e.g., for an archive node). In this case you can access node addresses directly because they’ll never move.
The deferral happens if someone dereferences a tree that is currently being used by someone else to build a commit. We need to make sure the tree isn’t removed while it is being used.
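A minimal sketch of the deferral idea described above, assuming a per-tree count of live readers. `TreeRegistry`, `TreeId`, and every method name here are hypothetical and are not taken from the PR; the point is only that a dereference issued while readers are active gets queued and is released when the last reader closes.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical identifier for a tree (for illustration only).
type TreeId = u64;

/// Hypothetical registry of live readers and deferred removals.
#[derive(Default)]
struct TreeRegistry {
    /// Number of live readers per tree.
    readers: HashMap<TreeId, usize>,
    /// Trees that were dereferenced while readers were still active.
    deferred: HashSet<TreeId>,
}

impl TreeRegistry {
    fn open_reader(&mut self, tree: TreeId) {
        *self.readers.entry(tree).or_insert(0) += 1;
    }

    /// Returns true if the tree can be removed immediately,
    /// false if the removal had to be deferred.
    fn dereference(&mut self, tree: TreeId) -> bool {
        if self.readers.get(&tree).copied().unwrap_or(0) > 0 {
            self.deferred.insert(tree);
            false
        } else {
            true
        }
    }

    /// Returns true if closing this reader releases a deferred removal.
    fn close_reader(&mut self, tree: TreeId) -> bool {
        let remaining = match self.readers.get_mut(&tree) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return false,
        };
        if remaining == 0 {
            self.readers.remove(&tree);
            self.deferred.remove(&tree)
        } else {
            false
        }
    }
}
```

In append only mode none of this bookkeeping would be needed, since addresses never move.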

> Reference count cache. Same here: having it in a different PR could be good. Otherwise, is there a risk that it takes too much memory (I mean, it should be limited in size)? Also, would it be simpler to just use mmap over its file?

The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).
The cache is there because the table requires a search of a whole chunk to determine if a reference count is not there (which is going to be the most common case).
Certainly we will need to check the size of the cache is ok. Archive nodes won’t have any of this ref counting stuff as they’ll be in append only mode. So the ref count table will have to deal with the window of active trees in a full node.

> Reference count table. I am not sure why it is kept at the Index table level and not at the Value table level. Having it at the Value table level would mean multiple files, but no reindexing.

Yeah, this is just an optimization so we don’t have to store (and access) ref counts for most of the nodes at all. The first implementation just used the already existing ref counting that Value tables have.

> Do all nodes need to be hash indexed? (On paper I think only the root needs to be, but I am not too sure of the implementation in front of me, and it would mean duplicating a few branches.)

Only the roots have entries in the index table.

> The FreeEntries struct looks like an optimization (I did not understand the use of the ordered tree, though), so it may be possible to extract it to a follow-up PR (just to reduce the size of this PR). It could also be optionally configured at the column level.

When committing a tree it ‘claims’ value table entries immediately (in commit), which can be done using the table header data and the free list without reading or writing any table data or log overlays. You’re right that the ordered list isn’t actually used yet! That would be used in the future for optimizing trees by laying out nodes contiguously.
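A rough sketch of what claiming entries from an in-memory free list could look like. Only the `stack` field name appears in the diff excerpts of this PR; the `filled` counter and the `claim` method are assumptions made for illustration.

```rust
/// Hypothetical in-memory view of a value table's free list.
struct FreeEntries {
    /// Indexes of removed entries, most recently removed last.
    stack: Vec<u64>,
    /// Next never-used index (assumed to mirror a count kept in the table header).
    filled: u64,
}

impl FreeEntries {
    /// Claim `count` entry indexes without touching table data:
    /// reuse freed entries first, then grow past `filled`.
    fn claim(&mut self, count: usize) -> Vec<u64> {
        let mut claimed = Vec::with_capacity(count);
        for _ in 0..count {
            let index = match self.stack.pop() {
                Some(free) => free,
                None => {
                    let next = self.filled;
                    self.filled += 1;
                    next
                }
            };
            claimed.push(index);
        }
        claimed
    }
}
```

Because the claim only mutates in-memory state, a commit can hand out node addresses before any log or table write happens.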

> Not a question, but generally there are many "assert" calls in the code; I would suggest switching them to "debug_assert" or returning an error instead.

Thanks, I’ll have a look at that!

cheme · 2023-10-31T15:05:33Z

thanks for the replies,

> The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.

> We need to make sure the tree isn’t removed while it is being used.

I don't know if we need this, an error sounds fine to me.

But OK, I guess I got it: if a user tries to access a given address A and in between it got deleted and used again (very likely, due to the way we reuse addresses), then they will get an incorrect result without knowing it.

I was just thinking it would get an error but I was wrong.

I guess the tree viewer makes sense (I could think of checking root presence in some cache every time we access a node, but it is certainly better to just update an atomic on the reader when we drop the tree, and then return an error from get_node if the tree viewer is no longer valid).

So for a start I would find it easier to go with a tree viewer that does not defer commits; a commit would just make all tree viewer sessions invalid.
From a user point of view it is certainly not as good, but our use case is just merkle trie node pruning, so it doesn't sound bad to me.
The advantage would be simpler initial code.
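The invalidation approach could be sketched roughly as follows. This is an illustration only: `TreeViewer`, `ReadError`, and `invalidate` are hypothetical names, and the value table read is replaced by a stand-in.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq)]
enum ReadError {
    TreeRemoved,
}

/// Hypothetical reader handle; the flag is shared with the database.
struct TreeViewer {
    valid: Arc<AtomicBool>,
}

impl TreeViewer {
    fn get_node(&self, address: u64) -> Result<u64, ReadError> {
        // Stand-in for the actual value table read at `address`.
        let value = address;
        // Check after the read, so a removal that raced with the read is reported.
        if !self.valid.load(Ordering::Acquire) {
            return Err(ReadError::TreeRemoved);
        }
        Ok(value)
    }
}

/// Called when the tree is dropped; every viewer sharing the flag becomes invalid.
fn invalidate(flag: &AtomicBool) {
    flag.store(false, Ordering::Release);
}
```

This only detects stale reads if the flag is cleared before the tree's entries are released for reuse; that ordering is an assumption of the sketch.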

> The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).

I did not think about it, can be sparse indeed.

> The first implementation just used the already existing ref counting that Value tables have.

I understand that putting the reference counts in their own table can be good, but I would have expected the table to be indexed identically to the value tables. Is it?

> Only the roots have entries in the index table.

Then what happens for the reference count? I am quite sure I don't get the reference count model, or the need to reindex it.
Or do we just use reference counts for roots?

> That would be used in the future for optimizing trees by laying out nodes contiguously.

👍

MattHalpinParity · 2023-11-01T16:22:12Z

> thanks for the replies,
>
> > The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.
>
> > We need to make sure the tree isn’t removed while it is being used.
>
> I don't know if we need this, an error sounds fine to me.
>
> But OK, I guess I got it: if a user tries to access a given address A and in between it got deleted and used again (very likely, due to the way we reuse addresses), then they will get an incorrect result without knowing it.
>
> I was just thinking it would get an error but I was wrong.
>
> I guess the tree viewer makes sense (I could think of checking root presence in some cache every time we access a node, but it is certainly better to just update an atomic on the reader when we drop the tree, and then return an error from get_node if the tree viewer is no longer valid).
>
> So for a start I would find it easier to go with a tree viewer that does not defer commits; a commit would just make all tree viewer sessions invalid. From a user point of view it is certainly not as good, but our use case is just merkle trie node pruning, so it doesn't sound bad to me. The advantage would be simpler initial code.

Right. The client would have to cope with potentially failed commits? I think the Db would still need to track which TreeReaders were active when a dereference happened.

> > The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).
>
> I did not think about it, can be sparse indeed.
>
> > The first implementation just used the already existing ref counting that Value tables have.
>
> I understand that putting the reference counts in their own table can be good, but I would have expected the table to be indexed identically to the value tables. Is it?

It uses a hash of the Value table address. This is because only a small proportion of nodes need a reference count. So didn’t want a ref count entry for every node.

> > Only the roots have entries in the index table.
>
> Then what happens for the reference count? I am quite sure I don't get the reference count model, or the need to reindex it. Or do we just use reference counts for roots?

Roots act like a standard Hash column entry. They have an index table entry and a Value table entry with potentially ref counting or not depending on if the client code wanted ref counted roots.
Other nodes only get a Value table entry and a ref count table entry. The reason for reindexing is because only some nodes need an entry in the ref count table (as it only stores > 1 ref count) so the ref count table isn’t the same size as the Value table and could grow in size differently. It is also randomly accessed using hashes because the entries that need a ref count are sparse. This also means reindexing is needed when growing.
Wrt the ref counting model, this is what it does: When committing a new tree all nodes get ref count 1. If a node uses a child node that exists from another tree then the child ref count is incremented.
Then on tree deletion it decrements ref count and if it hits 0 it removes the node and decrements ref counts of all children. This recurses as needed.
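The ref counting model described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the PR's implementation: `Store`, its fields, and its methods are hypothetical, and hash maps stand in for the Value table and the ref count table. Counts of 1 are implicit, and only counts greater than 1 are stored.

```rust
use std::collections::HashMap;

type Address = u64;

#[derive(Default)]
struct Store {
    /// Node address -> child addresses (stand-in for the Value table).
    nodes: HashMap<Address, Vec<Address>>,
    /// Sparse counts: only entries greater than 1 are stored.
    ref_counts: HashMap<Address, u64>,
}

impl Store {
    /// Insert a new node with an implicit ref count of 1.
    fn insert(&mut self, address: Address, children: Vec<Address>) {
        self.nodes.insert(address, children);
    }

    /// A new tree reuses a node that already exists in another tree.
    fn add_reference(&mut self, address: Address) {
        *self.ref_counts.entry(address).or_insert(1) += 1;
    }

    /// Drop one reference; remove the node and recurse when the count hits 0.
    fn dereference(&mut self, address: Address) {
        let count = self.ref_counts.get(&address).copied().unwrap_or(1);
        match count - 1 {
            0 => {
                if let Some(children) = self.nodes.remove(&address) {
                    for child in children {
                        self.dereference(child);
                    }
                }
            }
            1 => {
                // Back to the implicit count of 1: drop the sparse entry.
                self.ref_counts.remove(&address);
            }
            n => {
                self.ref_counts.insert(address, n);
            }
        }
    }
}
```

With two roots sharing one leaf, removing the first root leaves the leaf in place with no sparse entry; removing the second root removes the leaf as well.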

> > That would be used in the future for optimizing trees by laying out nodes contiguously.
>
> 👍

cheme

I think I finally get how the rc table works.
I tried to slim down the comments I made at the time, especially the incorrect ones, but some may have slipped through.

cheme · 2023-10-26T10:48:22Z

admin/src/lib.rs

+				db_options.columns.push(info_column);
+			}
+
+			let info_column = &mut db_options.columns[1];


Could use multitree_bench::INFO_COLUMN rather than 1 and 2.

cheme · 2023-10-27T13:44:22Z

src/db.rs

+#[derive(Debug, PartialEq, Eq)]
+pub enum NodeChange {
+	/// (address, value, compressed value, compressed)
+	NewValue(u64, RcValue, RcValue, bool),


Isn't compressed defined per column? (If so, it would be better not to have it here.)

This is whether the specific value was actually compressed or not (eg, if compressed length isn't smaller than the initial value then it is not compressed).
Though values are never compressed right now as discussed elsewhere so I could just remove it for now and store a single value.

cheme · 2023-10-27T14:30:05Z

src/table.rs

@@ -698,6 +749,11 @@ impl ValueTable {
 				last_removed,
 			);
 			self.last_removed.store(next_removed, Ordering::Relaxed);
+			if let Some(mut free_entries) = free_entries_guard {
+				let last = free_entries.stack.pop().unwrap();
+				assert_eq!(last, last_removed);


Wouldn't it be possible to skip reading last_removed here (and remove the assert_eq)?
Otherwise I would replace assert_eq with a debug_assert here.

Is this actually used (or is claim_next_free mainly used)?

Made it debug_assert. next_free does need to maintain the free list correctly as root nodes can share value tables with child nodes and root nodes still use standard committing (with next_free).

cheme · 2023-10-27T14:35:33Z

src/table.rs

+				let last_removed = self.last_removed.load(Ordering::Relaxed);
+				let index = if last_removed != 0 {
+					let last = free_entries.stack.pop().unwrap();
+					assert_eq!(last, last_removed);


Suggested change

-					assert_eq!(last, last_removed);
+					debug_assert!(last == last_removed);

cheme · 2024-01-07T12:06:13Z

src/db.rs

+							}
+						}
+					}
+					// TODO: Remove TreeReader from Db.


Not too sure about the meaning of this TODO?

MattHalpinParity · 2024-01-07T21:54:02Z

> I think I finally get how the rc table works. I tried to slim down the comments I made at the time, especially the incorrect ones, but some may have slipped through.

Thanks. Will have a look at them and fix/reply.

arkpar · 2024-01-08T10:51:45Z

admin/src/multitree_bench/mod.rs

Ideally this should've been a --tree option to the stress command, rather than a separate command. I'm fine with this version for now, but consider merging them in the future.

arkpar · 2024-01-09T11:23:29Z

src/column.rs

+	1 + data.len() + num_children as usize * 8
+}
+
+pub fn pack_node_data(data: Vec<u8>, child_data: Vec<u8>, num_children: u8) -> Vec<u8> {


Right, the optimal layout should be [data][children][num_children]. This way unpacking function can just shrink the input vec and avoid a memmove. Packing would also be appending to data and returning it.
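A minimal sketch of the suggested [data][children][num_children] layout, assuming 8-byte little-endian child addresses as in the diff excerpts. The signatures below are illustrative and differ from the PR's `pack_node_data`; the point is that packing appends to `data` and unpacking truncates the input vec instead of moving the data bytes.

```rust
/// Append child addresses and the child count to `data`:
/// [data][children][num_children].
fn pack_node_data(mut data: Vec<u8>, children: &[u64]) -> Vec<u8> {
    debug_assert!(children.len() <= u8::MAX as usize);
    for address in children {
        data.extend_from_slice(&address.to_le_bytes());
    }
    data.push(children.len() as u8);
    data
}

/// Split a packed node back into its data and child addresses.
/// Returns `None` if the buffer is too short for the declared child count.
fn unpack_node_data(mut packed: Vec<u8>) -> Option<(Vec<u8>, Vec<u64>)> {
    let num_children = *packed.last()? as usize;
    let child_buf_len = num_children * 8;
    if packed.len() < child_buf_len + 1 {
        return None;
    }
    let data_len = packed.len() - (child_buf_len + 1);
    let mut children = Vec::with_capacity(num_children);
    for chunk in packed[data_len..data_len + child_buf_len].chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        children.push(u64::from_le_bytes(buf));
    }
    // Shrinking in place avoids a memmove of the data bytes.
    packed.truncate(data_len);
    Some((packed, children))
}
```

Putting the count last means the reader can find the children from the end of the buffer without knowing the data length in advance.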

arkpar · 2024-01-09T11:34:39Z

src/table.rs

@@ -1000,6 +1124,11 @@ impl ValueTable {
 	}

 	pub fn complete_plan(&self, log: &mut LogWriter) -> Result<()> {
+		let _free_entries_guard = if let Some(free_entries) = &self.free_entries {
+			Some(free_entries.write())


Wouldn't read lock suffice here?

Yes, read should work here. Done.

arkpar · 2024-01-09T11:46:02Z

src/column.rs

+			let mut data_buf = [0u8; 8];
+			data_buf.copy_from_slice(&address.to_le_bytes());
+			data.append(&mut data_buf.to_vec());


Suggested change

-			let mut data_buf = [0u8; 8];
-			data_buf.copy_from_slice(&address.to_le_bytes());
-			data.append(&mut data_buf.to_vec());
+			data.extend_from_slice(&address.to_le_bytes());

arkpar · 2024-01-09T12:13:27Z

src/db.rs

+		}
+		if !self.options.columns[col as usize].append_only && external_call {
+			return Err(Error::InvalidConfiguration(
+				"get_node can only be called on a column with append_only option.".to_string(),


Should be allowed for now. We won't be using the TreeReader in substrate initially

There won’t be any guarantee that the NodeAddress is still valid though, as it might have been removed. The Db won’t even be able to warn if it has happened.
Is there some external guarantee that this won’t happen?

Yes, in substrate there's a higher level state pinning mechanism. It prevents the tree from being deleted while there are active readers.

Ok. Should I add an option so the client has to choose to forego any checks/guarantees?

Added the option.

arkpar · 2024-01-09T12:47:43Z

src/db.rs

+		old_bytes: usize,
+		old_id: u64,
+		new_id: Option<u64>,
+	) -> Result<()> {


An alternative solution is to keep nodes referenced by TreeReaders in a shared memory storage. Once a node that's referenced by any of the live nodes in the cache is deleted from the disk, it is evicted to the memory storage. Something to consider doing later.

For now I'd agree with @cheme. It looks like we can get away with simply detecting reading a dereferenced tree and producing an error.

arkpar · 2024-01-09T13:08:02Z

src/db.rs

@@ -1499,6 +2135,78 @@ impl IndexedChangeSet {
 			}
 			*ops += 1;
 		}
+		for change in self.node_changes.iter() {
+			match change {


Would it be possible to move all that code and write_dereference_children_plan to the column module?

I tried this but it's a bit messy as the code potentially needs to create TreeReaders and they need DbInner.

arkpar · 2024-01-09T13:14:25Z

src/ref_count.rs

+}
+
+#[cfg(test)]
+mod test {


Would be good to get some tests working

cheme · 2024-01-09T16:30:48Z

> I'd rather keep it as key. ParityDb does not have a concept of "key in the tree". For all it cares, it stores nodes and their children. It does not have to be a prefix/radix tree.

I think at the end of the review I got accustomed to this usage, so ok for keeping key then.

cheme

Looks like most concerns are addressed.

Maybe the one about passing db twice to a function is still open.

I drafted this branch here for it:

https://github.com/MattHalpinParity/parity-db/compare/multi_tree...cheme:parity-db:cheme/multi-tree?expand=1
specifically MattHalpinParity@f2c8dce

but it is a bit verbose (a lot of .0 added); maybe just changing the prototype of
fn fn_name(&self, db: &Arc, .....

to
fn fn_name(db: &Arc, .....

would be better (even if we then have some Self::fn_name calls instead of self.fn_name calls).

(I will be updating to this branch in one of my polkadot sdk branches, and will try to open a draft PR as soon as possible.)

cheme · 2024-02-12T08:59:55Z

src/column.rs

+		return Err(Error::InvalidValueData)
+	}
+	let data_len = data.len() - (child_buf_len + 1);
+	let mut children = Children::new();


Suggested change

-	let mut children = Children::new();
+	let mut children = Children::with_capacity(num_children);

cheme · 2024-02-12T09:08:51Z

src/column.rs

+		tier_index: &mut HashMap<usize, usize>,
+		node_values: &mut Vec<NodeChange>,
+		data: &mut Vec<u8>,
+	) -> Result<()> {


I would almost propose returning impl Iterator<Item = Address> by using Iterator::from_fn (not passing data). This way we can keep the pack_data function (having it next to unpack makes it easy to check that the format is consistent).
But I am not sure it makes things easier to follow here.
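A sketch of the shape this proposal could take, using `std::iter::from_fn`. `NewNode`, `child_addresses`, and `pack_children` are hypothetical names for illustration; in the PR the address for each child is resolved during the commit rather than read from a plain vec.

```rust
use std::iter;

type Address = u64;

/// Hypothetical stand-in for a node of a tree being committed.
struct NewNode {
    children: Vec<Address>,
}

/// Yield child addresses lazily instead of appending to a shared `data` buffer.
fn child_addresses(node: &NewNode) -> impl Iterator<Item = Address> + '_ {
    let mut index = 0;
    iter::from_fn(move || {
        let address = node.children.get(index).copied()?;
        index += 1;
        Some(address)
    })
}

/// Packing stays in one place, so it can sit next to its unpack counterpart.
fn pack_children(data: &mut Vec<u8>, children: impl Iterator<Item = Address>) {
    for address in children {
        data.extend_from_slice(&address.to_le_bytes());
    }
}
```

The trade-off raised in the comment still applies: the iterator removes the `data` parameter from the traversal, but adds a layer of indirection to follow.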

arkpar · 2024-02-15T11:01:53Z

> Maybe the one with passing db two time in function:

I'd rather just use Arc::new_cyclic and keep a weak pointer to itself in DbInner.
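A minimal sketch of the `Arc::new_cyclic` approach, assuming a heavily reduced `DbInner` with a single field. The `this` field and the `handle` method are hypothetical names; the real struct has many more members.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical, reduced stand-in for the real `DbInner`.
struct DbInner {
    /// Weak pointer back to the owning `Arc`, set once at construction.
    this: Weak<DbInner>,
}

impl DbInner {
    fn new() -> Arc<DbInner> {
        // `new_cyclic` hands the closure a `Weak` to the `Arc` being built.
        Arc::new_cyclic(|weak| DbInner { this: weak.clone() })
    }

    /// Methods can recover an `Arc` from `&self` instead of taking
    /// an extra `db: &Arc<..>` parameter.
    fn handle(&self) -> Option<Arc<DbInner>> {
        self.this.upgrade()
    }
}
```

Because the back pointer is weak, it does not keep the database alive on its own, so `upgrade` returns `None` once the last strong reference is gone.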

MattHalpinParity and others added 30 commits May 24, 2023 11:40

Multitree root commit with blocking log write

a284ebc

fmt

f088832

Implement get_root and get_node

5353e83

Initial work on readers

05367c2

Working readers

ccb0155

Working iterator

27c3d06

Added TreeReader for accessing tree root and nodes

a479ce1

get_tree returns RwLock reader

d4afd3e

Track ValueTable free entries in memory

f8d825c

Added ability to claim free entries from ValueTable. Used this to make tree insertion work with commit queue.

5a50c1d

Make ChainGenerator generate trees that share nodes from previous trees

902fa02

Tree commits share existing nodes

a057eed

Stress test tree pruning. Tree removal (Currently just removes root).

366cb7b

Implemented tree removal with reference counting for shared nodes

4461eb4

Empty on shutdown option. This removes all trees and waits for value tables to empty.

73cf3ec

Depth based age histograms for more accurate chain generation. Increased node sharing.

ba5a913

Prepare for using claim_contiguous_entries

7365866

Append-only mode

5703cc2

Check RC on dereferencing root

6cdf863

Correctly use full key or hash

634a314

Safer entry claiming. Deal with tree removal while a commit is being built using that tree by deferring the removal.

ee11d9f

Separate tree operations

06b96e2

Added various checks for correct usage

b04121d

Reference count tables

1d73416

fmt

5aa61af

Remove value table verification of ref counts

8545181

Only create and use ref count table when needed

91d10ad

Multitree stress fix for appending to existing database

5709d13

On restart table data needs to be generated after all log files have been enacted

29e2070

In memory ref count cache. Verifies with table.

29c94d3

MattHalpinParity added 4 commits October 26, 2023 12:12

Windows fix

29db340

Fix

5ffa528

Fix

08ae7d1

Loom RwLock requires Sized

3d263d7

MattHalpinParity marked this pull request as ready for review October 30, 2023 10:59

cheme reviewed Jan 7, 2024

arkpar requested changes Jan 9, 2024

MattHalpinParity added 12 commits January 10, 2024 12:32

Debug asserts

59594a2

Use INFO_COLUMN as column index

b1e90f0

Avoid changing existing log action values

9662b92

Typo

0077b00

Fix

115b5c8

Remove ordered for now

0424bdf

Remove claim_next_free

dff334d

Read lock

e3e9032

Improved node data packing and unpacking

b566577

Simple ref_count tests

fad6255

Remove unused multi tree node compression

963d7ef

Added allow_direct_node_access column option

d4caa3f

cheme approved these changes Feb 15, 2024

arkpar approved these changes Feb 15, 2024

arkpar merged commit c9125f8 into paritytech:master Feb 15, 2024
9 checks passed

MattHalpinParity deleted the multi_tree branch February 29, 2024 12:11
