Multi tree column option #232

Merged
merged 54 commits into from
Feb 15, 2024

MattHalpinParity
Contributor

Issue #199
This introduces a multitree option for columns with related tree commit operations as well as querying functions.

MattHalpinParity and others added 30 commits May 24, 2023 11:40
…built using that tree by deferring the removal.
@MattHalpinParity MattHalpinParity marked this pull request as ready for review October 30, 2023 10:59
@cheme
Collaborator

cheme commented Oct 30, 2023

I started looking at the PR, but it is a big one (I did not look at the bench part yet), so I am summing up my initial questions here.

  • TreeViewer and defer mechanism. I am not too sure if strictly needed, but I would probably prefer it to be in a follow up PR (even in a different version).

  • Reference count cache. Same here: having it in a different PR could be good. Otherwise, is there a risk that it takes too much memory (I mean, should it be limited in size)? Also, would it be simpler to just use mmap over its file?

  • Reference count table. I am not sure why it is kept at Index table level, and not at Value table level. Having it at Value table level would mean multiple files, but no reindexing.

  • Do all nodes need to be hash indexed (on paper I think only the root needs to be, but I am not too sure about the actual implementation, and it would mean duplicating a few branches)?

  • The FreeEntries struct looks like an optimization (I did not understand the use of the ordered tree though), so it may be possible to extract it to a follow up PR (just to reduce this PR's size in this case). It could also be optionally configured at column level.

  • Not a question, but generally there are many "assert" in the code; I would suggest switching them to "debug_assert" or returning an error instead.

@MattHalpinParity
Contributor Author

Thanks for the comment. Will try and answer.

I started looking at the PR, but it is a big one (I did not look at the bench part yet), so I am summing up my initial questions here.

  • TreeViewer and defer mechanism. I am not too sure if strictly needed, but I would probably prefer it to be in a follow up PR (even in a different version).

The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.
The exception is if you’re in append only mode (eg, for an archive node). In this case you can access node addresses directly because they’ll never move.
The deferral happens if someone dereferences a tree that is currently being used by someone else to build a commit. We need to make sure the tree isn’t removed while it is being used.

  • Reference count cache. Same here: having it in a different PR could be good. Otherwise, is there a risk that it takes too much memory (I mean, should it be limited in size)? Also, would it be simpler to just use mmap over its file?

The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).
The cache is there because the table requires a search of a whole chunk to determine if a reference count is not there (which is going to be the most common case).
Certainly we will need to check the size of the cache is ok. Archive nodes won’t have any of this ref counting stuff as they’ll be in append only mode. So the ref count table will have to deal with the window of active trees in a full node.
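
The sparse scheme described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not parity-db's implementation: `SparseRefCounts` and its methods are hypothetical names, and a `HashMap` stands in for the on-disk ref count table and its cache. The point it shows is that a missing entry means a count of 1, so only shared nodes take any space.

```rust
use std::collections::HashMap;

/// Hypothetical sparse reference-count store: only counts greater than 1 are kept.
#[derive(Default)]
pub struct SparseRefCounts {
    counts: HashMap<u64, u64>,
}

impl SparseRefCounts {
    /// A live node with no stored entry has an implicit count of 1.
    pub fn get(&self, address: u64) -> u64 {
        self.counts.get(&address).copied().unwrap_or(1)
    }

    /// Adds one reference and returns the new count.
    /// The first shared use creates an entry holding 2.
    pub fn increment(&mut self, address: u64) -> u64 {
        let count = self.get(address) + 1;
        self.counts.insert(address, count);
        count
    }

    /// Removes one reference and returns the remaining count.
    /// The entry is dropped once the count falls back to 1;
    /// a result of 0 means the caller can remove the node itself.
    pub fn decrement(&mut self, address: u64) -> u64 {
        let remaining = self.get(address) - 1;
        if remaining > 1 {
            self.counts.insert(address, remaining);
        } else {
            self.counts.remove(&address);
        }
        remaining
    }

    /// Number of entries actually stored (shared nodes only).
    pub fn stored_entries(&self) -> usize {
        self.counts.len()
    }
}
```

With this shape, looking up an unshared node is the common case and finds nothing, which matches the motivation given above for putting a cache in front of the table.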

  • Reference count table. I am not sure why it is kept at Index table level, and not at Value table level. Having it at Value table level would mean multiple files, but no reindexing.

Yeah, this is just an optimization so we don’t have to store (and access) ref counts for most of the nodes at all. The first implementation just used the already existing ref counting that Value tables have.

  • Do all nodes need to be hash indexed (on paper I think only the root needs to be, but I am not too sure about the actual implementation, and it would mean duplicating a few branches)?

Only the roots have entries in the index table.

  • The FreeEntries struct looks like an optimization (I did not understand the use of the ordered tree though), so it may be possible to extract it to a follow up PR (just to reduce this PR's size in this case). It could also be optionally configured at column level.

When committing a tree it ‘claims’ value table entries immediately (in commit), which can be done using the table header data and the free list without reading or writing any table data or log overlays. You’re right that the ordered list isn’t actually used yet! That would be used in the future for optimizing trees by laying out nodes contiguously.

  • Not a question, but generally there are many "assert" in the code; I would suggest switching them to "debug_assert" or returning an error instead.

Thanks, I’ll have a look at that!

@cheme
Collaborator

cheme commented Oct 31, 2023

thanks for the replies,

The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.

We need to make sure the tree isn’t removed while it is being used.

I don't know if we need this, an error sounds fine to me.

But ok, I guess I got it: if a user tries to access a given address A and in between it got deleted and reused (very likely, given the way we reuse addresses), then they will get an incorrect result without knowing it.

I was just thinking it would get an error but I was wrong.

I guess the tree viewer makes sense (I could think of checking root presence in some cache every time we access a node, but it is certainly better to just update an atomic over the reader when we drop the tree, and then just return an error from get_node if the tree_viewer is no longer valid).

So for a start I would find it easier to go with a treeviewer that does not defer commits; a commit will just make all treeviewer sessions invalid.
From a user point of view it is certainly not as good, but our use case is just merkle trie node pruning so it doesn't sound bad to me.
The advantage would be simpler initial code.
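
A minimal sketch of the simpler scheme proposed in this comment: every open reader session shares an atomic flag with the database, dropping the tree flips the flag, and reads through a stale session return an error instead of silently returning reused data. `Store`, `TreeSession` and `ReadError` are hypothetical names for illustration only, and the sketch ignores the race between the flag check and the read that a real concurrent implementation would have to close.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    /// The tree was dropped after the session was opened.
    TreeRemoved,
    /// No node is stored at this address.
    NotFound,
}

/// Hypothetical reader session holding a flag that the store flips on removal.
pub struct TreeSession {
    valid: Arc<AtomicBool>,
}

/// Hypothetical stand-in for the database: node storage plus one flag per open session.
#[derive(Default)]
pub struct Store {
    nodes: HashMap<u64, Vec<u8>>,
    sessions: Vec<Arc<AtomicBool>>,
}

impl Store {
    pub fn insert(&mut self, address: u64, data: Vec<u8>) {
        self.nodes.insert(address, data);
    }

    pub fn open_session(&mut self) -> TreeSession {
        let valid = Arc::new(AtomicBool::new(true));
        self.sessions.push(valid.clone());
        TreeSession { valid }
    }

    /// Dropping the tree invalidates every open session instead of deferring the removal.
    pub fn drop_tree(&mut self) {
        for flag in self.sessions.drain(..) {
            flag.store(false, Ordering::Release);
        }
        self.nodes.clear();
    }

    /// Reads fail with an error once the session has been invalidated.
    pub fn get_node(&self, session: &TreeSession, address: u64) -> Result<Vec<u8>, ReadError> {
        if !session.valid.load(Ordering::Acquire) {
            return Err(ReadError::TreeRemoved);
        }
        self.nodes.get(&address).cloned().ok_or(ReadError::NotFound)
    }
}
```

The trade-off is the one named above: the code is simpler, but a client holding a session has to cope with its reads failing after someone else drops the tree.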

The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).

I did not think about it, can be sparse indeed.

The first implementation just used the already existing ref counting that Value tables have.

I understand that putting the reference counts in their own table can be good, but I would have expected the table to be indexed identically to the value tables, is it?

Only the roots have entries in the index table.

Then what happens for the reference count?? I am quite sure I don't get the reference count model, and the need to reindex it.
Or do we just use reference count for root?

That would be used in the future for optimizing trees by laying out nodes contiguously.

👍

@MattHalpinParity
Contributor Author

thanks for the replies,

The TreeReader stuff is there because we’re using Value table addresses directly as node addresses. Hence we need some way for a client to signal that it’s using nodes from an existing tree and the Db shouldn’t touch them (move or remove). This is what the TreeReader does.

We need to make sure the tree isn’t removed while it is being used.

I don't know if we need this, an error sounds fine to me.

But ok, I guess I got it: if a user tries to access a given address A and in between it got deleted and reused (very likely, given the way we reuse addresses), then they will get an incorrect result without knowing it.

I was just thinking it would get an error but I was wrong.

I guess the tree viewer makes sense (I could think of checking root presence in some cache every time we access a node, but it is certainly better to just update an atomic over the reader when we drop the tree, and then just return an error from get_node if the tree_viewer is no longer valid).

So for a start I would find it easier to go with a treeviewer that does not defer commits; a commit will just make all treeviewer sessions invalid. From a user point of view it is certainly not as good, but our use case is just merkle trie node pruning so it doesn't sound bad to me. The advantage would be simpler initial code.

Right. The client would have to cope with potentially failed commits? I think the Db would still need to track which TreeReaders were active when a dereference happened.

The reference count table and cache are purely optimizations so yes could be in a separate PR. They only store reference counts > 1 so are expected to be sparse (most nodes have ref count of 1).

I did not think about it, can be sparse indeed.

The first implementation just used the already existing ref counting that Value tables have.

I understand that putting the reference counts in their own table can be good, but I would have expected the table to be indexed identically to the value tables, is it?

It uses a hash of the Value table address. This is because only a small proportion of nodes need a reference count, so I didn’t want a ref count entry for every node.

Only the roots have entries in the index table.

Then what happens for the reference count?? I am quite sure I don't get the reference count model, and the need to reindex it. Or do we just use reference count for root?

Roots act like a standard Hash column entry. They have an index table entry and a Value table entry with potentially ref counting or not depending on if the client code wanted ref counted roots.
Other nodes only get a Value table entry and a ref count table entry. The reason for reindexing is because only some nodes need an entry in the ref count table (as it only stores > 1 ref count) so the ref count table isn’t the same size as the Value table and could grow in size differently. It is also randomly accessed using hashes because the entries that need a ref count are sparse. This also means reindexing is needed when growing.
Wrt the ref counting model, this is what it does: When committing a new tree all nodes get ref count 1. If a node uses a child node that exists from another tree then the child ref count is incremented.
Then on tree deletion it decrements ref count and if it hits 0 it removes the node and decrements ref counts of all children. This recurses as needed.
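
The ref counting model described in this reply can be sketched roughly as below. It is an in-memory illustration under assumptions, not the PR's code: `Forest` is a hypothetical type, every node stores an explicit count (the sparse table optimization is left out), and the recursion is written as an explicit work list.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory model of nodes shared between trees.
#[derive(Default)]
pub struct Forest {
    /// Children of each stored node, keyed by node address.
    children: HashMap<u64, Vec<u64>>,
    /// Reference count of each stored node.
    ref_counts: HashMap<u64, u32>,
}

impl Forest {
    /// Commits a tree given as (address, child addresses) pairs for its new nodes.
    /// New nodes start at count 1; a child that is not part of this commit
    /// already exists in another tree, so its count is incremented.
    pub fn commit_tree(&mut self, new_nodes: Vec<(u64, Vec<u64>)>) {
        let new_addresses: HashSet<u64> =
            new_nodes.iter().map(|(address, _)| *address).collect();
        for (address, children) in new_nodes {
            for child in &children {
                if !new_addresses.contains(child) {
                    if let Some(count) = self.ref_counts.get_mut(child) {
                        *count += 1;
                    }
                }
            }
            self.ref_counts.insert(address, 1);
            self.children.insert(address, children);
        }
    }

    /// Dereferences a tree from its root: decrement, and when a count hits 0
    /// remove the node and continue with its children.
    pub fn dereference(&mut self, root: u64) {
        let mut pending = vec![root];
        while let Some(address) = pending.pop() {
            let remaining = match self.ref_counts.get_mut(&address) {
                Some(count) => {
                    *count -= 1;
                    *count
                }
                None => continue,
            };
            if remaining == 0 {
                self.ref_counts.remove(&address);
                if let Some(children) = self.children.remove(&address) {
                    pending.extend(children);
                }
            }
        }
    }

    /// Current count of a node, or None if it is not stored.
    pub fn ref_count(&self, address: u64) -> Option<u32> {
        self.ref_counts.get(&address).copied()
    }
}
```

For example, if a second tree reuses node 2 from a first tree, removing the first tree leaves node 2 in place with a count of 1, and it is only removed when the second tree is dereferenced as well.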

That would be used in the future for optimizing trees by laying out nodes contiguously.

👍

Collaborator

@cheme cheme left a comment

I think I finally get how the rc table works.
I tried to slim down the comments I made at the time, especially the incorrect ones, but some may have slipped through.

admin/src/lib.rs Outdated
db_options.columns.push(info_column);
}

let info_column = &mut db_options.columns[1];
Collaborator

Could use multitree_bench::INFO_COLUMN rather than 1 and 2.

Contributor Author

Done

src/log.rs Outdated (resolved)
src/db.rs Outdated
#[derive(Debug, PartialEq, Eq)]
pub enum NodeChange {
    /// (address, value, compressed value, compressed)
    NewValue(u64, RcValue, RcValue, bool),
Collaborator

Isn't compressed defined per column? (If so, it would be better not to have it here.)

Contributor Author

This is whether the specific value was actually compressed or not (eg, if compressed length isn't smaller than the initial value then it is not compressed).
Though values are never compressed right now as discussed elsewhere so I could just remove it for now and store a single value.

src/table.rs Outdated
@@ -698,6 +749,11 @@ impl ValueTable {
last_removed,
);
self.last_removed.store(next_removed, Ordering::Relaxed);
if let Some(mut free_entries) = free_entries_guard {
let last = free_entries.stack.pop().unwrap();
assert_eq!(last, last_removed);
Collaborator

Wouldn't it be possible to skip reading last_removed here? (and remove assert_eq).
Otherwise would replace assert_eq by a debug_assert here.

Collaborator

Is this actually used (or is claim_next_free mainly used)?

Contributor Author

Made it debug_assert. next_free does need to maintain the free list correctly as root nodes can share value tables with child nodes and root nodes still use standard committing (with next_free).

src/table.rs Outdated
let last_removed = self.last_removed.load(Ordering::Relaxed);
let index = if last_removed != 0 {
    let last = free_entries.stack.pop().unwrap();
    assert_eq!(last, last_removed);
Collaborator

Suggested change
- assert_eq!(last, last_removed);
+ debug_assert!(last == last_removed);

Contributor Author

Done
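
For comparison with the `debug_assert` change above, the other option raised in the review (returning an error rather than asserting) could look roughly like this. It is only a sketch: `FreeListError` and `pop_free_entry` are hypothetical names, and a plain `Vec<u64>` stands in for the free entries stack.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum FreeListError {
    /// The in-memory stack has no entries left.
    Empty,
    /// The top of the stack disagrees with the value recorded in the table header.
    Mismatch { stack_top: u64, header: u64 },
}

/// Pops the top free entry only if it matches `last_removed` from the header.
/// On a mismatch the stack is left untouched and an error is returned
/// instead of panicking.
pub fn pop_free_entry(stack: &mut Vec<u64>, last_removed: u64) -> Result<u64, FreeListError> {
    match stack.last().copied() {
        None => Err(FreeListError::Empty),
        Some(top) if top != last_removed => {
            Err(FreeListError::Mismatch { stack_top: top, header: last_removed })
        }
        Some(top) => {
            stack.pop();
            Ok(top)
        }
    }
}
```

A `debug_assert` keeps release builds free of the check, while the error-returning form keeps the check in all builds and lets the caller decide how to react to an inconsistent free list.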

}
}
}
// TODO: Remove TreeReader from Db.
Collaborator

Not too sure about the meaning of this TODO?

src/log.rs Outdated (resolved)
src/ref_count.rs (resolved)
src/table.rs Outdated (resolved)
src/db.rs (resolved)
@MattHalpinParity
Copy link
Contributor Author

I think I finally get how the rc table works. I tried to slim down the comments I made at the time, especially the incorrect ones, but some may have slipped through.

Thanks. Will have a look at them and fix/reply.

Member

Ideally this should've been a --tree option to the stress command, rather than a separate command. I'm fine with this version for now, but consider merging them in the future.

src/column.rs Outdated
1 + data.len() + num_children as usize * 8
}

pub fn pack_node_data(data: Vec<u8>, child_data: Vec<u8>, num_children: u8) -> Vec<u8> {
Member

Right, the optimal layout should be [data][children][num_children]. This way unpacking function can just shrink the input vec and avoid a memmove. Packing would also be appending to data and returning it.
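
A sketch of the trailing-count layout suggested here, `[data][children][num_children]`. The signatures are simplified for illustration and are not the PR's: children are passed as a slice of `u64` addresses, and both functions return `Option` rather than the crate's `Result`. Packing appends to the data buffer and returns it; unpacking pops the count, decodes the children from the tail, and truncates the same buffer so the data bytes never move.

```rust
/// Size of one child address in bytes.
const ADDRESS_LEN: usize = 8;

/// Packs a node as [data][children][num_children] by appending to `data`.
/// Returns None if there are more children than fit in the one-byte count.
pub fn pack_node_data(mut data: Vec<u8>, children: &[u64]) -> Option<Vec<u8>> {
    if children.len() > u8::MAX as usize {
        return None;
    }
    data.reserve(children.len() * ADDRESS_LEN + 1);
    for address in children {
        data.extend_from_slice(&address.to_le_bytes());
    }
    data.push(children.len() as u8);
    Some(data)
}

/// Unpacks a node by reading the count from the end and shrinking the buffer,
/// which avoids moving the data bytes. Returns None on malformed input.
pub fn unpack_node_data(mut packed: Vec<u8>) -> Option<(Vec<u8>, Vec<u64>)> {
    let num_children = usize::from(packed.pop()?);
    let child_buf_len = num_children * ADDRESS_LEN;
    let data_len = packed.len().checked_sub(child_buf_len)?;
    let children: Vec<u64> = packed[data_len..]
        .chunks_exact(ADDRESS_LEN)
        .map(|chunk| {
            let mut buf = [0u8; ADDRESS_LEN];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect();
    packed.truncate(data_len);
    Some((packed, children))
}
```

The packed length is `data.len() + num_children * 8 + 1`, the same size as the expression in the diff above; only the position of the count byte changes.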

src/table.rs (resolved)
src/table.rs Outdated
@@ -1000,6 +1124,11 @@ impl ValueTable {
}

pub fn complete_plan(&self, log: &mut LogWriter) -> Result<()> {
    let _free_entries_guard = if let Some(free_entries) = &self.free_entries {
        Some(free_entries.write())
Member

Wouldn't read lock suffice here?

Contributor Author

Yes, read should work here. Done.

src/column.rs Outdated
Comment on lines 1075 to 1077
let mut data_buf = [0u8; 8];
data_buf.copy_from_slice(&address.to_le_bytes());
data.append(&mut data_buf.to_vec());
Member

Suggested change
- let mut data_buf = [0u8; 8];
- data_buf.copy_from_slice(&address.to_le_bytes());
- data.append(&mut data_buf.to_vec());
+ data.extend_from_slice(&address.to_le_bytes());

Contributor Author

Done
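
The two forms in this suggestion produce the same bytes; the shorter one just skips the temporary array and the temporary `Vec`. A small side-by-side sketch, where the wrapper function names are hypothetical and exist only for the comparison:

```rust
/// Original form from the diff: copies through a temporary array and a temporary Vec.
pub fn append_address_verbose(data: &mut Vec<u8>, address: u64) {
    let mut data_buf = [0u8; 8];
    data_buf.copy_from_slice(&address.to_le_bytes());
    data.append(&mut data_buf.to_vec());
}

/// Suggested form: appends the little-endian bytes directly.
pub fn append_address(data: &mut Vec<u8>, address: u64) {
    data.extend_from_slice(&address.to_le_bytes());
}
```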

src/db.rs Outdated
}
if !self.options.columns[col as usize].append_only && external_call {
return Err(Error::InvalidConfiguration(
"get_node can only be called on a column with append_only option.".to_string(),
Member

Should be allowed for now. We won't be using the TreeReader in substrate initially

Contributor Author

There won’t be any guarantee that the NodeAddress is still valid though, as it might have been removed. The Db won’t even be able to warn if it has happened.
Is there some external guarantee that this won’t happen?

Member

Yes, in substrate there's a higher level state pinning mechanism. It prevents the tree from being deleted while there are active readers.

Contributor Author

Ok. Should I add an option so the client has to choose to forego any checks/guarantees?

Contributor Author

Added the option.

old_bytes: usize,
old_id: u64,
new_id: Option<u64>,
) -> Result<()> {
Member

An alternative solution is to keep nodes referenced by TreeReaders in a shared memory storage. Once a node that's referenced by any of the live nodes in the cache is deleted from the disk, it is evicted to the memory storage. Something to consider doing later.

For now I'd agree with @cheme. It looks like we can get away with simply detecting reading a dereferenced tree and producing an error.

src/db.rs (resolved)
@@ -1499,6 +2135,78 @@ impl IndexedChangeSet {
}
*ops += 1;
}
for change in self.node_changes.iter() {
match change {
Member

Would it be possible to move all that code and write_dereference_children_plan to the column module?

Contributor Author

I tried this but it's a bit messy as the code potentially needs to create TreeReaders and they need DbInner.

}

#[cfg(test)]
mod test {
Member

Would be good to get some tests working

@cheme
Collaborator

cheme commented Jan 9, 2024

I'd rather keep it as key. ParityDb does not have a concept of "key in the tree". For all it cares, it stores nodes and their children. It does not have to be a prefix/radix tree.

I think at the end of the review I got accustomed to this usage, so ok for keeping key then.

Collaborator

@cheme cheme left a comment

Looks like most concerns are addressed.

Maybe the one about passing db twice to a function:

I drafted this branch here for it:

https://github.com/MattHalpinParity/parity-db/compare/multi_tree...cheme:parity-db:cheme/multi-tree?expand=1
specifically MattHalpinParity@f2c8dce

but it is a bit verbose (lot of .0 added), maybe just changing proto of
fn fn_name(&self, db: &Arc, .....

to
fn fn_name(db: &Arc, .....

would be better (even if then we have some Self::fn_name calls instead of self.fn_name calls).

(I will be updating one of my polkadot sdk branches to this branch, and will try to open a draft PR as soon as possible.)

return Err(Error::InvalidValueData)
}
let data_len = data.len() - (child_buf_len + 1);
let mut children = Children::new();
Collaborator

Suggested change
- let mut children = Children::new();
+ let mut children = Children::with_capacity(num_children);

tier_index: &mut HashMap<usize, usize>,
node_values: &mut Vec<NodeChange>,
data: &mut Vec<u8>,
) -> Result<()> {
Collaborator

I would almost propose returning impl Iterator<Item = Address> by using std::iter::from_fn (not passing data). This way we can keep the pack_data function (having it next to unpack makes it easy to check that the format is consistent).
But I am not sure if it makes things easier to follow here.
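
One possible shape for this suggestion, as a hedged sketch: the function yields child addresses lazily through `std::iter::from_fn`, and the caller stays in charge of packing them. `claim_child_addresses` and the `next_free` counter are hypothetical stand-ins for whatever the real function does to claim value table entries.

```rust
use std::iter;

/// Address type used in this sketch.
pub type Address = u64;

/// Yields `num_children` freshly claimed addresses, advancing `next_free` as it goes.
/// The closure passed to `iter::from_fn` is called once per `next()` and
/// returns None when no children remain.
pub fn claim_child_addresses<'a>(
    next_free: &'a mut Address,
    num_children: usize,
) -> impl Iterator<Item = Address> + 'a {
    let mut remaining = num_children;
    iter::from_fn(move || {
        if remaining == 0 {
            return None;
        }
        remaining -= 1;
        let claimed = *next_free;
        *next_free += 1;
        Some(claimed)
    })
}
```

A caller could then feed the iterator straight into a packing function, which keeps the pack and unpack code side by side as the comment suggests.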

@arkpar
Member

arkpar commented Feb 15, 2024

Maybe the one about passing db twice to a function:

I'd rather just use Arc::new_cyclic and keep a weak pointer to itself in DbInner.
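
A minimal sketch of that approach, assuming a simplified `DbInner` whose fields and methods here are hypothetical: `Arc::new_cyclic` hands the constructor a `Weak` pointer to the value being built, so methods can later recover an owning `Arc` from `&self` instead of taking a separate `db: &Arc<...>` argument.

```rust
use std::sync::{Arc, Weak};

/// Simplified stand-in for the database internals.
pub struct DbInner {
    /// Weak pointer back to the Arc that owns this value.
    this: Weak<DbInner>,
    name: String,
}

impl DbInner {
    /// Builds the value inside its Arc so it can store a weak pointer to itself.
    pub fn open(name: &str) -> Arc<DbInner> {
        Arc::new_cyclic(|this| DbInner {
            this: this.clone(),
            name: name.to_string(),
        })
    }

    /// Recovers an owning handle from `&self`.
    /// Returns None only if the last strong reference is already gone.
    pub fn handle(&self) -> Option<Arc<DbInner>> {
        self.this.upgrade()
    }

    pub fn name(&self) -> &str {
        &self.name
    }
}
```

Because the self pointer is weak, it does not keep the database alive on its own, so there is no reference cycle to break on shutdown.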

@arkpar arkpar merged commit c9125f8 into paritytech:master Feb 15, 2024
9 checks passed
@MattHalpinParity MattHalpinParity deleted the multi_tree branch February 29, 2024 12:11