This repository has been archived by the owner on Jan 21, 2023. It is now read-only.

Box most NodeData variants #74

Closed · wants to merge 1 commit

Conversation

@jyn514 (Contributor) commented Jul 1, 2020

This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).

Closes #73

@SimonSapin (Collaborator)

This reduces memory usage in NodeData by a factor of 5 (80 -> 16),
and usage in Node by a factor of 2 (120 -> 56).

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes, Box<RefCell<String>> is 8 bytes, but the 32 bytes for the RefCell<String> are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

The idea of boxing some enum variants is to reduce unused padding for the other, smaller variants. The overall effect is very dependent on which variants are more frequent in a concrete use case. Here DocumentFragment is likely the variant that profits the most from this PR, but I suspect it’s very uncommon in real-world documents.

I suspect that unfortunately in this enum the largest variant, Element, is typically also one of the most frequent. So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.
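
As a rough illustration of that trade-off (a minimal sketch with made-up types, not kuchiki’s actual NodeData), boxing the largest variant shrinks the enum itself but moves its payload into a separate heap allocation:

    use std::mem::size_of;

    // A large variant drags the whole enum up to its size, so every small
    // variant carries that much padding.
    #[allow(dead_code)]
    enum Unboxed {
        Large([usize; 9]), // 72 bytes of payload, comparable to Doctype
        Small(u32),
    }

    // Boxing the large payload shrinks the enum to a pointer plus the
    // discriminant, at the cost of one 72-byte heap allocation per Large value.
    #[allow(dead_code)]
    enum Boxed {
        Large(Box<[usize; 9]>),
        Small(u32),
    }

    fn main() {
        // Typical sizes on a 64-bit target; Rust does not guarantee enum layout.
        println!("Unboxed: {} bytes", size_of::<Unboxed>()); // usually 80
        println!("Boxed:   {} bytes", size_of::<Boxed>());   // usually 16
    }

In this sketch a Small value drops from 80 bytes to 16, while a Large value goes from 80 to 16 + 72 = 88 in total, which is why the overall effect depends on how frequent each variant is.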

@jyn514 (Contributor, Author) commented Jul 2, 2020

So I’d be inclined not to take a PR like this without some measurement or statistics that show that it actually helps.

Would this do? #73 (comment) (see servo/html5ever#420 (comment) to replicate)

@jyn514 (Contributor, Author) commented Jul 2, 2020

This analysis is definitely too simplistic because it ignores the memory usage of Box allocations. Yes, Box<RefCell<String>> is 8 bytes, but the 32 bytes for the RefCell<String> are still stored somewhere. In some cases adding a pointer indirection can increase total memory usage, because the 8 bytes for that pointer also need to be stored.

This means that Text now uses a total of 40 bytes of memory, but that's still half of the 80 it was taking up before. All of the variants besides Doctype now use less memory overall than they did before. Doctype has now grown from 80 -> 88, which seems like a reasonable tradeoff.

@SimonSapin (Collaborator)

Here is the size of the fields of each variant as of 596ecac. (For the size of the enum we need to add the discriminant, which is 8 bytes with padding because of field alignment.) I think this should have been the very first step of this discussion.

    assert_eq!(64, size_of::<crate::ElementData>());
    assert_eq!(32, size_of::<std::cell::RefCell<String>>()); // Text, Comment
    assert_eq!(56, size_of::<std::cell::RefCell<(String, String)>>());  // ProcessingInstruction
    assert_eq!(72, size_of::<crate::Doctype>());
    assert_eq!(1, size_of::<crate::DocumentData>());

I’m gonna claim without backing data that elements and text are very common, comments somewhat uncommon, and everything else exceptional. (Processing instructions are an XML thing and are not generated by the HTML parser. Each document typically has one document node, and zero or one doctype.)

For uncommon enum variants we care about the direct size_of which increases the padding for other variants. For common variants we care about the total memory use, including separate allocations.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

For elements and text I think it’s much less obvious, since they are common. But I think we can explore ideas other than just adding Box in a few places.

For example, how common are elements with zero attributes? If it’s more than 33% it would be a win to replace Attributes in ElementData with Option<Box<Attributes>>, and store None for those elements. However that’s ideally an implementation detail hidden from the public API.

    assert_eq!(24, size_of::<crate::Attributes>());
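
The 33% figure can be sanity-checked with a stand-in type (hypothetical, not kuchiki’s real Attributes, just something else that is 24 bytes): thanks to the non-null niche of Box, Option<Box<_>> costs only a pointer.

    use std::mem::size_of;

    // Hypothetical 24-byte stand-in for Attributes (a Vec is three words on a
    // 64-bit target).
    #[allow(dead_code)]
    struct FakeAttributes {
        entries: Vec<(String, String)>,
    }

    fn main() {
        assert_eq!(24, size_of::<FakeAttributes>());
        // Box is never null, so None reuses the null-pointer value and the
        // whole Option<Box<FakeAttributes>> is still just 8 bytes.
        assert_eq!(8, size_of::<Option<Box<FakeAttributes>>>());
    }

Per element, attributes present would then cost 24 (heap) + 8 (pointer) = 32 bytes instead of 24 inline, while attributes absent cost 8 instead of 24; the +8 and −16 balance out exactly when a third of elements have no attributes.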

Ultimately though, if memory usage for large documents is important enough you should consider using something other than Kuchiki.

In my view Kuchiki is mostly a proof of concept to show what integrating the html5ever and selectors crates looks like. Designing a tree data structure for this is full of trade-offs. I tried to make Kuchiki not terrible at any use case, but that means it’s also often not great.

In particular:

  • If you don’t care about removing some nodes from the tree and reclaiming memory before you drop the whole tree, use an arena allocator and &'arena Node, or a per-document Vec and integer indices. Rc<Node> has 16 bytes of overhead per node.
  • If you don’t care about documents with more than four billion nodes, use 32-bit integer indices instead of pointers.
  • If you don’t care about mutating any node after the initial parsing, remove all RefCells. Those have 8 bytes overhead each.
  • If you do, replacing RefCell with Cell also removes that overhead at the cost of a less convenient API. Cell::update is still unstable but it’s fairly simple and doesn’t use unsafe, so duplicating it (or a variation of it that relies on T: Default and Cell::take instead of T: Copy and Cell::get) is easy and not risky (see the sketch at the end of this comment).
  • If you don’t need CSS selector matching or anything else that goes "up" the tree from a given node, consider children: Vec<NodeRef> or children: Box<[NodeRef]> instead of five links for parent/previous sibling/next sibling/first child/last child.
  • If you do, that last one might still be viable if you manipulate not just references to a given node but "cursors" that contain a stack of indices into children sequences, forming the path from the root of the tree to a given node.

And I’m sure there are other interesting data structure ideas I haven’t thought of.
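
For the Cell point above, a minimal sketch of the T: Default + Cell::take variation (update_cell is a made-up helper name, not an existing API):

    use std::cell::Cell;

    // Take the value out (leaving T::default() behind), transform it, and put
    // it back. Unlike Cell::update (unstable at the time of this discussion),
    // this needs T: Default instead of T: Copy, so it works for Vec, String, etc.
    fn update_cell<T: Default>(cell: &Cell<T>, f: impl FnOnce(T) -> T) {
        let value = cell.take();
        cell.set(f(value));
    }

    fn main() {
        // e.g. a child list stored without RefCell's 8-byte borrow flag.
        let children: Cell<Vec<u32>> = Cell::new(vec![1, 2]);
        update_cell(&children, |mut v| {
            v.push(3);
            v
        });
        assert_eq!(children.take(), vec![1, 2, 3]);
    }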

@jyn514 (Contributor, Author) commented Jul 2, 2020

I think this should have been the very first step of this discussion.

That would have been a good start, sorry about that.

With this I think it becomes apparent that boxing doctypes is a win. Processing instructions as well, if we get element data below 56 bytes.

🎉 🎉

If memory usage for large documents is important enough you should consider using something other than Kuchiki. In my view Kuchiki is mostly a proof-of-concept.

I understand your viewpoint. I'm not aware of any alternatives to Kuchiki, though, and it's about 2k lines of code; it'd be a shame to rewrite all that just for docs.rs. I'm not sure what the best way forward is.

@jdm (Contributor) commented Jul 2, 2020

@jyn514 To be fair, based on https://github.com/rust-lang/docs.rs/blob/2edd72e2ad317ff8725db1dfe303cb8bc4918255/src/utils/html.rs#L6-L36 there's a lot of kuchiki's implementation that you wouldn't need to replicate. You're basically looking at:

Any remaining bits could be replaced by manually walking the resulting parsed tree to find elements matching head and body, as far as I can see.

@jdm (Contributor) commented Jul 2, 2020

Alternatively, https://crates.io/crates/lol_html might be an option that better meets your needs out of the box.

@jyn514 (Contributor, Author) commented Jul 2, 2020

lol_html looks like exactly what we want. Thanks!

Successfully merging this pull request may close these issues:

NodeData is unnecessarily large, bloating the size of all Nodes (#73)