explore: make NodeSet a subclass of Array #2184

flavorjones · 2021-02-01T14:06:36Z

NodeSet today

Over the years, NodeSet has slowly approached being API-compatible with Enumerable or Array. This is good, and it validates the mental model of libxml2's xmlNodeSet as an augmented ordered set, especially given that the underlying implementation is literally a C array:

typedef struct _xmlNodeSet xmlNodeSet;
typedef xmlNodeSet *xmlNodeSetPtr;
struct _xmlNodeSet {
    int nodeNr;			/* number of nodes in the set */
    int nodeMax;		/* size of the array as allocated */
    xmlNodePtr *nodeTab;	/* array of nodes in no particular order */
    /* @@ with_ns to check whether namespace nodes should be looked at @@ */
};

However, we find ourselves at an interesting point, where NodeSet is not completely an Enumerable or Array, and there are open issues pointing this out:

NodeSet does not follow ruby conventions for enumerable methods · Issue #1677 · sparklemotion/nokogiri

Further, NodeSet has baggage, namely the associated Document object, which makes simple operations harder or even causes bugs:

segfault in node_set.rb · Issue #1952 · sparklemotion/nokogiri

Finally, the NodeSet class is bigger and more complex than necessary (in both CRuby and JRuby), and so is a bit of a maintenance burden at this point.

NodeSet Tomorrow

As mentioned in #1952, it would be simpler if NodeSet were a subclass of Array, which would free us from using libxml2's xmlNodeSet and unify the JRuby and CRuby implementations.

The memory model could be updated so that it is independent of any Document, bringing it into alignment with the memory model of the standard Ruby collection classes.

NodeSet would conform fully to the Enumerable API.

NodeSet would also include Searchable, so that existing API usage continues to work.

The API could also implement Document decorators at creation time by optionally inheriting them from an existing NodeSet or the creating Document. Decorators are a rarely-used and ill-documented feature which I suspect is buggy and would be improved by moving to a simpler implementation.
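A minimal sketch of the shape this could take, in plain Ruby with no Nokogiri dependency. `SketchNode`, `SketchSearchable`, and `SketchNodeSet` are hypothetical stand-ins, not the real `Nokogiri::XML::Node`, `Searchable`, or `NodeSet`; the point is only that an Array subclass inherits the Enumerable API and needs just the search methods layered on top:

```ruby
# Hypothetical stand-in for a DOM node; not Nokogiri's real Node class.
class SketchNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end

  # Depth-first search of this node's descendants by element name.
  def search(wanted)
    children.flat_map do |child|
      matches = child.name == wanted ? [child] : []
      matches + child.search(wanted)
    end
  end
end

# Hypothetical stand-in for the search mixin.
module SketchSearchable
  # Search under every member and wrap the de-duplicated results in the
  # receiver's own class, so searches can be chained.
  def search(wanted)
    self.class.new(flat_map { |node| node.search(wanted) }.uniq)
  end
end

# The set is just an Array holding nodes: no Document reference required.
class SketchNodeSet < Array
  include SketchSearchable
end

items = [SketchNode.new("li"), SketchNode.new("li")]
list  = SketchNode.new("ul", items)
set   = SketchNodeSet.new([list])

found = set.search("li")   # a SketchNodeSet containing both "li" nodes
names = set.map(&:name)    # inherited from Array; returns a plain Array
```

One design wrinkle worth noting: Array methods such as `map` and `select` return plain Array instances rather than instances of the subclass, so any method that should return a NodeSet has to wrap its result explicitly, as `search` does here.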

DocumentFragment tomorrow

Finally, this opens the door to a long-time roadmap item, which is to re-implement DocumentFragment on top of NodeSet, thereby avoiding libxml2's underlying conventions (and further unifying the JRuby and CRuby implementations). This would be a further simplification, and could allow us to fix the quirks in how XPath searches behave differently in fragments than in Documents and NodeSets.

Risks

Primarily, the risks are:

GC implementation correctness
Potentially unexpected memory usage, made possible by the fix for "segfault in node_set.rb" (#1952)

The first risk exists because we'd be making an invasive change to a codebase that has been tested thoroughly by many applications over many years. This can be mitigated by continuing to run valgrind in the CI suite, and potentially extending coverage to use ASan. We may also want to implement this as an entirely new class, so that applications can "flip back to the previous implementation" at runtime if any surprising problems occur (e.g., by setting an environment variable or global constant before Nokogiri is loaded).
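One way the "flip back" switch might look, as a hedged sketch only. The environment variable name `SKETCH_LEGACY_NODE_SET` and every class and module name here are invented for illustration; none of them are existing Nokogiri names:

```ruby
module SketchLib
  # Stand-in for the existing implementation (hypothetical).
  class LegacyNodeSet
    include Enumerable

    def initialize(nodes = [])
      @nodes = nodes
    end

    def each(&block)
      @nodes.each(&block)
    end
  end

  # Stand-in for the new Array-backed implementation (hypothetical).
  class ArrayNodeSet < Array
  end

  # Pick the implementation from the environment. The variable name is
  # an assumption; `env` is injectable so the choice can be tested.
  def self.node_set_class(env = ENV)
    env["SKETCH_LEGACY_NODE_SET"] == "1" ? LegacyNodeSet : ArrayNodeSet
  end

  # Resolved once, at load time, so the switch must be set beforehand.
  NodeSet = node_set_class
end
```

Resolving the constant at load time keeps the rest of the library unaware of the switch, at the cost of not being able to change implementations once the library is loaded.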

The second risk exists because a NodeSet may now contain nodes from many documents, and the highly-connected DOM graph may then mean that many unused objects would be prevented from being GCed. This perhaps shouldn't be surprising to anyone who's thought deeply about directed graphs.
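The retention concern can be illustrated with a toy model. This is a sketch under the assumption that every node holds a strong reference back to its document, which in turn holds every one of its nodes; `ToyDocument`, `ToyNode`, and `nodes_kept_alive` are hypothetical names, and the node counts are arbitrary:

```ruby
# Hypothetical node that keeps a strong reference to its document.
class ToyNode
  attr_reader :document

  def initialize(document)
    @document = document
  end
end

# Hypothetical document that owns all of its nodes.
class ToyDocument
  attr_reader :nodes

  def initialize(node_count)
    @nodes = Array.new(node_count) { ToyNode.new(self) }
  end
end

# Count the nodes reachable from a collection via each member's document.
# Anything reachable from a live object cannot be garbage collected.
def nodes_kept_alive(collection)
  collection.map(&:document).uniq.sum { |doc| doc.nodes.size }
end

doc_a = ToyDocument.new(1_000)
doc_b = ToyDocument.new(1_000)

# Two nodes in the set, one from each document...
mixed = [doc_a.nodes.first, doc_b.nodes.first]

# ...but every node of both documents stays reachable through them.
kept = nodes_kept_alive(mixed)
```

In this model a two-element collection pins the full contents of both documents, which is the kind of surprise the second risk describes.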


flavorjones · 2021-11-26T20:55:59Z

I spent a little bit of time spiking on this today and got pretty far on getting the test suite to pass. I haven't turned on valgrind checks yet, but I'm optimistic that this might be easier than I suspected.

flavorjones · 2022-01-10T17:58:36Z

Notes to self: work needed on the branch:

document NodeSet
delete C code
explore JRuby code deletion
memory: test with valgrind/memcheck
memory: think hard about leak scenarios (in the leak test suite maybe)

flavorjones added the topic/memory Segfaults, memory leaks, valgrind testing, etc. label Feb 1, 2021

flavorjones added this to the v2.0.0 milestone Feb 1, 2021

flavorjones mentioned this issue Feb 1, 2021

segfault in node_set.rb #1952

Closed

flavorjones mentioned this issue Jan 11, 2022

[bug] Regression in 1.13.0 in DocumentFragment#css? #2419

Closed

flavorjones mentioned this issue Aug 6, 2022

[draft] NodeSet is an Array #2617

Draft

flavorjones modified the milestones: v2.0.0, v1.15.0 Jan 16, 2023

flavorjones modified the milestones: v1.15.0, v1.16.0 Apr 28, 2023

flavorjones modified the milestones: v1.16.0, v1.16.x patch releases, v1.17.0 Dec 28, 2023

This was referenced May 1, 2024

use template as the context node flavorjones/nokogiri-html5-inference#7

Merged

Inference.parse always returns a NodeSet for fragments flavorjones/nokogiri-html5-inference#8

Open
