Skip to content

Commit

Permalink
fix(jruby): improve how Node#line is calculated
Browse files Browse the repository at this point in the history
Note that Java's w3c.dom.Node interface does not have any way to
get (or set) the line number for that node. The original JRuby
implementation counted newlines, but did it poorly -- see #1223.

This commit improves the newline-counting approach. But the solution
itself -- counting newlines! -- is still questionable IMHO and
absolutely inefficient.

I had played around with an approach that I wrote about at #1223,
where the SAX Parser knows what line it's on when `startElement` is
invoked (via the XMLLocator). But I couldn't figure out how to
preserve that information in the final Document or Node.

If you, like me, think this approach is terrible; and if you *also*
understand how to set this metadata on the Node or the Document, then
please help us out.
  • Loading branch information
flavorjones committed Jan 18, 2021
1 parent 369e6bd commit b32c875
Show file tree
Hide file tree
Showing 3 changed files with 48 additions and 28 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
Expand Up @@ -12,6 +12,11 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* [JRuby] `{XML,HTML}::Document.parse` now invokes `#initialize` exactly once. Previously `#initialize` was not called, which was a problem for subclassing such as done by `Loofah`.


### Improved

* [JRuby] Update the algorithm used to calculate `Node#line` to be wrong less-often. The underlying parser, Xerces, does not track line numbers, and so we've always used a hacky solution for this method. [[#1223](https://github.com/sparklemotion/nokogiri/issues/1223)]


## v1.11.1 / 2021-01-06

### Fixed
Expand Down
20 changes: 14 additions & 6 deletions ext/java/nokogiri/XmlNode.java
Expand Up @@ -1406,14 +1406,22 @@ private boolean count(Node node, int[] counter) {
if (node == this.node) {
return true;
}

NodeList list = node.getChildNodes();
for (int i=0; i<list.getLength(); i++) {
Node n = list.item(i);
if (n instanceof Text
&& ((Text)n).getData().contains("\n")) {
counter[0] += 1;
for (int jchild=0; jchild<list.getLength(); jchild++) {
Node child = list.item(jchild);

if (child instanceof Text) {
String text = ((Text)child).getData();
int textLength = text.length();
for (int jchar = 0; jchar < textLength; jchar++) {
if (text.charAt(jchar) == '\n') {
counter[0] += 1;
}
}
}
if (count(n, counter)) return true;

if (count(child, counter)) return true;
}
return false;
}
Expand Down
51 changes: 29 additions & 22 deletions test/xml/test_node.rb
Expand Up @@ -1055,28 +1055,6 @@ def test_namespace_without_an_href_on_html_node
assert_nil(node.namespaces['xmlns:o'])
end

def test_line
xml = Nokogiri::XML(<<~eoxml)
<root>
<a>
Hello world
</a>
</root>
eoxml

set = xml.search("//a")
node = set.first
assert_equal(2, node.line)
end

def test_set_line
skip("Only supported with libxml2") unless Nokogiri.uses_libxml?
document = Nokogiri::XML::Document.new
node = document.create_element('a')
node.line = 54321
assert_equal(54321, node.line)
end

def test_xpath_results_have_document_and_are_decorated
x = Module.new do
def awesome!; end
Expand Down Expand Up @@ -1230,6 +1208,35 @@ def test_wrap
assert_equal('wrapper', thing.parent.name)
assert_equal('thing', doc.at_css("wrapper").children.first.name)
end

describe "#line" do
it "returns a sensible line number for each node" do
xml = Nokogiri::XML(<<~eoxml)
<a>
<b>
Hello world
</b>
<b>
Goodbye world
</b>
</root>
eoxml

set = xml.search("//b")
assert_equal(2, set[0].line)
assert_equal(5, set[1].line)
end
end

describe "#line=" do
it "overrides the line number of a node" do
skip("Xerces does not have line numbers for nodes") unless Nokogiri.uses_libxml?
document = Nokogiri::XML::Document.new
node = document.create_element('a')
node.line = 54321
assert_equal(54321, node.line)
end
end
end
end
end
Expand Down

0 comments on commit b32c875

Please sign in to comment.