Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line number capped at 65535 but libxml2 already returns long. #1617

Closed
pakore opened this issue Mar 17, 2017 · 8 comments
Closed

Line number capped at 65535 but libxml2 already returns long. #1617

pakore opened this issue Mar 17, 2017 · 8 comments

Comments

@pakore
Copy link

pakore commented Mar 17, 2017

ext/nokogiri/xml_node.c:1263 contains the line:

return INT2NUM(xmlGetLineNo(node));

but the signature of xmlGetLineNo according to libxml2 2.8.0 (tree.c:4527) is

long
xmlGetLineNo(xmlNodePtr node)

so libxml2 is returning a long that is passed to a macro function that accepts an int , so long is converted to int, and hence capped to 65535

The solution would be to change ext/nokogiri/xml_node.c:1263 to

return LONG2NUM(xmlGetLineNo(node));

and thus a problem that's been around for some years would be fixed.

@flavorjones
Copy link
Member

The act of casting this return value from a long to an int is not the cause of line number overflow.

First, let's take a look at the relevant libxml2 struct:

/**
 * xmlNode:
 *
 * A node in an XML tree.
 */
typedef struct _xmlNode xmlNode;
typedef xmlNode *xmlNodePtr;
struct _xmlNode {
    void           *_private;	/* application data */
    xmlElementType   type;	/* type number, must be second ! */
    const xmlChar   *name;      /* the name of the node, or the entity */
    struct _xmlNode *children;	/* parent->childs link */
    struct _xmlNode *last;	/* last child link */
    struct _xmlNode *parent;	/* child->parent link */
    struct _xmlNode *next;	/* next sibling link  */
    struct _xmlNode *prev;	/* previous sibling link  */
    struct _xmlDoc  *doc;	/* the containing document */

    /* End of common part */
    xmlNs           *ns;        /* pointer to the associated namespace */
    xmlChar         *content;   /* the content */
    struct _xmlAttr *properties;/* properties list */
    xmlNs           *nsDef;     /* namespace definitions on this node */
    void            *psvi;	/* for type/PSVI informations */
    unsigned short   line;	/* line number */
    unsigned short   extra;	/* extra data for XPath/XSLT */
};

You can see that internally, libxml2 stores line number as an unsigned short.

Second, the size of int on most platforms is 4 bytes:

int main() {
  printf("sizeof(int) = %d\n", sizeof(int));
}

outputs

sizeof(int) = 4

But thanks for thinking of us! And thanks for using Nokogiri.

@pakore
Copy link
Author

pakore commented Mar 17, 2017

Mmm I see. And changing that in libxml2 breaks ABI. Too bad.

I guess they return long already although they have to cast from ushort to prepare the interface for when they can change the struct.

In this case nokogiri should also be already prepared and respect the return value of xmlGetLineNo, IMHO (unless it has performance issues). So when you update to a libxml2 with the fixed line, you won't have to remember to change this line in nokogiri and it will be fixed automatically, don't you think?
Summarizing: there are 2 bugs now (1 in nokogiri, 1 in libxml2), and we can have just 1 bug (the one in libxml2).

@flavorjones
Copy link
Member

I love your enthusiasm, @pakore, but there is in fact no bug here. If you can write a failing test, I'll reconsider my opinion.

@pakore
Copy link
Author

pakore commented Mar 17, 2017

I can, but I would have to mock xmlGetLineNo to return a proper long, because as of today, it never returns a long higher than 65535.

The bug is as simple as this: You are casting a long to an int in:
return INT2NUM(xmlGetLineNo(node));

You can get away with it so far because you know the long returned is never going to be over 65535 (due to the bug in libxml2). The moment they start to return a proper long, then you will have the bug as long as the line number is higher than MAX_INT. It may be tomorrow, it may be in 5 years. The question is if you want to take a reactive action or a proactive action about it.

I may try to write a unit test if i can get around mocking that function to return something higher than MAX_INT. I understand this is not urgent, but since the fix is immediate you can just include it in the next release.

Anyway, thanks for reading my messages and replying, appreciated!

@pakore
Copy link
Author

pakore commented Mar 17, 2017

Anyway, if there is an XML with more than 2147483648 lines....you deserve a bug :).

@knu
Copy link
Member

knu commented Mar 17, 2017

I guess we actually should use UINT2NUM().

@flavorjones
Copy link
Member

@knu @pakore I apologize for saying so, but given the underlying limitations of libxml2, there's no bug here, and libxml2 maintainers are on the record saying that they won't change this behavior until 2.0 because of the required ABI change.

@knu, if you want to change this, you're within your rights as a committer. but we can't write a failing test for this, so I honestly don't know why we're still talking about it.

flavorjones added a commit that referenced this issue Aug 14, 2021
- set BIG_LINES parse option by default which will allow Node#line to return large integers
- allow Node#line= to set large line numbers on text nodes

Fixes #1764, #1493, #1617, #1505, #1003, #533
flavorjones added a commit that referenced this issue Aug 14, 2021
feat(cruby): support line numbers larger than a short

---

**What problem is this PR intended to solve?**

As noted in #1493, #1617, #1505, #1003, and #533, libxml2 has not historically supported line numbers greater than a `short int`. Starting in libxml v2.9.0, setting the parse option `BIG_LINES` would allow tracking line numbers in longer documents.

Specifically this PR makes the following changes:

- set `BIG_LINES` parse option by default which will allow `Node#line` to return large integers
- allow `Node#line=` to set large line numbers on text nodes

Fixes #1764 

**Have you included adequate test coverage?**

Yes!

**Does this change affect the behavior of either the C or the Java implementations?**

JRuby's Xerces-based implementation did not suffer from this particular shortcoming, although its line number functionality is questionable in other ways (see #2177 / b32c875).
@flavorjones
Copy link
Member

This will be fixed in v1.13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants