InnerStartIndex value is wrong in nested elements when sequence is escaped. #508

Open
cmhernandezdel opened this issue Jul 24, 2023 · 1 comment

1. Description

InnerStartIndex value is wrong in nested elements when sequence is escaped.

3. Fiddle or Project

Fiddle that reproduces the issue: https://dotnetfiddle.net/zi7dBK

If you change line 14 of the Fiddle to the following:

var html = """<p class="test"><span class="text">Hola, soy Carlos. <br> Encantado <a href="#popup">de ayudarte</a>.</span></p>""";

then it works.

4. Any further technical details

  • HAP version: 1.11.50.
  • .NET version: net6.0.

JonathanMagnan self-assigned this Jul 24, 2023

elgonzo (Contributor) commented Jul 24, 2023

Well, it's entirely unclear what InnerStartIndex, and OuterStartIndex for that matter, are supposed to denote.

Note that the documentation comments for HtmlNode.InnerStartIndex and HtmlNode.OuterStartIndex

/// <summary>
/// Gets the stream position of the area between the opening and closing tag of the node, relative to the start of the document.
/// </summary>
public int InnerStartIndex

and

/// <summary>
/// Gets the stream position of the area of the beginning of the tag, relative to the start of the document.
/// </summary>
public int OuterStartIndex

say "Gets the stream position". What precisely does that mean? The wording, especially "stream", does not align well with "parsed HTML document"; it points to the source of the parsed HTML document instead. Then again, no streams are involved in your example, since the source itself is a string. But look at the HtmlDocument.LoadHtml(string) implementation and you will notice that a StringReader is created for the source string, which gives me the impression that "stream" in this context is meant to be "source"...

But it gets more complicated and a bit messy. Note the public HtmlDocument.Text field (without any meaningful documentation comments), which seems to provide the original un-parsed source text rather than a representation of the parsed document, as HtmlNode.OuterHtml does. See here for an illustrative example: https://dotnetfiddle.net/N9pNLp

InnerStartIndex and OuterStartIndex seem to correspond to positions in the string held by HtmlDocument.Text.
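
One way to check that impression is to slice HtmlDocument.Text at the reported indices and look at what comes out. The following is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and a compiler that supports raw string literals (C# 11); the HTML is the "working" variant from the issue description, and the choice of the `a` element as the nested node is mine, not taken from the Fiddle.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // HTML copied from the issue description (the variant reported as working).
        var html = """<p class="test"><span class="text">Hola, soy Carlos. <br> Encantado <a href="#popup">de ayudarte</a>.</span></p>""";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pick a nested element; "//a" is an arbitrary choice for illustration.
        var anchor = doc.DocumentNode.SelectSingleNode("//a");

        Console.WriteLine($"OuterStartIndex = {anchor.OuterStartIndex}");
        Console.WriteLine($"InnerStartIndex = {anchor.InnerStartIndex}");

        // If the indices are offsets into the un-parsed source, the first slice
        // should begin at the opening <a ...> tag and the second at its content.
        Console.WriteLine(doc.Text.Substring(anchor.OuterStartIndex));
        Console.WriteLine(doc.Text.Substring(anchor.InnerStartIndex));
    }
}
```

Running the same check against the failing input from the Fiddle should make it visible where the reported InnerStartIndex and the source text diverge.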

And then there is also the HtmlDocument.ParsedText property:

/// <summary>Gets the parsed text.</summary>
/// <value>The parsed text.</value>
public string ParsedText
{
    get { return Text; }
}

which claims to provide the parsed text, but in reality is just a proxy property for HtmlDocument.Text, which seems to provide the un-parsed source text. No idea what's up with the HtmlDocument.Text field and the HtmlDocument.ParsedText property or how they are supposed to work, but something isn't right with these two.
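
The relationship between the two members can be observed directly. This is a rough sketch, again assuming the HtmlAgilityPack package is referenced; the deliberately malformed input string is made up for illustration and is not from the issue.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical malformed input: the <b> and <p> tags are never closed.
        var source = "<p>unclosed <b>bold";

        var doc = new HtmlDocument();
        doc.LoadHtml(source);

        // Given the getter quoted above, ParsedText hands back the Text field,
        // so both are expected to refer to the same string.
        Console.WriteLine(ReferenceEquals(doc.Text, doc.ParsedText));

        // Compare the stored text with the original source and with the
        // serialized form of the parsed tree.
        Console.WriteLine(doc.Text == source);
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

If Text equals the source while OuterHtml differs from it, that would support the reading that Text (and therefore ParsedText) holds the un-parsed input.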
