HtmlNode.InnerText not working properly (JS innerText works normally), maybe a bug? #317

Closed
AtlantisDe opened this issue Jul 17, 2019 · 17 comments
AtlantisDe commented Jul 17, 2019

Description

The HTML source: https://auto.qq.com/a/20120202/000205.htm

You can use this code to test:

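// Requires System.IO, System.Text, and the HtmlAgilityPack package.
// Load a locally saved copy of the page above.
var source = File.ReadAllText(@"D:\Tmp\auto.qq.com-001.html", Encoding.UTF8);
HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(source);
// Select the article body (a div whose class contains 'bd').
var content = htmlDocument.DocumentNode.SelectSingleNode("//div[contains(@class, 'bd')]");
var InnerText = content.InnerText;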

The exception from HtmlAgilityPack.HtmlNode.InnerText looks like this (screenshot: content.InnerText):

// Getting InnerText does not work
var InnerText = content.InnerText;

In the Google Chrome console, the equivalent JavaScript works normally (screenshot omitted):

document.querySelector("#C-Main-Article-QQ > div.bd").innerText

Please help. Thanks very much.
JonathanMagnan self-assigned this Jul 17, 2019
@JonathanMagnan
Member

Hello @AtlantisDe ,

Thank you for reporting, we will look at it.

Best Regards,

Jonathan

@JonathanMagnan
Member

Hello @AtlantisDe ,

v1.11.10 has been released.

Could you try it and let me know if this issue is correctly fixed on your side?

@Kinematics

Just a note that something broke for me in 1.11.10 related to InnerText.

Code:

doc.Element("html").Element("head")?.Element("title")?.InnerText

That returns the expected text for the page title when run in 1.11.9, but an empty string in 1.11.10. 1.11.10 now puts the title in InnerHtml instead.

@AtlantisDe
Author

> Hello @AtlantisDe ,
>
> The v1.11.10 has been released.
>
> Could you try it and let me know if this issue is correctly fixed on your side.

Thanks, it works now.

@AtlantisDe
Author

> Just a note that something broke for me in 1.11.10 related to InnerText.
>
> Code:
>
> doc.Element("html").Element("head")?.Element("title")?.InnerText
>
> That returns the expected text for the page title when run in 1.11.9, but an empty string in 1.11.10. 1.11.10 now puts the title in InnerHtml instead.

Yes, I see the same thing.

This still returns the title:

var title_1 = htmlDocument.DocumentNode.Element("html").Element("head").Element("title").InnerHtml;

But this is broken:

var title_2 = htmlDocument.DocumentNode.Element("html").Element("head").Element("title").InnerText;

AtlantisDe reopened this Jul 20, 2019
@JustArchi

I can confirm what @Kinematics said above: 1.11.10 has a fatal regression regarding InnerText, and cases that previously worked fine no longer do. Please investigate.

JustArchi added a commit to JustArchiNET/ArchiSteamFarm that referenced this issue Jul 20, 2019
@JonathanMagnan
Member

I removed the latest version from NuGet.

I will get it fixed on Monday.

Best Regards,

Jonathan

@JonathanMagnan
Member

Hello all,

v1.11.11 has been released.

Now only script and style content is ignored in InnerText.

Let me know if that version is working as expected.

@AtlantisDe
Author

Working normally. Thank you very much.

@cyotek

cyotek commented Jul 23, 2019

Apologies for replying to a closed issue. I've just updated to 1.11.11 and been bitten by this change, with several tests failing, specifically because the style element now returns an empty string for InnerText. I've worked around the issue by checking whether the child itself is of type Text and, if so, reading InnerText directly from the child. So it is easy enough to work around, but I was curious about the rationale behind the change. I came across this issue whilst checking whether anyone else had the same problem or whether I needed to file a new bug.
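For illustration, the workaround described above could look roughly like the following sketch. `GetDirectText` is a hypothetical helper name, not part of HtmlAgilityPack; the sketch assumes the HtmlAgilityPack NuGet package is referenced and relies only on `ChildNodes`, `NodeType`, and `InnerText`.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class HtmlNodeTextExtensions
{
    // Hypothetical helper: concatenates the text of a node's direct Text children,
    // bypassing whatever filtering the parent's own InnerText applies.
    public static string GetDirectText(this HtmlNode node)
    {
        var sb = new StringBuilder();
        foreach (HtmlNode child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                // Read InnerText from the text node itself, not from its parent.
                sb.Append(child.InnerText);
            }
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><head><style>p { color: red; }</style></head><body><p>hi</p></body></html>");

        // The style element exists in this sample, so the lookup is not expected to be null.
        HtmlNode style = doc.DocumentNode.SelectSingleNode("//style");
        Console.WriteLine(style.GetDirectText());
    }
}
```

Only direct children are inspected, so text nested inside child elements is deliberately left out; that is enough for elements like style or script whose content is a single text child.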

I tested opening a page in Firefox where a style element was present in a head element and then executing document.getElementsByTagName("head")[0].innerText and document.getElementsByTagName("style")[0].innerText in turn. The text for the head included the title text and the style element content. The text for style included the style. This matches the behaviour of 1.11.8 (I never got around to updating to 1.11.9 or 1.11.10) but not the behaviour of 1.11.11.

Next I tested another page which had JavaScript, and calling innerText on the script object returned the actual JavaScript. Calling innerText on the parent container did not include the JavaScript. I don't have any tests which specifically examine the contents of script tags, so I don't know whether this matches the old HAP behaviour or not.

I think therefore that potentially the new implementation is still flawed, at least in regards to style, as calling innerText directly on a script or style element in a browser console returns the content as expected. Calling it on a parent element containing either of these elements returns the CSS for style and nothing for script.

I tested this in Firefox 68.

Don't know if this is useful information or not, but I'm going to revert back to 1.11.8 until I know if I really need to start examining style elements differently or not.

Thanks;
Richard Moss

@JonathanMagnan
Member

Hello @cyotek ,

Thank you for reporting,

It looks like you are partly right. The text in the script tag can appear in the head InnerText but not in the body. It's not as simple as showing it or hiding it; it depends on the parent tag.

We will look at it this week and try to have it work as the browser does.

@JustArchi

JustArchi commented Jul 24, 2019

@JonathanMagnan I've tried latest 1.11.11 and it suffers from the same issue as .10.

In particular, I'm doing InnerText of //div[@class='pagecontent']/script in order to extract the script content for my usage. With last two releases it returns null there.

Maybe I'm doing something wrong or don't understand the issue, but this used to work until now. Let me know if you need a reproducible case, but I'm pretty sure this will happen with any InnerText of script. Alternatively, if this is intended, then you should mark it as a breaking change and offer a proper replacement, since personally I have no clue what I'm supposed to use instead.

@JonathanMagnan
Member

Hello All,

v1.11.12 has been released.

script and style text will now only appear in InnerText when it is read from a head, script, or style node.
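One way to check how a given HAP version applies this rule is to read InnerText from each kind of node on a small document. The following is a minimal probe, assuming the HtmlAgilityPack NuGet package is referenced; the sample HTML and the `Probe` helper are made up for illustration, and no output is shown because it depends on the installed version.

```csharp
using System;
using HtmlAgilityPack;

public static class InnerTextProbe
{
    // Made-up sample: a style element in head and a script element in body.
    public const string SampleHtml =
        "<html><head><title>Demo</title><style>p { color: red; }</style></head>" +
        "<body><p>Hello</p><script>var x = 1;</script></body></html>";

    // Hypothetical helper: returns InnerText of the first node matching the XPath,
    // or null when nothing matches.
    public static string Probe(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }

    public static void Main()
    {
        // Compare what each starting node reports for the same document.
        foreach (string xpath in new[] { "//head", "//style", "//body", "//body/script" })
        {
            Console.WriteLine($"{xpath}: [{Probe(SampleHtml, xpath)}]");
        }
    }
}
```

Running the same probe against two package versions makes it easy to see whether an upgrade changes what a particular XPath returns.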

This fixes your issues, @JustArchi and @cyotek.

However, if more errors are reported, we might just roll back all these changes or add an option to enable the current behavior, since this kind of change currently breaks some code, which is not something we like to do.

Let me know if everything now works as expected.

@JustArchi

I'm not any less confused than I was before, but I can confirm that 1.11.12 works again for my use cases, thank you 😅.

@cyotek

cyotek commented Jul 25, 2019

Hello,

I just updated to 1.11.12 and can confirm none of my tests have failed, so all seems to be well with the new build. Hopefully it addresses the OP's issue too!

Thanks again for the fast response and fix.

Regards;
Richard Moss

@Kinematics

Likewise confirming that 1.11.12 is working fine for me.

@AtlantisDe
Author

Hello all, I will close this. If you have any issues, please reopen it.
