Some nodes are lost #487

Closed
wpyok500 opened this issue Oct 6, 2022 · 6 comments

Comments

@wpyok500

wpyok500 commented Oct 6, 2022

Here is what to include in your request to make sure we implement a solution as quickly as possible.

1. Description

Some nodes are lost

2. Exception

url: http://www.weather.com.cn/weather/101230203.shtml

[Screenshot: QQ截图20221006230441]

[Image: 图片1]

Exception message:
Stack trace:

3. Fiddle or Project

If you are able,

Provide a Fiddle that reproduces the issue: https://dotnetfiddle.net/25Vjsn

Or provide a project/solution that we can run to reproduce the issue.

  • Make sure the project compiles
  • Make sure to provide only the code that is required to reproduce the issue, not the whole project
  • You can send private code here: info@zzzprojects.com

Otherwise, make sure to include as much information as possible to help our team to reproduce the issue.

4. Any further technical details

Add any relevant details that can help us, such as:

  • HAP version:
  • .NET version (net45, .NET Core 3.1, etc.): net472

@JonathanMagnan
Member

Hello @wpyok500 ,

You are comparing the HAP output with the rendered HTML (that is, the HTML after some JavaScript has run).

If you check the page source, you will see that the node has no value there either:

<em>
<span title="无持续风向" class="NNW"></span>
<span title="无持续风向" class="NNW"></span>
</em>

See the following answer to have more information about this kind of issue: #482 (comment)

There are 2 ways to solve it:

  • Use LoadFromBrowser if you are still on .NET Framework: https://html-agility-pack.net/from-browser
  • Use the Selenium library, which lets you interact with the HTML once the page has loaded.

Best Regards,

Jon


@wpyok500
Author

wpyok500 commented Oct 7, 2022

@JonathanMagnan
The tag

<i><3级</i>

is present after HtmlDocument.LoadHtml() is called, but it is lost after calling SelectNodes.
The same happens with HtmlWeb().

    // Requires: using System.Net; using System.Text; using HtmlAgilityPack;
    WebClient web = new WebClient();
    web.Encoding = Encoding.UTF8;
    var t = web.DownloadDataTaskAsync("http://www.weather.com.cn/weather/101230203.shtml");
    string html = Encoding.UTF8.GetString(t.Result); // blocks until the download completes

    //var web2 = new HtmlWeb();
    //var doc = web2.Load("http://www.weather.com.cn/weather/101230203.shtml");
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html); // The document contains the <i><3级</i> tags
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*[@id=\"7d\"]/ul"); // Problem: the wind force "<3级" is lost; the <i><3级</i> tags are missing after SelectNodes
    // The CSS selectors below require the NuGet package HtmlAgilityPack.CssSelectors
    HtmlNode hnode = doc.DocumentNode.QuerySelector(".t.clearfix"); // Problem: the wind force "<3级" is lost
    HtmlNode hnode1 = doc.DocumentNode.QuerySelector(".c7d");

@wpyok500
Author

wpyok500 commented Oct 7, 2022

LoadFromBrowser() does not exist

image

@elgonzo
Contributor

elgonzo commented Oct 7, 2022

HtmlAgilityPack does not simply lose the content of the <i> nodes when using SelectNodes.

Proof: https://dotnetfiddle.net/rtBnoM

Note the <i><3级</i> appearing in the output of the inner html of the selected <ul> node.

If you still see the problem with SelectNodes, double-check that you are using the latest HtmlAgilityPack version.

@JonathanMagnan
Member

Hello @wpyok500 ,

Sorry, you are indeed right: the text is part of the page source. I checked the wrong part of the source.

However, @elgonzo is also right: in the latest version, no bug seems to exist. Here is another Fiddle, copied from @elgonzo's, that also covers the QuerySelector part. All 3 outputs return the value without any problem: https://dotnetfiddle.net/hfma4k

Best Regards,

Jon

@wpyok500
Author

wpyok500 commented Oct 8, 2022

@elgonzo @JonathanMagnan hello

I found that the <i> tag was lost when using 1.11.46. After rolling back to 1.11.45, the <i> tag was no longer lost. After upgrading to 1.11.46 again, the problem did not recur.

image
