Text Starting with "<-" are Ignored #126

Open
akshayphilar opened this issue Oct 25, 2018 · 3 comments

Comments

@akshayphilar
Contributor

akshayphilar commented Oct 25, 2018

Text starting with "<-" within the HTML body is silently dropped from the parsed output. Examples follow.

Note: XML tag names starting with a hyphen are invalid as per the W3C XML spec
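
To illustrate the note, here is a minimal sketch using only the standard library's strict XML parser. The helper name `is_well_formed_xml` is made up for this illustration; it only shows that a name starting with a hyphen is rejected as XML, not how the HTML parser used by `Selector` behaves.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if the strict XML parser accepts `text` (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A name starting with a letter is fine.
print(is_well_formed_xml("<title><Thor></Thor></title>"))      # True
# A name starting with a hyphen is not a valid XML name.
print(is_well_formed_xml("<title><-Thor></-Thor></title>"))    # False
```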

Example 1

>>> from parsel import Selector  # assumed import; not shown in the original report
>>> html = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title></title><div>Release Date</div></body></html>'

Example 2

>>> html = '<html><body><title><-Thor></title></body></html>'
>>> Selector(html).extract()
'<html><body><title></title></body></html>'

Example 3

>>> html = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>'

@akshayphilar akshayphilar changed the title HTML Tags Name Beginning with a "-" are Ignored Content in HTML Tags Beginning with a "-" are Ignored Oct 25, 2018
@akshayphilar akshayphilar changed the title Content in HTML Tags Beginning with a "-" are Ignored Text within HTML Tags Beginning with a "-" are Ignored Oct 25, 2018
@akshayphilar akshayphilar changed the title Text within HTML Tags Beginning with a "-" are Ignored Text Starting with "<-" are Ignored Oct 25, 2018

@Gallaecio
Member

It’s invalid HTML, nonetheless.

I wonder if any of the suggested alternative parsers support it…

@sortafreel
Contributor

sortafreel commented Oct 20, 2019

@Gallaecio @akshayphilar Only lxml drops the text; both Python's html.parser and html5lib keep it. Still, I'm not sure how to fix this while still using lxml; they've been ignoring some similar bugs (like tag replacement) for ages :)

In [1]: from bs4 import BeautifulSoup                                                                                                

In [2]: html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'                                      

In [3]: html_2 = '<html><body><title><-Thor></title></body></html>'                                                                  

In [4]: html_3 = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'

In [7]: soup1_hp = BeautifulSoup(html_1, "html.parser")                                                                              

In [8]: soup1_lxml = BeautifulSoup(html_1, "lxml")                                                                                   

In [9]: soup1_html5 = BeautifulSoup(html_1, "html5lib")                                                                              

In [10]: soup2_hp = BeautifulSoup(html_2, "html.parser")                                                                             

In [11]: soup2_lxml = BeautifulSoup(html_2, "lxml")                                                                                  

In [12]: soup2_html5 = BeautifulSoup(html_2, "html5lib")                                                                             

In [13]: soup3_hp = BeautifulSoup(html_3, "html.parser")                                                                             

In [14]: soup3_lxml = BeautifulSoup(html_3, "lxml")                                                                                  

In [15]: soup3_html5 = BeautifulSoup(html_3, "html5lib")                                                                             

In [16]: html_1                                                                                                                      
Out[16]: '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'

In [17]: soup1_hp                                                                                                                    
Out[17]: <html><body><title>&lt;-Avengers-&gt;</title><div>Release Date</div></body></html>

In [18]: soup1_lxml                                                                                                                  
Out[18]: <html><body><title></title><div>Release Date</div></body></html>

In [19]: soup1_html5                                                                                                                 
Out[19]: <html><head></head><body><title>&lt;-Avengers-&gt;</title><div>Release Date</div></body></html>

In [20]: html_2                                                                                                                      
Out[20]: '<html><body><title><-Thor></title></body></html>'

In [21]: soup2_hp                                                                                                                    
Out[21]: <html><body><title>&lt;-Thor&gt;</title></body></html>

In [22]: soup2_lxml                                                                                                                  
Out[22]: <html><body><title></title></body></html>

In [23]: soup2_html5                                                                                                                 
Out[23]: <html><head></head><body><title>&lt;-Thor&gt;</title></body></html>

In [24]: html_3                                                                                                                      
Out[24]: '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'

In [25]: soup3_hp                                                                                                                    
Out[25]: <html><body><title>&lt;-<span>Avengers</span>-&gt;</title><div>Release Date</div></body></html>

In [26]: soup3_lxml                                                                                                                  
Out[26]: <html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>

In [27]: soup3_html5                                                                                                                 
Out[27]: <html><head></head><body><title>&lt;-&lt;span&gt;Avengers&lt;/span&gt;-&gt;</title><div>Release Date</div></body></html>
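
The html.parser behaviour for the first two examples can also be checked without BeautifulSoup. The following is a rough sketch using only the standard library; the `TitleText` class and `title_text` function are names made up for this illustration. Example 3 is left out because the handling of markup inside `<title>` may differ between parser versions.

```python
from html.parser import HTMLParser


class TitleText(HTMLParser):
    """Collect the text found between <title> and </title> (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # The text may arrive in several chunks, so collect and join them later.
        if self.in_title:
            self.chunks.append(data)


def title_text(html):
    parser = TitleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
html_2 = '<html><body><title><-Thor></title></body></html>'
print(title_text(html_1))
print(title_text(html_2))
```

Since "<-" does not start a valid tag, the standard-library parser hands it over as ordinary text, which matches the `html.parser` outputs shown above.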

@Gallaecio
Member

We’ll probably have to support alternative parsers. Doing so would solve a handful of issues currently reported.
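
Until alternative parsers are available, one possible stopgap is to escape any "<" that cannot start a tag, comment, or declaration before handing the markup to the parser. This is only a sketch under that assumption, not something the project does; `escape_stray_lt` is a hypothetical helper, and it would also rewrite "<" inside `<script>` or `<style>` content, so it is not safe for arbitrary pages.

```python
import re

# A "<" is treated as stray when it is NOT followed by a letter, "/", "!" or "?",
# i.e. when it cannot begin a start tag, end tag, comment/doctype, or processing
# instruction. This is a simplification, not a full HTML tokenizer.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html):
    """Replace stray "<" characters with "&lt;" (hypothetical workaround)."""
    return _STRAY_LT.sub("&lt;", html)


html = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
print(escape_stray_lt(html))
# <html><body><title>&lt;-Avengers-></title><div>Release Date</div></body></html>
```

The escaped string could then be passed to the selector as usual, with the stray text arriving as ordinary character data.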
