
Scrapy can not auto detect GBK html encoding #155

Open
samuelchen opened this issue Mar 5, 2020 · 3 comments
@samuelchen

Hi,

Thank you for the great framework.

I am using Scrapy to crawl multiple sites, and the sites use different encodings.
One site is encoded as 'gbk' and declares that in its HTML meta tag, but Scrapy cannot auto-detect the encoding.

I tried Beautiful Soup, and it parses the page correctly. So I dug into w3lib and found that the pattern
_BODY_ENCODING_BYTES_RE does not find the encoding in the meta tag.

The HTML snippet is below:

b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'

My test:

>>> from w3lib.encoding import html_body_declared_encoding
>>> b
b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
>>> html_body_declared_encoding(b)
>>> enc = html_body_declared_encoding(b)
>>> enc
>>> print('"%s"' % enc)
"None"
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(b)
>>> soup.title
<title>网站地图</title>
>>> soup.original_encoding
'gbk'
>>>
@Gallaecio Gallaecio added the bug label Mar 12, 2020
@kostalski

Hi @samuelchen @Gallaecio ,

The source of the encoding-detection problem seems to be the invalid input HTML itself, not w3lib. The meta tag is invalid: it reads <meta httpequiv="ContentType" ..., but to be valid (per W3C) it should be <meta http-equiv="Content-Type" ... (the dash characters are missing). Because of that, w3lib does not detect the declared encoding.

beautifulsoup4 detects the 'gbk' encoding because it uses a naive regex for fallback encoding detection (lib: beautifulsoup4, file: bs4/dammit.py, line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]').
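That fallback pattern does indeed match the broken tag. A quick check using the regex quoted above (applying re.IGNORECASE here is my own choice for the demonstration, not necessarily how bs4 applies it internally):

```python
import re

# The fallback pattern quoted from bs4/dammit.py
html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

broken = '<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
m = re.search(html_meta, broken, re.IGNORECASE)
print(m.group(1))  # -> gbk
```

The pattern only looks for charset= somewhere inside a meta tag, so it does not care whether http-equiv is spelled correctly.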

For @samuelchen's problem, w3lib could be made more forgiving/lenient by updating (lib: w3lib, file: w3lib/encoding.py)
From: _HTTPEQUIV_RE = _TEMPLATE % ('http-equiv', 'Content-Type')
To: _HTTPEQUIV_RE = _TEMPLATE % (r'http-?equiv', r'Content-?Type')
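The effect of the proposed change can be sketched with a simplified stand-in pattern (an assumption for illustration; w3lib's real _HTTPEQUIV_RE is built from _TEMPLATE and is more elaborate, but equally strict about the dashes):

```python
import re

# Simplified stand-ins for the strict and lenient patterns (assumption:
# these are not w3lib's actual regexes, just the dash difference in isolation).
strict = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    rb'charset\s*=\s*["\']?([\w-]+)',
    re.IGNORECASE,
)
lenient = re.compile(
    rb'<meta[^>]+http-?equiv\s*=\s*["\']?content-?type["\']?[^>]*'
    rb'charset\s*=\s*["\']?([\w-]+)',
    re.IGNORECASE,
)

broken = b'<meta httpequiv="ContentType" content="text/html; charset=gbk" />'
print(strict.search(broken))            # -> None: the dashless attribute never matches
print(lenient.search(broken).group(1))  # -> b'gbk'
```

Making the dashes optional is enough for the pattern to accept both the valid spelling and the dashless one in the reported HTML.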

After this fix w3lib would detect the encoding as gb18030. This should have no side effects, but I don't know if it is the right way ;)
What do you think, @Gallaecio?

More details below.


Details

I was able to reproduce the issue with the following setup:

  • Python 3.7.9
  • libs:
    -- beautifulsoup4==4.9.3
    -- html5lib==1.1
    -- lxml==4.6.1
    -- w3lib==1.22.0

Test python script:

from w3lib.encoding import html_body_declared_encoding
from bs4 import BeautifulSoup

b = b'<HTML>\r\n <HEAD>\r\n  <TITLE>\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc</TITLE>\r\n  <meta httpequiv="ContentType" content="text/html; charset=gbk" />\r\n  <META NAME="Keywords" CONTENT="\xe5\xd0\xa1\xcb\xb5,\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc">'
enc = html_body_declared_encoding(b)
print("html_body_declared_encoding: %s" % enc)

for parser in ['html5lib', 'html.parser', 'lxml']:
    soup = BeautifulSoup(b, parser)
    print("soup.original_encoding[parser:{}]: {}".format(parser, soup.original_encoding))

Script output:

html_body_declared_encoding: None
soup.original_encoding[parser:html5lib]: windows-1252
soup.original_encoding[parser:html.parser]: windows-1252
soup.original_encoding[parser:lxml]: gbk

BeautifulSoup detects the encoding only with the 'lxml' parser, via its fallback encoding detection:
lib: beautifulsoup4
file: bs4/dammit.py,
line: html_meta = '<\\s*meta[^>]+charset\\s*=\\s*["\']?([^>]*?)[ /;\'">]'

@samuelchen

samuelchen commented Nov 22, 2020

@kostalski Thank you for the feedback. I cannot recall why that HTML had httpequiv="ContentType". I am not sure whether it was converted by another part of Scrapy or was like that originally; I am sorry, it was too long ago to remember.
By the way, GB18030 is backward compatible with GBK.
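That compatibility claim is easy to sanity-check: the GBK-encoded <TITLE> bytes from the snippet decode identically under both codecs, since GB18030 is a superset of GBK for these two-byte sequences:

```python
title = b'\xcd\xf8\xd5\xbe\xb5\xd8\xcd\xbc'  # the <TITLE> bytes from the snippet

# Both codecs map these two-byte sequences to the same characters.
assert title.decode('gbk') == title.decode('gb18030')
print(title.decode('gb18030'))  # -> 网站地图 ("site map")
```

So reporting gb18030 instead of gbk should decode this page's content the same way.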

@kostalski

Ok @samuelchen, no problem 👍
