
Fix test issue caused by change in Requests library #1762

Closed

Conversation

MaybeNetwork
Contributor

Feature Summary and Justification

Requests recently changed the library it uses to detect content encoding. Under Python 3, if chardet (the old detection library) is not present on your system, the new library charset_normalizer is used instead. Unfortunately, charset_normalizer has trouble detecting the content encoding when the content is small, which causes the test TestSubreddit.test_submit_image_large to fail. The solution offered here is to specify the content encoding explicitly in the affected test.

Here is an example of the failed test, which occurred in #1760. That PR just happened to be the first one opened after the test environment was updated; the change proposed in it was not responsible for the test's failure.

In the affected test,

reddit.subreddit("test").submit_image("test", tempfile.name)

is fed a mock response, constructed as

response = requests.Response()
response._content = mock_data.encode("utf-8")
response.status_code = 400

Subreddit.submit_image calls Subreddit._submit_media, which calls Subreddit._parse_xml_response, which takes response defined above and sets

xml = response.text

Since the character encoding in response wasn't specified, Requests's text property falls back to the apparent encoding. The test environment doesn't have chardet, so charset_normalizer is called to guess the encoding, and it incorrectly guesses UTF-16BE. The response bytes are then decoded with the wrong encoding, producing text that xml.etree.ElementTree.XML fails to parse.
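The garbling is mechanical: UTF-16BE consumes the byte stream two bytes at a time, so every pair of ASCII bytes fuses into a single (mostly CJK) code point. A stdlib-only sketch of the mis-decode, using the start of the mock payload:

```python
# The mock body is plain ASCII, encoded as UTF-8 (one byte per character).
payload = '<?xml version="1.0" encoding="UTF-8"?>'.encode("utf-8")

# Decoding it as UTF-16BE packs each byte pair into one 16-bit code unit:
# '<' (0x3C) + '?' (0x3F) become the single code point U+3C3F ('㰿').
garbled = payload.decode("utf_16_be")
print(garbled)  # starts with '㰿硭氠' — the same characters as in the failed test
```

This is why the test output shows CJK text: the bytes are untouched, only the decoder's framing is wrong.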

Here is a script that shows more details:

import chardet
import charset_normalizer

def main():
    mock_data = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Error>"
        "<Code>EntityTooLarge</Code>"
        "<Message>Your proposed upload exceeds the maximum allowed size</Message>"
        "<ProposedSize>20971528</ProposedSize>"
        "<MaxSizeAllowed>21971520</MaxSizeAllowed>"
        "<RequestId>23F056D6990D87E0</RequestId>"
        "<HostId>iYEVOuRfbLiKwMgHt2ewqQRIm0NWL79uiC2rPLj9P0PwW554MhjY2/O8d9JdKTf1iwzLjwWMnGQ=</HostId>"
        "</Error>"
    )
    chardet_detection = chardet.detect(mock_data.encode('utf-8'))
    charset_normalizer_detection = charset_normalizer.detect(mock_data.encode('utf-8'))
    print(chardet_detection)
    print(mock_data.encode('utf-8').decode(chardet_detection['encoding']))
    print(charset_normalizer_detection)
    print(mock_data.encode('utf-8').decode(charset_normalizer_detection['encoding']))

if __name__ == "__main__":
    main()

which produces this output:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>20971528</ProposedSize><MaxSizeAllowed>21971520</MaxSizeAllowed><RequestId>23F056D6990D87E0</RequestId><HostId>iYEVOuRfbLiKwMgHt2ewqQRIm0NWL79uiC2rPLj9P0PwW554MhjY2/O8d9JdKTf1iwzLjwWMnGQ=</HostId></Error>
{'encoding': 'utf_16_be', 'language': '', 'confidence': 0.867}
㰿硭氠癥牳楯渽∱⸰∠敮捯摩湧㴢啔䘭㠢㼾㱅牲潲㸼䍯摥㹅湴楴祔潯䱡牧攼⽃潤放㱍敳獡来㹙潵爠灲潰潳敤⁵灬潡搠數捥敤猠瑨攠浡硩浵洠慬汯睥搠獩穥㰯䵥獳慧放㱐牯灯獥摓楺放㈰㤷ㄵ㈸㰯偲潰潳敤卩穥㸼䵡硓楺敁汬潷敤㸲ㄹ㜱㔲〼⽍慸卩穥䅬汯睥搾㱒敱略獴䥤㸲㍆〵㙄㘹㤰䐸㝅〼⽒敱略獴䥤㸼䡯獴䥤㹩奅噏畒晢䱩䭷䵧䡴㉥睱兒䥭ぎ坌㜹畩䌲牐䱪㥐ぐ睗㔵㑍桪夲⽏㡤㥊摋呦ㅩ睺䱪睗䵮䝑㴼⽈潳瑉搾㰯䕲牯爾

You can see that the last line is the same garbled CJK string shown in the failed test.

References

@bboe
Member

bboe commented Jul 16, 2021

I see that this gets the test to pass. Does Reddit's actual response, if the request were issued, include the actual encoding?

@MaybeNetwork
Contributor Author

I see that this gets the test to pass. Does Reddit's actual response, if the request were issued, include the actual encoding?

No, it does not. Another option is to leave the test as it was before and force the response to be decoded as UTF-8 by changing

def _parse_xml_response(self, response: Response):
    """Parse the XML from a response and raise any errors found."""
    xml = response.text

to

def _parse_xml_response(self, response: Response):
    """Parse the XML from a response and raise any errors found."""
    response.encoding = "utf-8"
    xml = response.text

This also makes the test pass, and would help users that don't have chardet. There may be downsides to this, though, as I don't know if Reddit always returns that particular response in UTF-8.
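To illustrate why pinning the encoding sidesteps detection entirely, here is a minimal stand-in for the relevant slice of Requests's behavior. FakeResponse is a hypothetical class written for this sketch, not the real requests.Response; it simulates a detector that wrongly guesses UTF-16BE whenever encoding is unset:

```python
class FakeResponse:
    """Hypothetical stand-in for the relevant part of requests.Response."""

    def __init__(self, content: bytes):
        self._content = content
        self.encoding = None  # None when no charset header is present

    @property
    def text(self) -> str:
        # With encoding unset, fall back to the detector's guess; here we
        # simulate the bad guess of UTF-16BE described above.
        guessed = self.encoding or "utf_16_be"
        return self._content.decode(guessed, errors="replace")


resp = FakeResponse(b'<?xml version="1.0" encoding="UTF-8"?>')
print(resp.text)  # garbled CJK text from the wrong guess

resp.encoding = "utf-8"  # the proposed fix: pin the encoding before .text
print(resp.text)  # '<?xml version="1.0" encoding="UTF-8"?>'
```

Setting encoding before the first access to text is what the one-line change in _parse_xml_response does; detection is never consulted.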

@MaybeNetwork
Contributor Author

charset_normalizer was updated to version 2.0.3 several hours after I submitted this, and the test no longer fails.

There may be people out there who have versions <2.0.3, which, in addition to the troubles described above, have a bug that causes the library to fail to detect any encoding at all. I encountered that bug while running tests on an actual response returned by Reddit, so it's possible that others could run into it too.

I restored the test and added the one line that forces UTF-8 instead. If that feature isn't worth adding, this PR can be closed.

@github-actions

github-actions bot commented Aug 6, 2021

This PR is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label Aug 6, 2021
@github-actions

This PR was closed because it has been stale for 10 days with no activity.

@github-actions github-actions bot added the Auto-closed - Stale Automatically closed due to being stale for too long label Aug 16, 2021
@github-actions github-actions bot closed this Aug 16, 2021