Fix for response with UTF-8 BOM #4976

eduardomourar · 2019-02-09T21:30:47Z

As per discussion in the PR Azure/msrest-for-python/#145, there some issues with server responses (specially from Microsoft) that have BOM in it. You can see errors in the tests from my branch here reproducing the same behavior when trying to parse both text and JSON.

This has been fixed by forcing encoding to utf-8-sig when HTTP header has signalized utf-8 or leave to chardet when no encoding has been identified. That way the parsing works as expected and no errors are thrown.

codecov-io · 2019-02-09T21:41:38Z

Codecov Report

Merging #4976 into master will increase coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4976      +/-   ##
==========================================
+ Coverage   66.89%   66.94%   +0.04%     
==========================================
  Files          15       15              
  Lines        1577     1579       +2     
==========================================
+ Hits         1055     1057       +2     
  Misses        522      522

Impacted Files	Coverage Δ
requests/models.py	`77.45% <100%> (+0.1%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9309e4...19cff44. Read the comment docs.

sigmavirus24 · 2019-02-11T12:59:30Z

So I'm not certain this is actually the right place to fix this. I need to do more research to figure out how backwards compatible this actually is.

As a work-around, one can forcibly override encoding on a Response to deal with this. Especially in a library like you describe, if they know this is an issue they can (via their wrapper) do r.encoding = 'utf-8-sig' if r.encoding == 'utf-8' else r.encoding

eduardomourar · 2019-02-12T01:35:32Z

@sigmavirus24 , thanks! We are using the approach you mentioned in the msrest library, but I believe other libraries will experience this issue. The issue being that RFC 7159 does not allow BOM and json python library throws an error rightfully so. I believe using utf-8-sig encode we will be normalizing the data and it will avoid errors when parsing as JSON, XML, etc.

sigmavirus24 · 2019-02-12T13:16:20Z

RFC 7159 does not allow BOM's on UTF8, correct. How does that apply to XML?

sigmavirus24 · 2019-02-12T13:23:09Z

Aha, to quote the codecs documentation:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

    LATIN SMALL LETTER I WITH DIAERESIS
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

The important part here is that utf-8-sig is special in that it will ignore Microsoft's nonsense that they implemented against the specification. (On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.) This leads me to believe (probably incorrectly) that this would be backwards compatible with what we had previously.

eduardomourar · 2019-02-12T14:01:35Z

RFC 7159 does not allow BOM's on UTF8, correct. How does that apply to XML?

It doesn't apply directly. But, if you take a look into the test log from my branch (https://travis-ci.com/eduardomourar/requests/jobs/176565327), an error will be thrown when running response.text. And usually that would be the input for libraries such as ElementTree.

eduardomourar · 2019-02-12T14:17:03Z

Regarding backward compatibility, I believe we are covered too. The code will influence as little as possible because we are not replacing Response.encoding we are only temporary switching it when getting the decoded content. The only issue that user might find is that the size will be 3 bytes less than before (like you mentioned).

eduardomourar · 2019-03-14T21:41:27Z

@sigmavirus24 , is there anything needed here from my side?

eduardomourar · 2019-05-19T10:26:17Z

@kennethreitz, @sigmavirus24 and @nateprewitt , could you help with this PR, please?

kennethreitz · 2019-08-20T04:26:38Z

fantastic work!

This reverts commit 19cff44. This reverts commit 9e27326. This reverts commit f507a3e.

Eduardo Rodrigues added 3 commits February 9, 2019 11:56

add utf8 with bom to test

f507a3e

add failing tests for bom

9e27326

fix response with utf8 bom

19cff44

Merge branch 'master' into feature/strip-utf8-bom

4bce5a0

kennethreitz merged commit a71e224 into psf:master Aug 20, 2019

eduardomourar deleted the feature/strip-utf8-bom branch January 26, 2020 18:38

nateprewitt added a commit to nateprewitt/requests that referenced this pull request Feb 18, 2020

Revert "Fix for response with UTF-8 BOM psf#4976"

b159fc3

This reverts commit 19cff44. This reverts commit 9e27326. This reverts commit f507a3e.

nateprewitt mentioned this pull request Feb 18, 2020

Revert incomplete PRs #5353

Merged

sethmlarson pushed a commit that referenced this pull request Feb 19, 2020

Revert "Fix for response with UTF-8 BOM #4976"

fd13816

This reverts commit 19cff44. This reverts commit 9e27326. This reverts commit f507a3e.

aless10 pushed a commit to aless10/requests that referenced this pull request Feb 19, 2020

Revert "Fix for response with UTF-8 BOM psf#4976"

d3fac9e

This reverts commit 19cff44. This reverts commit 9e27326. This reverts commit f507a3e.

github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for response with UTF-8 BOM #4976

Fix for response with UTF-8 BOM #4976

eduardomourar commented Feb 9, 2019

codecov-io commented Feb 9, 2019 •

edited

sigmavirus24 commented Feb 11, 2019

eduardomourar commented Feb 12, 2019

sigmavirus24 commented Feb 12, 2019

sigmavirus24 commented Feb 12, 2019

eduardomourar commented Feb 12, 2019

eduardomourar commented Feb 12, 2019

eduardomourar commented Mar 14, 2019

eduardomourar commented May 19, 2019

kennethreitz commented Aug 20, 2019

Fix for response with UTF-8 BOM #4976

Fix for response with UTF-8 BOM #4976

Conversation

eduardomourar commented Feb 9, 2019

codecov-io commented Feb 9, 2019 • edited

Codecov Report

sigmavirus24 commented Feb 11, 2019

eduardomourar commented Feb 12, 2019

sigmavirus24 commented Feb 12, 2019

sigmavirus24 commented Feb 12, 2019

eduardomourar commented Feb 12, 2019

eduardomourar commented Feb 12, 2019

eduardomourar commented Mar 14, 2019

eduardomourar commented May 19, 2019

kennethreitz commented Aug 20, 2019

codecov-io commented Feb 9, 2019 •

edited