Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for response with UTF-8 BOM #4976

Merged
merged 4 commits into from Aug 20, 2019

Conversation

eduardomourar
Copy link
Contributor

As per discussion in the PR Azure/msrest-for-python/#145, there some issues with server responses (specially from Microsoft) that have BOM in it. You can see errors in the tests from my branch here reproducing the same behavior when trying to parse both text and JSON.

This has been fixed by forcing encoding to utf-8-sig when HTTP header has signalized utf-8 or leave to chardet when no encoding has been identified. That way the parsing works as expected and no errors are thrown.

@codecov-io
Copy link

codecov-io commented Feb 9, 2019

Codecov Report

Merging #4976 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4976      +/-   ##
==========================================
+ Coverage   66.89%   66.94%   +0.04%     
==========================================
  Files          15       15              
  Lines        1577     1579       +2     
==========================================
+ Hits         1055     1057       +2     
  Misses        522      522
Impacted Files Coverage Δ
requests/models.py 77.45% <100%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c9309e4...19cff44. Read the comment docs.

@sigmavirus24
Copy link
Contributor

So I'm not certain this is actually the right place to fix this. I need to do more research to figure out how backwards compatible this actually is.

As a work-around, one can forcibly override encoding on a Response to deal with this. Especially in a library like you describe, if they know this is an issue they can (via their wrapper) do r.encoding = 'utf-8-sig' if r.encoding == 'utf-8' else r.encoding

@eduardomourar
Copy link
Contributor Author

@sigmavirus24 , thanks! We are using the approach you mentioned in the msrest library, but I believe other libraries will experience this issue. The issue being that RFC 7159 does not allow BOM and json python library throws an error rightfully so. I believe using utf-8-sig encode we will be normalizing the data and it will avoid errors when parsing as JSON, XML, etc.

@sigmavirus24
Copy link
Contributor

RFC 7159 does not allow BOM's on UTF8, correct. How does that apply to XML?

@sigmavirus24
Copy link
Contributor

Aha, to quote the codecs documentation:

To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. As it’s rather improbable that any charmap encoded file starts with these byte values (which would e.g. map to

    LATIN SMALL LETTER I WITH DIAERESIS
    RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    INVERTED QUESTION MARK

in iso-8859-1), this increases the probability that a utf-8-sig encoding can be correctly guessed from the byte sequence. So here the BOM is not used to be able to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding. On encoding the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.

The important part here is that utf-8-sig is special in that it will ignore Microsoft's nonsense that they implemented against the specification. (On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.) This leads me to believe (probably incorrectly) that this would be backwards compatible with what we had previously.

@eduardomourar
Copy link
Contributor Author

RFC 7159 does not allow BOM's on UTF8, correct. How does that apply to XML?

It doesn't apply directly. But, if you take a look into the test log from my branch (https://travis-ci.com/eduardomourar/requests/jobs/176565327), an error will be thrown when running response.text. And usually that would be the input for libraries such as ElementTree.

@eduardomourar
Copy link
Contributor Author

Regarding backward compatibility, I believe we are covered too. The code will influence as little as possible because we are not replacing Response.encoding we are only temporary switching it when getting the decoded content. The only issue that user might find is that the size will be 3 bytes less than before (like you mentioned).

@eduardomourar
Copy link
Contributor Author

@sigmavirus24 , is there anything needed here from my side?

@eduardomourar
Copy link
Contributor Author

@kennethreitz, @sigmavirus24 and @nateprewitt , could you help with this PR, please?

@kennethreitz
Copy link
Contributor

fantastic work!

@kennethreitz kennethreitz merged commit a71e224 into psf:master Aug 20, 2019
@eduardomourar eduardomourar deleted the feature/strip-utf8-bom branch January 26, 2020 18:38
nateprewitt added a commit to nateprewitt/requests that referenced this pull request Feb 18, 2020
This reverts commit 19cff44.
This reverts commit 9e27326.
This reverts commit f507a3e.
sethmlarson pushed a commit that referenced this pull request Feb 19, 2020
This reverts commit 19cff44.
This reverts commit 9e27326.
This reverts commit f507a3e.
aless10 pushed a commit to aless10/requests that referenced this pull request Feb 19, 2020
This reverts commit 19cff44.
This reverts commit 9e27326.
This reverts commit f507a3e.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants