Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filename normalisation of form-data/multipart file uploads (umlauts on Apple clients) #2625

Merged
merged 2 commits into from Dec 13, 2022

Conversation

Tronic
Copy link
Member

@Tronic Tronic commented Dec 13, 2022

Normalise filenames to Unicode NFC, such that Mac and iOS clients behave identically to other operating systems. Normally Apple devices use NFD which may cause trouble on other systems.

This patch only affects form-data/multipart file uploads, not downloads nor any uploads handled by client side Javascript (which may need additional normalisation by app developers). The problem only affects filenames (not text input fields and such which always seem to use NFC).

This affects specifically umlauts and other letters that may be decomposed as two characters. For instance, ä can be represented as \u00E4 (NFC) or as a\u0308 (NFD) i.e. a with COMBINING DIARESIS.

Some applications may already be doing such normalisation, and should not be affected (only the same work done twice). Otherwise without this PR applications see differently encoded names depending on which OS the client is running, but this patch removes the disparity, and the changes are not expected to be breaking.

On the downloading side of things Mac browsers appear to be accepting either NFC or NFD, converting them to NFD as is native for Apple devices, thus no changes are needed there. NFC should work for everything.

A bit of a background on the filesystem side (not directly affecting Sanic)

# Linux sees NFC and NFD as separate files
>>> open("foo\u00E4", "w").write("NFC\n")
>>> open("fooa\u0308", "w").write("NFD\n")

# ls foo*
fooä  fooä

# cat foo*
NFD
NFC

MacOS considers them the same, although everything should be in NFD. The above code creates a file with the first created (NFC) filename, with the content of the second file (overwrite). Reading files or overwriting existing files understands either form from the filesystem and preserves it as well (similar to case-insensitivity and probably a side effect of implementing that). Reading filenames off the filesystem on (glob etc) returns whichever form is present on disk (which on MacOS often is NFD, and which then might need to be converted to NFC for interoperation with other systems over the web).

Those who already find incorrectly encoded filenames on their system should find the convmv utility helpful, as it can mass convert names either way:

convmv -f UTF-8 -t UTF-8 --nfc ...

@Tronic Tronic requested a review from a team as a code owner December 13, 2022 03:40
@codecov
Copy link

codecov bot commented Dec 13, 2022

Codecov Report

Base: 88.568% // Head: 88.513% // Decreases project coverage by -0.055% ⚠️

Coverage data is based on head (0046d2e) compared to base (92e7463).
Patch coverage: 100.000% of modified lines in pull request are covered.

Additional details and impacted files
@@              Coverage Diff              @@
##              main     #2625       +/-   ##
=============================================
- Coverage   88.568%   88.513%   -0.056%     
=============================================
  Files           81        81               
  Lines         6631      6634        +3     
  Branches      1129      1130        +1     
=============================================
- Hits          5873      5872        -1     
- Misses         522       524        +2     
- Partials       236       238        +2     
Impacted Files Coverage Δ
sanic/request.py 94.527% <100.000%> (+0.041%) ⬆️
sanic/server/websockets/impl.py 37.557% <0.000%> (-0.922%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ahopkins ahopkins merged commit 13e9ab7 into main Dec 13, 2022
@ahopkins ahopkins deleted the filename_normalization branch December 13, 2022 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants