gateway: run Unicode Normalisation Forms on path gateway inputs #457

Open
Jorropo opened this issue Jan 11, 2024 · 2 comments
Labels
kind/discussion Topical discussion; usually not changes to codebase P3 Low: Not priority right now

Comments

@Jorropo
Contributor

Jorropo commented Jan 11, 2024

See context here: ipfs/kubo#10286 (comment)
Relevant Unicode spec: https://unicode.org/reports/tr15/

@Jorropo Jorropo added the need/triage Needs initial labeling and prioritization label Jan 11, 2024
@hacdias
Member

hacdias commented Jan 11, 2024

For reference: https://go.dev/blog/normalization

@lidel
Member

lidel commented Jan 12, 2024

Thank you for raising this.
We operate under ecosystem constraints:

  • The UnixFS specification (Publish UnixFS specifications at specs.ipfs.tech #331) never normalised filenames; they are opaque strings.
  • We can't blindly run normalisation before resolving a content path.
    • It would break access to data whose filenames are stored in non-normalized notation.
  • We also can't make an arbitrary decision to change filenames while onboarding data.
    • There may be datasets which interlink and use different notations, and forcing normalization during onboarding to IPFS would break links in applications that operate on the data.
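
To make the first constraint concrete, here is a minimal Python sketch (not gateway code) of why blind normalisation breaks lookups. The `directory` dict and the `blind_lookup` helper are hypothetical stand-ins for a UnixFS directory and a resolver; the CID value is a placeholder.

```python
import unicodedata

# The same visible filename in two canonical forms.
name_nfc = unicodedata.normalize("NFC", "caf\u00e9.txt")  # precomposed U+00E9
name_nfd = unicodedata.normalize("NFD", "caf\u00e9.txt")  # "e" followed by U+0301

# Hypothetical directory: opaque filename strings mapped to placeholder CIDs.
# This dataset happens to store the decomposed (NFD) name.
directory = {name_nfd: "cid-placeholder"}


def blind_lookup(directory, segment):
    """Normalise the requested segment to NFC before lookup.

    This is the approach the list above warns against: the stored NFD
    entry can no longer be reached, even when requested exactly.
    """
    return directory.get(unicodedata.normalize("NFC", segment))
```

The two names render identically but are different strings, so a resolver that always converts to NFC can never match the NFD entry.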

What is the problem we are trying to solve?
My understanding of the linked issue is that a user copies a "non-normalised" content path from somewhere and gets a "not found" error because the DAG uses normalised filenames (notation mismatch).

If so, I think the best we could do UX-wise is to retry on "not found", trying the composed (NFC) and decomposed (NFD) forms of the path (to cover both variants).

This way we don't break datasets where the file already exists under the requested name, but we still fix the HTTP 404 for cases where the file only exists under a different notation.
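
A rough Python sketch of that retry order, under the assumption that a directory can be modelled as a mapping from filename to CID. `resolve_segment` is a hypothetical name for illustration, not an existing boxo or gateway function.

```python
import unicodedata


def resolve_segment(directory, segment):
    """Resolve one path segment: exact match first, then NFC, then NFD.

    `directory` is a hypothetical mapping of filename -> CID.
    Raises KeyError (the "not found" case) if no variant exists.
    """
    # 1. Exact bytes always win, so existing datasets keep working.
    if segment in directory:
        return directory[segment]

    # 2. Only after a miss, retry with the other notations.
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, segment)
        if candidate != segment and candidate in directory:
            return directory[candidate]

    raise KeyError(segment)
```

Because the exact match is tried first, a directory that contains both notations of a name still returns the entry that was actually requested; the fallback only changes the outcome of what would otherwise be a 404.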

If this is something we want to do, it should be included in #453 to ensure consistency across web contexts (which we will then reference from https://specs.ipfs.tech/http-gateways/path-gateway/).

But this introduces magical behavior that hides the underlying problem macOS introduced; see my comment in ipfs/kubo#10286 (comment).

Perhaps it is better to NOT fix reads, and instead give users the ability to force a specific normalization during data onboarding? (like ipfs add --normalize-names none|nfd|nfc suggested in ipfs/kubo#10286 (comment)).
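
For illustration only, the onboarding-side option could look roughly like this in Python. The `--normalize-names` flag is just the suggestion quoted above, not an existing option, and `normalize_name` is a hypothetical helper.

```python
import unicodedata


def normalize_name(name, mode="none"):
    """Apply the user's chosen normalization to a filename at import time.

    `mode` mirrors the suggested none|nfd|nfc values. The default "none"
    keeps the name untouched, preserving today's opaque-string behavior.
    """
    if mode == "none":
        return name
    if mode in ("nfc", "nfd"):
        return unicodedata.normalize(mode.upper(), name)
    raise ValueError("unknown normalization mode: " + repr(mode))
```

Keeping "none" as the default means nothing changes for existing workflows unless the user opts in.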


3 participants