Gateway does not run Unicode Normalization Forms leading to seemingly identical paths not resolving when using different non normalized strings #10286

Open
3 tasks done
Griss168 opened this issue Jan 10, 2024 · 12 comments
Labels
help wanted Seeking public contribution on this issue kind/bug A bug in existing code (including security flaws) kind/enhancement A net-new feature or improvement to an existing feature kind/feature A new feature P2 Medium: Good to have, but can wait until someone steps up

Comments

@Griss168

Griss168 commented Jan 10, 2024

Checklist

Installation method

ipfs-desktop

Version

Kubo version: 0.25.0
Repo version: 15
System version: amd64/darwin
Golang version: go1.21.5

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "AppendAnnounce": [
      "/ip4/150.23.2.29/udp/12345/quic-v1"
    ],
    "Gateway": "/ip4/127.0.0.1/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/127.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/udp/12345/quic-v1",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "DontCheckOSXFUSE": true,
  "Experimental": {
    "FilestoreEnabled": true,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": true,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "DisableHTMLErrors": null,
    "ExposeRoutingAPI": null,
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "censored"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "48h",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": [
      {
        "Addrs": [
          "/ip4/10.0.1.101/udp/4001/quic"
        ],
        "ID": "censored"
      }
    ]
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {},
  "Routing": {
    "AcceleratedDHTClient": false,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "2m0s",
      "HighWater": 128,
      "LowWater": 64,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Hello,

I'm using IPFS-Desktop, but that's not important.

I create a folder called "Test" and put some files in it with these names:
A-5x01 Tíha 1.txt
A-5x02 Tíha 2.txt
A-5x03 Fenomén strachu.txt
A-5x04 Rozklad anděla.txt
A-5x05 Stěna ztracených duší.txt
A-5x06 Časová smyčka.txt

When I add the "Test" folder to IPFS, I get the folder's CID QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie.
Each file now has a path like "/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01 Tíha 1.txt".

If a file name or path contains characters such as ÁáÄäÉéĚěÍíÓóÔôÚúŮůÝýČčďťŇňŘřŠšŽž, the link can be URL-encoded in two different ways.
The path without URL encoding:
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01 Tíha 1.txt
URL-encoded with the combining acute accent ́ (U+0301, bytes 0xCC 0x81 in UTF-8) placed after a plain i. The HTTP gateway resolves this URL:
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01%20Ti%CC%81ha%201.txt
URL-encoded with the single character í (U+00ED, bytes 0xC3 0xAD in UTF-8). The HTTP gateway cannot resolve this URL:
http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x01%20T%c3%adha%201.txt
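
The two encoded paths above can be reproduced from one file name. The following is a minimal Python sketch using only the standard library; it assumes the two variants correspond to the Unicode NFD and NFC normalization forms, which the report itself does not state at this point.

```python
import unicodedata
from urllib.parse import quote

# File name from the report; \u00ed is the single-code-point letter í.
name = "A-5x01 T\u00edha 1.txt"

# NFD splits í into a plain "i" followed by U+0301 (UTF-8 bytes CC 81).
nfd_path = quote(unicodedata.normalize("NFD", name))
# NFC keeps the single code point U+00ED (UTF-8 bytes C3 AD).
nfc_path = quote(unicodedata.normalize("NFC", name))

print(nfd_path)  # A-5x01%20Ti%CC%81ha%201.txt
print(nfc_path)  # A-5x01%20T%C3%ADha%201.txt
```

Both strings render identically on screen, yet they percent-encode to different octets, which matches the two URLs listed above.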

These special symbols are commonly used in the Czech and Slovak languages and are found in many files and folders.

I tried both URL-encoded forms on an arbitrary Apache web server, and it serves both links. I'm trying some apps that download files over HTTP, but they use the second encoding and can't find the file on the HTTP gateway.

The test files are here: https://ipfs.io/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/

I tested this behavior on IPFS-Desktop 0.32.0 for Windows and macOS, on kubo 0.25.0 for macOS, and on the https://ipfs.io/ gateway.

I hope it will be useful.

@Griss168 Griss168 added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jan 10, 2024
@Jorropo
Contributor

Jorropo commented Jan 11, 2024

@Griss168 thx a lot for the great report.
I am able to get it working using % encoding:

$ curl -L -vvv http://127.0.0.1:8080/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03%20Fenome%CC%81n%20strachu.txt%20
*   Trying 127.0.0.1:8080...
* Connected to 127.0.0.1 (127.0.0.1) port 8080
> GET /ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03%20Fenome%CC%81n%20strachu.txt%20 HTTP/1.1
> Host: 127.0.0.1:8080
> User-Agent: curl/8.5.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Access-Control-Allow-Headers: Content-Type
< Access-Control-Allow-Headers: Range
< Access-Control-Allow-Headers: User-Agent
< Access-Control-Allow-Headers: X-Requested-With
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Methods: HEAD
< Access-Control-Allow-Methods: OPTIONS
< Access-Control-Allow-Origin: *
< Access-Control-Expose-Headers: Content-Length
< Access-Control-Expose-Headers: Content-Range
< Access-Control-Expose-Headers: X-Chunked-Output
< Access-Control-Expose-Headers: X-Ipfs-Path
< Access-Control-Expose-Headers: X-Ipfs-Roots
< Access-Control-Expose-Headers: X-Stream-Output
< Cache-Control: public, max-age=29030400, immutable
< Content-Length: 168
< Content-Type: text/plain; charset=utf-8
< Etag: "QmVht3ZMMcuf4nCsjysExEvvFppUCSZSNc6fxmBquBkJMf"
< X-Ipfs-Path: /ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x03 Fenomén strachu.txt
< X-Ipfs-Roots: QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie,QmVht3ZMMcuf4nCsjysExEvvFppUCSZSNc6fxmBquBkJMf
< Date: Thu, 11 Jan 2024 07:52:39 GMT
< 
A-5x01 Tíha 1.txt
A-5x02 Tíha 2.txt
A-5x03 Fenomén strachu.txt
A-5x04 Rozklad anděla.txt
A-5x05 Stěna ztracených duší.txt
* Connection #0 to host 127.0.0.1 left intact
A-5x06 Časová smyčka.txt

I don't think non-%-encoded URLs are supported by any correct HTTP server; browsers sometimes un-%-encode the URL they show to users.
See how the %-encoded request shows up in the network tab even though the address bar does not show it:
Screenshot from 2024-01-11 08-55-29

I can also browse the file you had issues with:
Screenshot from 2024-01-11 09-00-27

Maybe Firefox is doing something Chrome is not doing? What browser are you using, please?

Last thing, I noticed some of your file names have a trailing space:

> ipfs block get QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie | protoc --decode=PBNode unixfs.proto
Data {
  Type: Directory
}
Links {
  Hash: "\022 e\0001\233\0373O\252l\017\215\3241\370@\211V\002\330*\333\2249]\314\217#\t\337\373\233\t"
  Name: "A-5x01 Ti\314\201ha 1.txt"
  Tsize: 678
}
Links {
  Hash: "\022 mr\025v?\032\017\362K\202\017\304@\034\346U\320\332\264\032\310\216\007\264\020w\306\336\354T\366\352"
  Name: "A-5x02 Ti\314\201ha 2.txt "
  Tsize: 179
}
Links {
  Hash: "\022 mr\025v?\032\017\362K\202\017\304@\034\346U\320\332\264\032\310\216\007\264\020w\306\336\354T\366\352"
  Name: "A-5x03 Fenome\314\201n strachu.txt "
  Tsize: 179
}
Links {
  Hash: "\022 mr\025v?\032\017\362K\202\017\304@\034\346U\320\332\264\032\310\216\007\264\020w\306\336\354T\366\352"
  Name: "A-5x04 Rozklad ande\314\214la.txt "
  Tsize: 179
}
Links {
  Hash: "\022 mr\025v?\032\017\362K\202\017\304@\034\346U\320\332\264\032\310\216\007\264\020w\306\336\354T\366\352"
  Name: "A-5x05 Ste\314\214na ztraceny\314\201ch dus\314\214i\314\201.txt "
  Tsize: 179
}
Links {
  Hash: "\022 mr\025v?\032\017\362K\202\017\304@\034\346U\320\332\264\032\310\216\007\264\020w\306\336\354T\366\352"
  Name: "A-5x06 C\314\214asova\314\201 smyc\314\214ka.txt"
  Tsize: 179
}
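
The octal escapes in the `Name` fields above are raw UTF-8 bytes (`\314\201` is `CC 81`). As a rough illustration, this Python sketch decodes two of the names copied from the dump and reports their normalization form and whether they end in whitespace. The `report` structure is only for this example.

```python
import unicodedata

# Two Name values copied from the protoc dump above. Python bytes
# literals accept the same octal escapes that protoc prints.
raw_names = [
    b"A-5x01 Ti\314\201ha 1.txt",
    b"A-5x02 Ti\314\201ha 2.txt ",
]

report = []
for raw in raw_names:
    text = raw.decode("utf-8")
    report.append({
        "name": text,
        # is_normalized() is available in Python 3.8 and later.
        "is_nfd": unicodedata.is_normalized("NFD", text),
        "is_nfc": unicodedata.is_normalized("NFC", text),
        "trailing_space": text != text.rstrip(),
    })

for entry in report:
    print(entry)
```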

@Jorropo
Contributor

Jorropo commented Jan 11, 2024

After checking, it seems the on-the-wire string could be UTF-8 but should not be:

From RFC3986:

Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI.

From RFC7230:

A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII]. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A). String-based parsers can only be safely used within protocol elements after the element has been extracted from the message, such as within a header field-value after message parsing has delineated the individual fields.

@Griss168
Author

Griss168 commented Jan 11, 2024

URLs with spaces are not a problem, because browsers and other clients always encode them.

The problem is that the same UTF-8 string can be encoded as this URL:
https://ipfs.io/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x06%20C%cc%8casova%cc%81%20smyc%cc%8cka.txt
but also as this URL:
https://ipfs.io/ipfs/QmdmqdwE1ZWzPKJVWyDvLqNY2PLedaeLQfoKuJSViHGpie/A-5x06%20%c4%8casov%c3%a1%20smy%c4%8dka.txt
After decoding, both URLs give the same file name, "A-5x06 Časová smyčka.txt", so you would expect to get the same file from the gateway (web server) at both URLs. But this doesn't work.
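
To make the mismatch concrete, here is a small Python sketch that percent-decodes the last path segment of both URLs to raw octets. It only uses the standard library; the NFC comparison at the end is an illustration of "same name to a human", not something the gateway is known to do.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# Last path segment of the two URLs quoted above.
first = "A-5x06%20C%cc%8casova%cc%81%20smyc%cc%8cka.txt"
second = "A-5x06%20%c4%8casov%c3%a1%20smy%c4%8dka.txt"

raw_first = unquote_to_bytes(first)
raw_second = unquote_to_bytes(second)

# Byte-for-byte comparison, which is what a path lookup sees.
same_bytes = raw_first == raw_second

# Comparison after normalizing both decoded names to NFC.
same_after_nfc = (
    unicodedata.normalize("NFC", raw_first.decode("utf-8"))
    == unicodedata.normalize("NFC", raw_second.decode("utf-8"))
)

print(same_bytes, same_after_nfc)  # False True
```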

Screenshot 2024-01-11 at 10 34 46

By the way,

I have no idea how the spaces got to the end of some file names. I just created a file in Sublime Text, saved it to disk under different names, and dragged and dropped the files into ipfs-desktop.

@Griss168
Author

This is the same file name with different UTF8 encoding:
Screenshot 2024-01-11 at 10 55 26-utf8

@Jorropo
Contributor

Jorropo commented Jan 11, 2024

@Griss168 I see now, thx. This is not a decoding issue.

The two links literally have different binary representations (shown here as code points):

41 2d 35 78 30 36 20 43 30c 61 73 6f 76 61 301 20 73 6d 79 63 30c 6b 61 2e 74 78 74 
41 2d 35 78 30 36 20 10c 61 73 6f 76 e1 20 73 6d 79 10d 6b 61 2e 74 78 74 
A - 5 x 0 6   C ̌ a s o v a ́   s m y c ̌ k a . t x t 
A - 5 x 0 6   Č a s o v á   s m y č k a . t x t 

The first string (the one you uploaded to your Kubo node) uses multi-code-point graphemes: it encodes the file name using boring old Latin letters and then applies accent modifiers to them:

030C ̌ COMBINING CARON
= hacek, V above
• Pinyin: marks Mandarin Chinese third tone
→ 02C7 ˇ caron

The second one uses a code point which is literally the letter with the accent (in a single code point):

010C Č LATIN CAPITAL LETTER C WITH CARON
≡ 0043 C 030C ̌

Kubo works on binary; it does not even know that file names are text. So because the binary representations don't match, it complains.
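
The two code point dumps above can be regenerated with a few lines of Python. This is only a sketch for inspection: `code_points` is a helper name made up for this example, and the standard `unicodedata` module stands in for whatever tool produced the original dump.

```python
import unicodedata

# The file name with single-code-point letters (Č, á, č).
name = "A-5x06 \u010casov\u00e1 smy\u010dka.txt"

def code_points(text: str) -> str:
    """Render each code point of `text` as lowercase hex, space-separated."""
    return " ".join(f"{ord(ch):x}" for ch in text)

# Decomposed form: base letter followed by a combining mark.
nfd_dump = code_points(unicodedata.normalize("NFD", name))
# Composed form: one code point per accented letter.
nfc_dump = code_points(unicodedata.normalize("NFC", name))

print(nfd_dump)
print(nfc_dump)
```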

What you are asking us to do is to run Unicode Normalization Forms:

This annex describes normalization forms for Unicode text. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

However, this is annoying to implement and has security implications (because various implementations might not agree on how to resolve files). I understand the need for people to be able to share file names in their own languages, but I think this needs to be looked over in the ipfs/specs repo first.

I'll create an issue there and send it to our gateway experts.

@Jorropo Jorropo changed the title Incorrect URL decoding of some characters on local gateway 127.0.0.1:8080 Gateway does not run Unicode Normalization Forms leading to seemingly identical paths not resolving when using different non normalized strings. Jan 11, 2024
@Jorropo Jorropo changed the title Gateway does not run Unicode Normalization Forms leading to seemingly identical paths not resolving when using different non normalized strings. Gateway does not run Unicode Normalization Forms leading to seemingly identical paths not resolving when using different non normalized strings Jan 11, 2024
@Jorropo
Contributor

Jorropo commented Jan 11, 2024

As an alternative that avoids the security implications, we could still show an error page but add a link to the matching representation of the file on the gateway.
So silicon brains would still error, but meat brains would understand what happened and could click the correct link.
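
A rough sketch of that idea in Python, assuming the error page has access to the names stored in the parent directory. `suggest_links` is a hypothetical helper, not an existing gateway function, and NFC is picked here only as a common form to compare in.

```python
import unicodedata

def suggest_links(requested: str, entries: list[str]) -> list[str]:
    """Return stored names that equal `requested` after NFC normalization.

    The request itself still fails; these names would only be shown
    as clickable suggestions on the error page.
    """
    wanted = unicodedata.normalize("NFC", requested)
    return [
        name for name in entries
        if name != requested and unicodedata.normalize("NFC", name) == wanted
    ]

# The directory stores the decomposed name; the client asks for the composed one.
directory = ["A-5x06 C\u030casova\u0301 smyc\u030cka.txt"]
suggestions = suggest_links("A-5x06 \u010casov\u00e1 smy\u010dka.txt", directory)
print(suggestions)
```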

@Griss168
Author

@Jorropo Thank You. Yes, that's what I was trying to explain.

I hope it will be resolved. Perhaps it will be possible to use an existing solution in the form of a library, as web servers do, for example.

> As an alternative to avoid security implications, we could still show an error page but add a link to the matching representation inside the file on the gateway.
> So silicon brains would still error but meat brains would understand what happen and could click the correct link.

I don't think this is a good solution. It only works if the gateway is used exclusively by humans. My use case is a third-party app that downloads files over HTTP. It will not be able to understand that the given files can be found in another place.

@lidel
Member

lidel commented Jan 12, 2024

Interesting! Polish has a bunch of diacritics such as ąęćłśźżóś, but I've never experienced them being represented as an ASCII letter plus a modifier rather than a single code point.

@Griss168 for the sake of prioritization, how common is this problem in the real world? Is it just one specific piece of software or website producing file names in a weird notation, or a daily occurrence for you? Which notation is more common in your language? The normalized one?

We could fix up the UX problem of the HTTP 404 here by adding an extra retry step in "not found" scenarios, as suggested in ipfs/specs#457 (comment). (Kubo already does this type of retry on subdomain gateways, where it checks for a _redirects file; we could add a Unicode retry before that.)
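
One possible shape of such a retry, sketched in Python under stated assumptions: `resolve_with_retry` and the `links` mapping (name to CID) are hypothetical stand-ins for the gateway's directory lookup, and refusing to guess when several different targets match is a design choice made for this example, not something specified in the thread.

```python
import unicodedata
from typing import Optional

def resolve_with_retry(requested: str, links: dict[str, str]) -> Optional[str]:
    """Look up `requested` exactly, then retry once on normalized names."""
    if requested in links:  # normal exact-match path
        return links[requested]
    wanted = unicodedata.normalize("NFC", requested)
    matches = {
        cid for name, cid in links.items()
        if unicodedata.normalize("NFC", name) == wanted
    }
    # Refuse to guess when several different targets match.
    return matches.pop() if len(matches) == 1 else None

# A directory whose only entry is stored in decomposed form.
links = {"Na\u0301vrat.txt": "QmT9T9VGigBFdAP9aB456iZqTTwx8dMWL2CA4BYKZKDNkA"}
found = resolve_with_retry("N\u00e1vrat.txt", links)
print(found)
```

Returning nothing on an ambiguous match keeps the fallback from silently picking one of two visually identical entries, which is the kind of disagreement between implementations raised earlier as a security concern.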

@Griss168
Author

After your explanation, I did some more tests today and found that the problem is somewhere deeper. Incorrect representation of UTF8 characters is only a consequence.

I created a test file /Test/Návrat.txt on a USB flash drive with FAT32.
When I added the folder Test to ipfs on macOS, I got:
./ipfs add -r /Volumes/NO\ NAME/Test
added QmT9T9VGigBFdAP9aB456iZqTTwx8dMWL2CA4BYKZKDNkA Test/Návrat.txt
added QmeAZQdGtHbd5bB2pyJYRwKLFV2WjZFADNvdxr7fiNkz9B Test

and then I added the same folder to ipfs on Windows from the same USB drive:
ipfs.exe add -r G:\Test
added QmT9T9VGigBFdAP9aB456iZqTTwx8dMWL2CA4BYKZKDNkA Test/Návrat.txt
added QmeKnwTydTvBqBnV9NjYfLzuB9bqSWiAiVYGs8NBGhkipD Test

As you can see, I got a different CID for the Test folder from the same data on different OS.

If I then compare the URL for the Návrat.txt file that is generated on the gateway, I get:
On Windows - http://127.0.0.1:8080/ipfs/QmeKnwTydTvBqBnV9NjYfLzuB9bqSWiAiVYGs8NBGhkipD/N%C3%A1vrat.txt
On MacOS - http://127.0.0.1:8080/ipfs/QmeAZQdGtHbd5bB2pyJYRwKLFV2WjZFADNvdxr7fiNkz9B/Na%CC%81vrat.txt

At the very beginning, I tried using a torrent client to download the torrent data from the ipfs gateway via the webseed distribution method. The torrent itself was created on Windows. I uploaded the data from the original torrent to the ipfs daemon on macOS. I then added the URL from the gateway to the torrent client on macOS as a webseed. The torrent client reported "File not found" for the webseed. When I inspected the HTTP requests with Wireshark, I found that the URL from the torrent client and the URL from the gateway differ in how the diacritics are represented. The torrent client stores file paths and names in UTF-8, and its requests use the same character representation as the torrent's file structure.

It looks like the ipfs daemon on macOS changes characters in file names from the one-character representation á (bytes 0xC3 0xA1) to the two-character representation a ́ (the combining accent is bytes 0xCC 0x81).
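
The difference in the directory CIDs follows from the name bytes alone. In this Python sketch a SHA-256 digest over the name bytes stands in for hashing the directory node; that is an assumption for illustration and not how a CID is actually computed. The variable names reflect the URLs reported above for each OS.

```python
import hashlib
import unicodedata

name = "N\u00e1vrat.txt"

# Composed form, as in the URL reported on Windows (N%C3%A1vrat.txt).
windows_bytes = unicodedata.normalize("NFC", name).encode("utf-8")
# Decomposed form, as in the URL reported on macOS (Na%CC%81vrat.txt).
macos_bytes = unicodedata.normalize("NFD", name).encode("utf-8")

print(windows_bytes.hex(" "))
print(macos_bytes.hex(" "))

# Different input bytes give different digests, so any parent object that
# embeds the name hashes differently, while the file content stays the same.
digests_equal = (
    hashlib.sha256(windows_bytes).digest() == hashlib.sha256(macos_bytes).digest()
)
print(digests_equal)  # False
```

This is consistent with the output above: the file CID is identical on both systems, and only the CID of the Test folder, which contains the name, changes.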

@aschmahmann aschmahmann added P2 Medium: Good to have, but can wait until someone steps up need/maintainers-input Needs input from the current maintainer(s) and removed need/triage Needs initial labeling and prioritization labels Feb 6, 2024
@lidel
Member

lidel commented Feb 6, 2024

Thank you for digging into this across different operating systems.

My understanding of the problem here is that this is not a problem with Kubo or IPFS.
IPFS does not touch your encoding, UnixFS filenames are just bytes, you put there whatever you want.
Same with Kubo, we get filename from operating system via system API as bytes, and store it in UnixFS, we don't do any normalization, we don't mutate user data.

👉 It is macOS being a special snowflake with their NFD normalizations:

This means that macOS Finder and some APIs and tools often change characters with diacritical marks (like accents) to be represented using a base character followed by a separate combining diacritical mark (NFD instead of NFC normalization everyone else uses).

This is a well known headache with MacOS, some examples:

@Griss168 this is to say, if Kubo (golang) doing ipfs add gets an already normalized name from the macOS system API, there is not much we can do. We don't know what the original was, nor can we force normalization by default, because hashes will change, plus some users expect NFD and we don't want to break them either.

The only idea I have for dealing with import is that we could use golang.org/x/text/unicode/norm and add an optional flag to force Unicode normalization during data import:

$ ipfs add --normalize-names none|nfc|nfd # opt-in, no normalization by default

This way, users could force specific normalization like NFC when doing import on macOS, but only in cases like yours, when it matters.
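
The behavior of the proposed flag could look roughly like the following Python sketch. The flag does not exist yet, `normalize_name` is a hypothetical helper, and Python's `unicodedata` is used here in place of golang.org/x/text/unicode/norm purely for illustration. Leaving names that are not valid UTF-8 untouched is an assumption that follows from UnixFS names being plain bytes.

```python
import unicodedata

def normalize_name(raw: bytes, mode: str = "none") -> bytes:
    """Apply the opt-in normalization `mode` (none, nfc or nfd) to a file name."""
    if mode == "none":
        return raw  # default: never mutate user data
    if mode not in ("nfc", "nfd"):
        raise ValueError(f"unsupported mode: {mode}")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # not valid UTF-8, so there is nothing to normalize
    return unicodedata.normalize(mode.upper(), text).encode("utf-8")

# Name as delivered by macOS (decomposed), forced to NFC on import.
macos_name = b"Na\xcc\x81vrat.txt"
imported = normalize_name(macos_name, "nfc")
print(imported)  # b'N\xc3\xa1vrat.txt'
```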

This would be in addition to the fixup on gateway described in ipfs/specs#457 which is a band-aid for data that was imported by other people, or requested with invalid normalization.

Together, they would give end users enough to get to the data via gateway.

@lidel lidel added need/author-input Needs input from the original author and removed need/maintainers-input Needs input from the current maintainer(s) labels Feb 6, 2024
@Griss168
Author

Griss168 commented Feb 6, 2024

> @Griss168 this is to say, if Kubo (golang) doing ipfs add gets an already normalized name from the macOS system API, there is not much we can do. We don't know what the original was, nor can we force normalization by default, because hashes will change, plus some users expect NFD and we don't want to break them either.
>
> The only idea I have for dealing with import is that we could use golang.org/x/text/unicode/norm and add an optional flag to force Unicode normalization during data import:
>
> $ ipfs add --normalize-names none|nfc|nfd # opt-in, no normalization by default
>
> This way, users could force specific normalization like NFC when doing import on macOS, but only in cases like yours, when it matters. Would this be useful?

I think it's a great solution. This solves my problem of how to add data to the IPFS network on different systems and thus improve their availability in the network.

> If not, we are left with fixup on gateway described in ipfs/specs#457 (which I think will be enough band-aid for data that was imported by other people).

But I think that this solution is also important because it solves retrieving data from the IPFS network. My tests showed that different HTTP clients use different methods of normalization and URL encoding.

Both solutions are important, although for my use case normalization during import is more important.

Thanks for not giving up :)


@lidel lidel added kind/enhancement A net-new feature or improvement to an existing feature kind/feature A new feature and removed need/author-input Needs input from the original author kind/stale labels Feb 13, 2024
@lidel lidel removed their assignment Feb 13, 2024
@lidel lidel added the help wanted Seeking public contribution on this issue label Feb 13, 2024
Projects
Status: 🥞 Todo
Development

No branches or pull requests

4 participants