Sync reserved characters proposal #9539
Comments
A nitpick, before I forget it. I'll see if I have more substantial comments when I have more time. Re URL encoding and Samba's Catia mapping encoders: FAT12/16/32 do support Unicode, as UTF-16 I suppose, so they should be able to handle the PUA encoder just fine. The Catia mapping also requires Unicode support. But the filesystem specifically using UTF-8 is not required for any encoder.
You may be right. I was going off of here, but it may be wrong. I will remove that claim.
Will update doc.
Sorry, I don't follow. Can you clarify?
I think I got lost in the levels of inception of re-encoding, but I think this should be handled exactly the same way as the "case insensitive fs" wrapper. I agree that you can end up with cases where you switch between the different wrappers leading to unexpected effects, i.e., files that were claimed to be with In majority of the cases the codec should be a no-op, and that's fine; switching it back and forth should have no effect, and will only matter for cases where you do have a genuine ":" in the paths, which should be very few cases. I guess the more interesting case that I don't see handled is where our encoding scheme clashes with files that already exist. Agreed, we can have a helper CLI utility that helps "decode" or "encode" things in place to allow you to convert.
I think taking a step back and defining the assumed invariants would be good before diving into details and an action plan.
Regarding point 3, I really like the idea of storing the encoding scheme with the data, under As to point 1, we do have some kind of encoders already in Syncthing: encrypted names on untrusted devices (not easily reversible) and the Unicode normalization code (also not reversible if the previous name was not normalized). Looking at those might give some hints regarding the invariance questions. Integrating that functionality with the proposed encoding stuff is probably too far-fetched though. Thinking one step further, I could imagine even more radical encoders emerging, such as the mentioned base64 encoding. That might prove useful to implement further filesystem types in Syncthing, e.g. to add object stores. But then it needs to be clear whether this encoding machinery works with only a (non-reversible) hash function. Again, laying down these invariants / requirements for encoding schemes will help set the boundaries for designing the basic encoders we actually need in the first step.
s/UTF-8/unicode/. It doesn't matter whether a filesystem uses UTF-8; it needs to support Unicode, and any Unicode encoding will do.
The base filesystems only support 8.3 length non-Unicode filenames, but Windows uses an extension to also store longer Unicode filenames as an add-on.
Thank you for writing this up, it's an excellent summary of the problem, your proposed solutions, and the potential issues. ❤️ For me, however, it also illustrates quite clearly why I'm disinclined to accept the proposal (and the corresponding PR). In a nutshell, the problem ("I want to sync filenames containing reserved/unsupported/special characters on any filesystem") is fairly easily avoided and/or corrected when it surfaces. The proposed solutions, however, are complicated and error-prone, and the results of mistakes and misconfigurations are much harder to reason about and fix than the original problem. In my mind this makes the cost higher than the benefit.
Some small points to start with ...
Oh and as a counterpoint. The requirement seems to be that a particular host has all the files created on every other peer irrespective of the name it might be given here to overcome any local limitation. This is presumably useful for things like backup servers. In that case a translation like the previously mentioned base64 would be acceptable, BUT might still hit a file length limitation. Taking a secure hash (MD5, SHA1 etc) of the pathname would give a name with four or five 8 character sections for any original filename which is (basically) guaranteed to be unique. A small database containing a list of all the paths would be required to know what filenames are stored on the local FS. Working with the filesystem would be mostly trivial but there would be no method of migrating to or from this scheme except for adding another peer to the swarm. Though individual pathnames can be translated using simple tools like Personally I'm more likely to make the backup server a Linux box.
The following proposal was inspired by @JanKanis' recent comments.
Goal
As a user, I want to sync filenames containing reserved/unsupported/special characters on any filesystem. For example, I want to sync filenames containing `/\"*:<>?|` characters on NTFS, exFAT, FAT32 and ReFS filesystems, which disallow these characters in filenames. For more information, see #7876, which will be closed once this proposal is approved.

Proposed solution
Each folder will be configured to use an encoder. Initially, there will be two encoders: "Default" and "FAT".
Both existing, and newly created, folders will work exactly as they do now, and will be configured to use the "Default" encoder. Calling it the "Default encoder" is really a misnomer, as it's really just the way Syncthing works right now. It's only called the "Default encoder" in order to differentiate it from other encoders. We could also call it "None", or "Passthrough encoder", but "Default" seems the best choice for non-technical users. The user can change this setting via folder options.
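The per-folder encoder setting described above could be modeled roughly as follows. This is only a sketch under assumptions: the `Encoder` protocol, `FolderConfig`, the `encoder` field name and the `ENCODERS` registry are hypothetical names for illustration, not existing Syncthing code (Syncthing itself is written in Go; Python is used here for brevity).

```python
from dataclasses import dataclass
from typing import Protocol


class Encoder(Protocol):
    """Hypothetical interface: map names between the wire and the disk."""

    def encode(self, name: str) -> str: ...
    def decode(self, name: str) -> str: ...


class DefaultEncoder:
    """Passthrough: filenames are read and written "as is"."""

    def encode(self, name: str) -> str:
        return name

    def decode(self, name: str) -> str:
        return name


@dataclass
class FolderConfig:
    path: str
    # Assumed option name; existing and new folders fall back to "default".
    encoder: str = "default"


# Registry of available encoders; a "fat" entry would be added alongside.
ENCODERS = {"default": DefaultEncoder()}


def encoder_for(folder: FolderConfig) -> Encoder:
    """Look up the encoder configured for a folder."""
    return ENCODERS[folder.encoder]
```

With this shape, a folder that never sets the option behaves exactly as it does today, because the passthrough encoder returns every name unchanged.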
The encoders
The Default encoder
The Default encoder, as described above, is not really an encoder, as it reads and writes filenames on disk "as is," without any encoding. It does not reject or ignore any filenames it receives. It is designed to be used on filesystems that allow all characters except `/` and `NUL`, but it can be used on any filesystem. If it's used on a FAT-based filesystem, filenames with reserved characters can't be written to disk, leading to out-of-sync errors if it receives them.

The FAT encoder
The FAT encoder is designed to be used on filesystems that disallow the characters `/\"*:<>?|` in filenames. When filenames with these characters are written to disk, the FAT encoder encodes the filename in a format that the filesystem will accept. When read from disk, the filename is then decoded before being sent. The FAT encoder can be used on any filesystem, but there is no reason to run it on a non-FAT filesystem.

Potential issues
Here's a list of potential issues I can think of, and how we can address them:
1. A Default encoder finds an encoded filename on disk
This may occur when a user has a folder configured to use the FAT encoder, the folder contains encoded filenames, and the user then changes the folder's encoder back to Default (or they downgrade Syncthing). When the Default encoder reads these filenames from disk, it will log warning messages, but will send these filenames "as is."
2. A FAT encoder receives an encoded filename
This may occur when a Default encoder finds an encoded filename on disk (described above) and sends it to a FAT encoder.
When the FAT encoder receives encoded filenames, it will log a warning message, and ignore the request, leading to out-of-sync errors.
This issue is most likely a mis-configuration issue, where the sending folder should be using the FAT encoder, but it's using the Default encoder instead. But it's possible that a user has filenames that just happen to use the same characters that Syncthing uses to encode characters.
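To make the rejection rule concrete, here is a minimal sketch of a FAT encoder that maps reserved characters into the Private Use range and refuses incoming names that already contain such characters. It assumes the Cygwin-style mapping (reserved character `c` becomes code point `0xF000 + ord(c)`), treats `/` as the path separator rather than something to encode, and only checks for the specific code points this encoder produces. All function names are hypothetical.

```python
# Hypothetical sketch, not Syncthing code.
# Reserved set from the proposal, minus "/" (assumed to be the path separator).
RESERVED = '\\"*:<>?|'
PUA_BASE = 0xF000

# Translation tables keyed by code point, as str.translate expects.
ENCODE = {ord(c): PUA_BASE + ord(c) for c in RESERVED}
DECODE = {pua: plain for plain, pua in ENCODE.items()}


def is_encoded(name: str) -> bool:
    """True if the name contains a code point this encoder would produce."""
    return any(ord(ch) in DECODE for ch in name)


def encode(name: str) -> str:
    """Name as received from a peer -> name as written to disk."""
    return name.translate(ENCODE)


def decode(name: str) -> str:
    """Name as read from disk -> name as sent to peers."""
    return name.translate(DECODE)


def receive(name: str):
    """Return the on-disk name, or None if the incoming name is rejected.

    An incoming name that already looks encoded is ambiguous: decoding it
    later would yield a different name than the one the peer sent.
    """
    if is_encoded(name):
        return None  # the caller would log a warning and report out-of-sync
    return encode(name)
```

For example, `receive("acolon:")` yields a name ending in U+F03A, while `receive("acolon\uf03a")` yields `None`, which corresponds to the ignored request described above.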
Phase 4 enhancement: A "FAT re-encoding" encoder
In phase 4, we might consider adding a FAT encoder to encode (or "re-encode") the encoded filename again. For example, if we are using the Private Use Unicode characters (`\xf000`-`\xf0ff`) for encoding, we could re-encode `\xf0xx` as `\xf05cf0xx` (`\xf05c` is an encoded backslash).

Suppose the file `acolon:` was originally encoded as `acolon\xf03a`. When the FAT encoder receives this encoded filename, it would re-encode it as `acolon\xf05cf03a\xf03a`.

When viewing this filename in a Windows Subsystem for Linux (WSL) or Cygwin shell, the file would appear to the user in a directory listing as literally `acolon\xf03a:`. In a GitBash or MSYS shell, it would appear as `'acolon'$'\357\201\234'xf03a:`. This would visually indicate the filename was re-encoded.

The GUI would display this new encoder as the "FAT (re-encoding)" encoder. The existing FAT encoder would still be called simply the "FAT" encoder.
3. A Default encoder receives an encoded filename
The Default encoder receives encoded filenames, and writes them to disk as-is, but will log warning messages.
Phase 3 enhancement: An "ignoring encoded" encoder
In phase 3, we might consider adding an encoder which is a Default encoder, but it rejects receiving encoded filenames, which will lead to out-of-sync errors. The GUI displays this as the "Ignore encoded filenames" encoder.
4. A Default encoder receives a re-encoded filename (phase 4 issue)
The Default encoder will send and receive re-encoded filenames, but will log warning messages.
The "Ignore encoded filenames" encoder created in phase 3 would ignore these re-encoded files.
5. A FAT encoder receives a re-encoded filename (phase 4 issue)
All FAT encoders reject re-encoded filenames that are received, which would cause an out-of-sync error for that folder. It could re-re-encode, but the odds of needing to re-encode are slim, so the odds of re-re-encoding are probably next to none.
6. A user switches a folder's encoder from FAT to Default via the GUI
In phase 5, Syncthing could scan the folder's database to determine whether the folder contains any encoded filenames. If it does, Syncthing could explain the ramifications of the switch and ask the user to confirm their choice before saving the change.
7. A user switches a folder's encoder from FAT to Default by editing config.xml
In phase 5, whenever a user changes a folder's encoder, Syncthing could write a file to `.stfolder/syncthing.yml` in the folder's root. Or on startup, if the file doesn't exist, Syncthing creates it. This file will contain the entry `encoder: default` or `encoder: fat`.

On a following startup, Syncthing compares this file with the folder's current setting, and if they're different, it logs a warning, and displays a warning in the GUI. It doesn't change the folder's encoder. It also doesn't rewrite the file, unless the user starts Syncthing with the `--fix-encoders` option.

Recovering from folders containing both encoded and decoded filenames
The problem
Let's say we have three peers: D, F and G. All three share a folder and D and F use the Default encoder. Peer G uses the FAT encoder. Peer F's filesystem is FAT, and so it had an out-of-sync error on the file `acolon:`. Peer F switched its folder's encoder to FAT, which saved `acolon:` as `acolon\xf03a`, and the out-of-sync error went away.

Now, peer F switched the folder's encoder from FAT, back to Default. The Default encoder will find the file `acolon\xf03a` on disk and sync this file to peer D, which will see it as a new file, and save it. Peer D now has two files named `acolon:` and `acolon\xf03a`.

Peer D will then sync these files with peer F. Peer F will still accept `acolon\xf03a`, but will reject `acolon:` as it has a reserved character.

Peer D will also sync these files with peer G, which accepts them both, encoding `acolon:` as `acolon\xf03a`, and re-encoding `acolon\xf03a` as `acolon\xf05cf03a\xf03a`.

Possible solution
A separate CLI program is run on any peer where the folder is not on a FAT filesystem. It searches for files where encoded files (`acolon\xf03a`) coexist with their decoded equivalents (`acolon:`). If a pair is found, it displays the two filenames, along with their timestamps, and asks the user to choose to keep:

1. `acolon\xf03a` (this would execute `mv -f 'acolon\xf03a' 'acolon:'`)
2. `acolon:` (this would execute `rm -f 'acolon\xf03a'`)

Syncthing will then propagate the delete/update (1.), or the delete (2.) to the other peers. Peer F will delete `acolon\xf03a`. Peer F will still have an out-of-sync error as it can't write to `acolon:`.

Peer G will delete `acolon\xf03a` (which is re-encoded on disk as `acolon\xf05cf03a\xf03a`), and update `acolon:` (encoded on disk as `acolon\xf03a`) if applicable.

Startup options could automate the selections:

- `--newest` - always select the newest file
- `--oldest` - always select the oldest file
- `--encoded` - always select the encoded filename
- `--decoded` - always select the decoded filename

Other encoders
1. FAT (Windows) encoder
In phase 2, the existing FAT encoder will be enhanced to encode reserved/disallowed filenames (such as `NUL`, and `adot.`). The encoding method will be compatible with GitBash, Windows Subsystem for Linux (WSL), Cygwin, MSYS, Linux's CIFS driver, etc. The filenames are somewhat incompatible in a Windows 11 Command Shell, as they must be referred to as `.\NUL` or `\\.\path\to\NUL`. Previous versions of Windows may require `\\.\path\to\NUL`. The GUI displays this new encoder as "FAT (Windows)".

Also in phase 2, the FAT encoder developed in phase 1 will be displayed as "FAT (Android)" by the GUI, as Android (and other Linux environments) allows filenames that are reserved in Windows, such as `NUL`.

2. iOS encoder
In phase 5, we may want to add an iOS encoder that encodes `:` and leading periods (except for our `.st{filter,ignore,versions}` files), as the Files app rejects files with colons and leading periods.

3. Plan9 encoder
In phase 5, we may want to create a Plan9 encoder that encodes the reserved characters `\x01-\x1f` and `\x80-\x9f`.

Summary of actions
Here is a summary of the actions described above, by phase:
Phase 1
We start with the Default and FAT encoders:
Encoder
Encoder
Phase 2
In phase 2, we add the FAT (Windows) encoder:
Encoder
(and Windows reserved filenames)
Phase 3
In phase 3, we add the "Ignoring encoded" encoder:
Encoder
Phase 4
In phase 4, we add the "FAT (re-encoding)" encoder:
New phase 4 actions:
Encoder
Encoders
Encoder
* : and logs a warning message
! : and logs an error message
Possible encoding methods
1. Unicode Private Use Area (PUA) characters
In phase 1, this encoding replaces reserved characters with Unicode Private Use characters (`\xf000`-`\xf0ff`). It is used by GitBash, Windows Subsystem for Linux (WSL), Cygwin, MSYS, Linux's CIFS driver, and other platforms. It requires the underlying filesystem to support Unicode filenames, as NTFS, VFAT, and exFAT do. See https://cygwin.com/cygwin-ug-net/using-specialnames.html. Proposed by @rasa and others.

2. URL-encoded
This encoding replaces reserved characters with their URL-encoded equivalent. See https://en.m.wikipedia.org/wiki/Percent-encoding. This would be a good choice on filesystems that don't support Unicode filenames. Proposed by @AudriusButkevicius. This may be implemented in phase 5, or earlier, if there is interest.
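A rough sketch of what the URL-encoded variant might look like follows. The proposal does not spell out the details, so two things here are assumptions: the reserved set excludes `/` (treated as the path separator), and the `%` character itself is also escaped so that decoding is unambiguous. The function names are hypothetical.

```python
import re

# Hypothetical sketch, not Syncthing code.
# Reserved set from the proposal, minus "/" (assumed to be the path separator).
RESERVED = '\\"*:<>?|'
ESCAPE = "%"

_ESCAPED = re.compile(r"%([0-9A-Fa-f]{2})")


def url_encode(name: str) -> str:
    """Replace reserved characters (and "%" itself) with %XX escapes."""
    out = []
    for ch in name:
        if ch in RESERVED or ch == ESCAPE:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def url_decode(name: str) -> str:
    """Reverse url_encode: turn every %XX escape back into its character."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), name)
```

Escaping `%` is what keeps the round trip safe: a file genuinely named `100%3A` is stored as `100%253A` and decodes back to `100%3A` rather than to `100:`. The result uses only ASCII escapes, at the cost of longer names on disk.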
3. Samba's Catia mapping
This encoding replaces reserved characters using the mapping `"→¨ *→¤ /→ø :→÷ <→« >→» ?→¿ \→ÿ |→¦`. This would be a good choice if the user wants to encode to more visually related characters. See https://www.samba.org/samba/docs/current/man-html/vfs_catia.8.html. Proposed by @JanKanis. This may be implemented in phase 5, or earlier, if there is interest.

Other considerations
Automatically selecting a folder's encoder
When a new folder is created, Syncthing could use shirou/gopsutil to determine the folder's underlying filesystem. But according to @AudriusButkevicius: "This is not reliable, esp on android which actively lies to make applications think certain things are supported." Instead, Syncthing could try to create files with `/\"*:<>?|` in the filenames. If it can't, it could set the folder's encoder to FAT. But if it guesses wrong, this could lead to issues if the user changes the folder back to using the Default encoder. Best to leave this choice up to the user, at least for now.

Important links
Feedback appreciated!
Thanks for reading! Please provide your feedback, suggestions, enhancements.