Sync reserved characters proposal #9539

Open
rasa opened this issue May 14, 2024 · 9 comments
Labels
enhancement New features or improvements of some kind, as opposed to a problem (bug) needs-triage New issues needed to be validated

Comments

@rasa
Member

rasa commented May 14, 2024

The following proposal was inspired by @JanKanis' recent comments.

Goal

As a user, I want to sync filenames containing reserved, unsupported, or special characters on any filesystem. For example, I want to sync filenames containing the characters /\"*:<>?| on NTFS, exFAT, FAT32, and ReFS filesystems, which disallow these characters in filenames. For more information, see #7876, which will be closed once this proposal is approved.

Proposed solution

Each folder will be configured to use an encoder. Initially, there will be two encoders: "Default" and "FAT".

Both existing and newly created folders will work exactly as they do now, and will be configured to use the "Default" encoder. Calling it the "Default encoder" is something of a misnomer, as it's just the way Syncthing works right now. The name only serves to differentiate it from other encoders. We could also call it "None" or "Passthrough", but "Default" seems the best choice for non-technical users. The user can change this setting via folder options.

The encoders

The Default encoder

The Default encoder, as described above, is not really an encoder, as it reads and writes filenames on disk "as is," without any encoding. It does not reject or ignore any filenames it receives. It is designed to be used on filesystems that allow all characters except / and NUL, but it can be used on any filesystem. If it's used on a FAT-based filesystem, filenames with reserved characters cannot be written to disk, so receiving them leads to out-of-sync errors.

The FAT encoder

The FAT encoder is designed to be used on filesystems that disallow the characters /\"*:<>?| in filenames. When filenames with these characters are written to disk, the FAT encoder encodes the filename in a format that the filesystem will accept. When read from disk, the filename is then decoded before being sent. The FAT encoder can be used on any filesystem, but there is no reason to run it on a non-FAT filesystem.

Potential issues

Here's a list of potential issues I can think of, and how we can address them:

1. A Default encoder finds an encoded filename on disk

This may occur when a user has a folder configured to use the FAT encoder, the folder contains encoded filenames, and the user then changes the folder's encoder back to Default (or they downgrade Syncthing). When the Default encoder reads these filenames from disk, it will log warning messages, but will send these filenames "as is."

2. A FAT encoder receives an encoded filename

This may occur when a Default encoder finds an encoded filename on disk (described above) and sends it to a FAT encoder.

When the FAT encoder receives encoded filenames, it will log a warning message, and ignore the request, leading to out-of-sync errors.

This issue is most likely a misconfiguration, where the sending folder should be using the FAT encoder but is using the Default encoder instead. But it's also possible that a user has filenames that just happen to contain the same characters that Syncthing uses to encode reserved characters.
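The check described above can be sketched in Python. This is a minimal illustration, not Syncthing code: it assumes the Private Use range U+F000 to U+F0FF mentioned later in this proposal, and the names `looks_encoded` and `fat_accepts` are hypothetical. As noted, the check is only a heuristic, since a user's own filenames may contain these characters.

```python
import logging

log = logging.getLogger("encoder-sketch")

# Assumed encoding range, taken from the proposal (U+F000 - U+F0FF).
PUA_FIRST, PUA_LAST = 0xF000, 0xF0FF


def looks_encoded(name: str) -> bool:
    """Return True if the name contains any character in the assumed PUA range."""
    return any(PUA_FIRST <= ord(ch) <= PUA_LAST for ch in name)


def fat_accepts(name: str) -> bool:
    """Hypothetical receive-side check for a FAT encoder.

    Names that already look encoded are ignored with a warning, which would
    surface to the user as an out-of-sync item.
    """
    if looks_encoded(name):
        log.warning("ignoring filename that already looks encoded: %r", name)
        return False
    return True
```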

Phase 4 enhancement: A "FAT re-encoding" encoder

In phase 4, we might consider adding a FAT encoder that encodes (or "re-encodes") an already-encoded filename again. For example, if we are using the Private Use Unicode characters (\xf000 - \xf0ff) for encoding, we could re-encode \xf0xx as \xf05cf0xx (\xf05c is an encoded backslash).

For example, let's say the file acolon: was originally encoded as acolon\xf03a. When the FAT encoder receives this encoded filename, it would re-encode it as acolon\xf05cf03a\xf03a.

When viewing this filename in a Windows Subsystem for Linux (WSL) or Cygwin shell, the file would appear to the user in a directory listing as literally acolon\xf03a:. In a GitBash or MSYS shell, it would appear as 'acolon'$'\357\201\234'xf03a:. This would visually indicate the filename was re-encoded.

The GUI would display this new encoder as the "FAT (re-encoding)" encoder. The existing FAT encoder would still be called simply the "FAT" encoder.

3. A Default encoder receives an encoded filename

The Default encoder receives encoded filenames, and writes them to disk as-is, but will log warning messages.

Phase 3 enhancement: An "ignoring encoded" encoder

In phase 3, we might consider adding an encoder which is a Default encoder, but it rejects receiving encoded filenames, which will lead to out-of-sync errors. The GUI displays this as the "Ignore encoded filenames" encoder.

4. A Default encoder receives a re-encoded filename (phase 4 issue)

The Default encoder will send and receive re-encoded filenames, but will log warning messages.

The "Ignore encoded filenames" encoder created in phase 3 would ignore these re-encoded files.

5. A FAT encoder receives a re-encoded filename (phase 4 issue)

All FAT encoders reject re-encoded filenames that are received, which would cause an out-of-sync error for that folder. It could re-re-encode, but the odds of needing to re-encode are slim, so the odds of re-re-encoding are probably next to none.

6. A user switches a folder's encoder from FAT to Default via the GUI

In phase 5, Syncthing could scan the folder's database, to determine if the folder contains any encoded filenames. If there are, it could explain the ramifications of the switch, and ask the user to confirm their choice, before saving their change.

7. A user switches a folder's encoder from FAT to Default by editing config.xml

In phase 5, whenever a user changes a folder's encoder, Syncthing could write a file to .stfolder/syncthing.yml in the folder's root. If the file doesn't exist on startup, Syncthing creates it. This file will contain the entry encoder: default or encoder: fat.

On a following startup, Syncthing compares this file with the folder's current setting, and if they're different, it logs a warning, and displays a warning in the GUI. It doesn't change the folder's encoder. It also doesn't rewrite the file, unless the user starts Syncthing with the --fix-encoders option.
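The startup comparison could look roughly like the following Python sketch. It is an illustration under assumptions: the marker is parsed as a single `encoder: <name>` line rather than with a YAML library, and the function names and the `fix` parameter (standing in for `--fix-encoders`) are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Marker location as described in the proposal.
MARKER = Path(".stfolder") / "syncthing.yml"


def read_marker(folder_root: Path) -> Optional[str]:
    """Return the recorded encoder name, or None if no marker entry exists."""
    path = folder_root / MARKER
    if not path.exists():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "encoder":
            return value.strip()
    return None


def write_marker(folder_root: Path, encoder: str) -> None:
    """Record the encoder name in the marker file, creating .stfolder if needed."""
    path = folder_root / MARKER
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"encoder: {encoder}\n", encoding="utf-8")


def check_on_startup(folder_root: Path, configured: str, fix: bool = False) -> Optional[str]:
    """Compare the marker with the configured encoder.

    Returns a warning message on mismatch, else None. The marker is only
    rewritten on mismatch when fix is True; the folder's configured encoder
    is never changed here.
    """
    recorded = read_marker(folder_root)
    if recorded is None:
        write_marker(folder_root, configured)
        return None
    if recorded == configured:
        return None
    if fix:
        write_marker(folder_root, configured)
    return f"encoder mismatch: marker says {recorded!r}, config says {configured!r}"
```

Keeping the marker inside the folder means the check still works after a database reset or a hand-edited config.xml.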

Recovering from folders containing both encoded and decoded filenames

The problem

Let's say we have three peers: D, F and G. All three share a folder and D and F use the Default encoder. Peer G uses the FAT encoder. Peer F's filesystem is FAT, and so it had an out-of-sync error on the file acolon:. Peer F switched its folder's encoder to FAT, which saved acolon: as acolon\xf03a, and the out-of-sync error went away.

Now suppose peer F switches the folder's encoder from FAT back to Default. The Default encoder will find the file acolon\xf03a on disk and sync it to peer D, which will see it as a new file and save it. Peer D now has two files, named acolon: and acolon\xf03a.

Peer D will then sync these files with peer F. Peer F will still accept acolon\xf03a, but will reject acolon: as it has a reserved character.

Peer D will also sync these files with peer G, which accepts them both, encoding acolon: as acolon\xf03a, and re-encoding acolon\xf03a as acolon\xf05cf03a\xf03a.

Possible solution

A separate CLI program is run on any peer where the folder is not on a FAT filesystem. It searches for files where encoded files (acolon\xf03a) coexist with their decoded equivalents (acolon:). If a pair is found, it displays the two filenames, along with their timestamps, and asks the user to choose to keep:

  1. acolon\xf03a (this would execute mv -f 'acolon\xf03a' 'acolon:')
  2. acolon: (this would execute rm -f 'acolon\xf03a')

Syncthing will then propagate the delete and update (option 1), or the delete (option 2), to the other peers. Either way, peer F will delete acolon\xf03a, and it will still have an out-of-sync error, as it can't write acolon:.

Peer G will delete acolon\xf03a (which is re-encoded on disk as acolon\xf05cf03a\xf03a), and update acolon: (encoded on disk as acolon\xf03a) if applicable.

Startup options could automate the selections:

  1. --newest - always select the newest file
  2. --oldest - always select the oldest file
  3. --encoded - always select the encoded filename
  4. --decoded - always select the decoded filename
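The pair detection and the automated selection could be sketched as follows in Python. This is a hypothetical illustration: it assumes the Private Use mapping of U+F000 plus the character's code point, the function names are invented, and resolving timestamp ties in favour of the decoded name is an arbitrary choice.

```python
from typing import List, Tuple

RESERVED = '\\"*:<>?|'
# Assumed mapping: reserved character c is stored as U+F000 + ord(c).
_DECODE = {0xF000 + ord(c): ord(c) for c in RESERVED}


def pua_decode(name: str) -> str:
    """Map assumed PUA characters back to the reserved characters."""
    return name.translate(_DECODE)


def find_pairs(names: List[str]) -> List[Tuple[str, str]]:
    """Return (encoded, decoded) pairs where both names exist side by side."""
    present = set(names)
    pairs = []
    for name in sorted(present):
        decoded = pua_decode(name)
        if decoded != name and decoded in present:
            pairs.append((name, decoded))
    return pairs


def choose(encoded_mtime: float, decoded_mtime: float, policy: str) -> str:
    """Return 'encoded' or 'decoded' according to the selected policy."""
    if policy in ("encoded", "decoded"):
        return policy
    if policy == "newest":
        return "encoded" if encoded_mtime > decoded_mtime else "decoded"
    if policy == "oldest":
        return "encoded" if encoded_mtime < decoded_mtime else "decoded"
    raise ValueError(f"unknown policy: {policy}")
```

Without a policy flag, the tool would instead show each pair with its timestamps and prompt the user.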

Other encoders

1. FAT (Windows) encoder

In phase 2, the existing FAT encoder will be enhanced to encode reserved/disallowed filenames (such as NUL, and adot.). The encoding method will be compatible with GitBash, Windows Subsystem for Linux (WSL), Cygwin, MSYS, Linux's CIFS driver, etc. The filenames are somewhat incompatible in a Windows 11 Command Shell, as they must be referred to as .\NUL or \\.\path\to\NUL. Previous versions of Windows may require \\.\path\to\NUL. The GUI displays this new encoder as "FAT (Windows)".

Also in phase 2, the FAT encoder developed in phase 1 will be displayed as "FAT (Android)" by the GUI, as Android (and other Linux environments) allows filenames that are reserved in Windows, such as NUL.
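To make "reserved filenames" concrete, here is a small Python sketch that detects names such an encoder would need to handle. It covers the classic Windows device names plus names ending in a dot or space. It only detects these names and does not propose how to encode them. The function name is hypothetical, and, as noted elsewhere in this thread, the exact rules can vary by Windows version.

```python
# Classic Windows device names: CON, PRN, AUX, NUL, COM1-COM9, LPT1-LPT9.
WINDOWS_DEVICE_NAMES = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def is_windows_reserved(name: str) -> bool:
    """Return True if a single path component is problematic on Windows."""
    if name in (".", ".."):
        return False  # directory references, not real filenames
    if name.endswith((".", " ")):
        return True  # trailing dot or space, e.g. "adot."
    # Device names are matched case-insensitively, and ignoring any extension.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem in WINDOWS_DEVICE_NAMES
```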

2. iOS encoder

In phase 5, we may want to add an iOS encoder that encodes colons and leading periods (except for our .st{filter,ignore,versions} files), as the Files app rejects files with colons and leading periods.

3. Plan9 encoder

In phase 5, we may want to create a Plan9 encoder, that encodes the reserved characters \x01-\x1f and \x80-\x9f.

Summary of actions

Here is a summary of the actions described above, by phase:

Phase 1

We start with the Default and FAT encoders:

| Action | Default encoder | FAT encoder |
| --- | --- | --- |
| Receives filename that needs encoding | Saves filename as-is | Encodes filename |
| Finds filename on disk that is encoded | Sends filename as-is | Decodes filename |
| Receives filename that is already encoded | Saves filename as-is* | Rejects filename! |

Phase 2

In phase 2, we add the FAT (Windows) encoder:

| Action | FAT (Windows) encoder |
| --- | --- |
| Receives filename that needs encoding | Encodes filename (and Windows reserved filenames) |
| Finds filename on disk that is encoded | Decodes filename |
| Receives filename that is already encoded | Rejects filename! |

Phase 3

In phase 3, we add the "Ignoring encoded" encoder:

| Action | Ignore encoded encoder |
| --- | --- |
| Receives filename that needs encoding | Saves filename as-is |
| Finds filename on disk that is encoded | Ignores filename* |
| Receives filename that is already encoded | Rejects filename! |

Phase 4

In phase 4, we add the "FAT (re-encoding)" encoder:

| Action | "FAT (re-encoding)" encoder |
| --- | --- |
| Receives filename that needs encoding | Encodes filename |
| Finds filename on disk that is encoded | Decodes filename |
| Receives filename that is already encoded | Re-encodes filename |

New phase 4 actions:

| Action | Default encoder | FAT (Android/Windows) encoders | Ignore encoded encoder | "FAT (re-encoding)" encoder |
| --- | --- | --- | --- | --- |
| Finds filename on disk that is re-encoded | Sends filename as-is* | Rejects filename! | Ignores filename! | Decodes filename |
| Receives filename that is re-encoded | Sends filename as-is* | Rejects filename! | Ignores filename! | Rejects filename! |

* : and logs a warning message

! : and logs an error message

Possible encoding methods

1. Unicode Private Use Area (PUA) characters

In phase 1, this encoding replaces reserved characters with Unicode Private Use characters (\xf000 - \xf0ff). It is used by GitBash, Windows Subsystem for Linux (WSL), Cygwin, MSYS, Linux's CIFS driver, and other platforms. It requires the underlying filesystem allow UTF-8 characters, such as NTFS, VFAT, and exFAT. See https://cygwin.com/cygwin-ug-net/using-specialnames.html. Proposed by @rasa and others.
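A minimal Python sketch of this mapping, assuming each reserved character is shifted to U+F000 plus its code point (so : becomes U+F03A, matching the \xf03a notation used above). The function names are illustrative and not part of Syncthing. The slash is left out on the assumption that it never appears inside a single path component.

```python
RESERVED = '\\"*:<>?|'
PUA_BASE = 0xF000

# Translation tables keyed by code point, as str.translate expects.
_ENCODE = {ord(c): PUA_BASE + ord(c) for c in RESERVED}
_DECODE = {pua: plain for plain, pua in _ENCODE.items()}


def pua_encode(name: str) -> str:
    """Replace reserved characters with their Private Use Area counterparts."""
    return name.translate(_ENCODE)


def pua_decode(name: str) -> str:
    """Restore reserved characters from their Private Use Area counterparts."""
    return name.translate(_DECODE)
```

For example, `pua_encode("acolon:")` yields `"acolon\uf03a"`, and `pua_decode` reverses it.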

2. URL-encoded

This encoding replaces reserved characters with their URL-encoded equivalent. See https://en.m.wikipedia.org/wiki/Percent-encoding. This would be a good choice on filesystems that don't support UTF-8 characters. Proposed by @AudriusButkevicius. This may be implemented in phase 5, or earlier, if there is interest.
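A short Python sketch of this approach, under one added assumption that is not in the proposal: the percent sign itself is also escaped, so that names which already contain sequences like %3A still decode to the original. The function names are hypothetical.

```python
from urllib.parse import unquote

# Reserved characters from the proposal, plus '%' so decoding stays unambiguous
# (escaping '%' is an assumption of this sketch).
RESERVED = '\\"*:<>?|%'


def url_encode(name: str) -> str:
    """Percent-encode only the reserved characters, leaving everything else as-is."""
    return "".join(f"%{ord(ch):02X}" if ch in RESERVED else ch for ch in name)


def url_decode(name: str) -> str:
    """Decode percent-escapes back to the original characters."""
    return unquote(name)
```

The encoded names are pure ASCII wherever the original was, which is why this method suits filesystems without Unicode support.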

3. Samba's Catia mapping

This encoding replaces reserved characters using the mapping "→¨ *→¤ /→ø :→÷ <→« >→» ?→¿ \→ÿ |→¦. This would be a good choice if the user wants to encode to more visually related characters. See https://www.samba.org/samba/docs/current/man-html/vfs_catia.8.html. Proposed by @JanKanis. This may be implemented in phase 5, or earlier, if there is interest.
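The mapping listed above can be written as a pair of translation tables. This Python sketch uses hypothetical function names. Note that decoding is ambiguous for names that legitimately contain one of the replacement characters: a real name containing ø would decode to one containing a slash.

```python
# Mapping exactly as listed in the proposal.
CATIA_MAP = {
    '"': "¨", "*": "¤", "/": "ø", ":": "÷", "<": "«",
    ">": "»", "?": "¿", "\\": "ÿ", "|": "¦",
}

_TO_CATIA = str.maketrans(CATIA_MAP)
_FROM_CATIA = str.maketrans({v: k for k, v in CATIA_MAP.items()})


def catia_encode(name: str) -> str:
    """Replace reserved characters with their visually related stand-ins."""
    return name.translate(_TO_CATIA)


def catia_decode(name: str) -> str:
    """Map the stand-in characters back to the reserved characters."""
    return name.translate(_FROM_CATIA)
```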

Other considerations

Automatically selecting a folder's encoder

When a new folder is created, Syncthing could use shirou/gopsutil to determine the folder's underlying filesystem. But according to @AudriusButkevicius: "This is not reliable, esp on android which actively lies to make applications think certain things are supported." Instead, Syncthing could try to create files with /\"*:<>?| in the filenames. If it can't, it could set the folder's encoder to FAT. But if it guesses wrong, this could lead to issues, if the user changes the folder back to using the Default encoder. Best to leave this choice up to the user, at least for now.
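The probing idea could be sketched like this in Python. It is an illustration only: the function names are hypothetical, the slash is skipped because it is a path separator on most platforms, and the directory listing is re-checked after each write because a platform might accept the call but silently store a different name.

```python
import os
import tempfile

PROBE_CHARS = '\\"*:<>?|'


def rejected_characters(folder: str) -> set:
    """Return the probe characters that cannot be used in filenames under folder."""
    rejected = set()
    with tempfile.TemporaryDirectory(dir=folder) as probe_dir:
        for ch in PROBE_CHARS:
            filename = f"probe{ch}x"
            try:
                with open(os.path.join(probe_dir, filename), "w"):
                    pass
            except OSError:
                rejected.add(ch)
                continue
            # The create call may succeed while the name is stored differently.
            if filename not in os.listdir(probe_dir):
                rejected.add(ch)
    return rejected


def suggest_encoder(folder: str) -> str:
    """Suggest 'fat' if any probe character is rejected, otherwise 'default'."""
    return "fat" if rejected_characters(folder) else "default"
```

As the paragraph above notes, a wrong guess has consequences, so a result like this is better shown as a suggestion than applied automatically.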

Important links

  1. A good resource that documents what characters and filenames are reserved on different filesystems is at https://en.wikipedia.org/wiki/Filename#Comparison_of_filename_limitations .

Feedback appreciated!

Thanks for reading! Please provide your feedback, suggestions, enhancements.

rasa added the enhancement and needs-triage labels May 14, 2024
@JanKanis

a nitpick, before I forget it. I'll see if I have more substantial comments when I have more time.

re url encoding and Samba's Catia mapping encoders: FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine. The Catia mapping also requires unicode support. But the filesystem specifically using UTF-8 is not required for any encoder.

@rasa
Member Author

rasa commented May 14, 2024

> FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine.

You may be right. I was going off of here, but it may be wrong. I will remove that claim.

> The Catia mapping also requires unicode support.

Will update doc.

> But the filesystem specifically using UTF-8 is not required for any encoder.

Sorry, I don't follow. Can you clarify?

@AudriusButkevicius
Member

I think I got lost in the levels of inception of re-encoding, but I think this should be handled exactly the same way as the "case insensitive fs" wrapper.
That is, there is a filesystem wrapper that hides the gnarliness of having to encode/decode files, so to Syncthing it should look like every filesystem supports everything, and I'm not that concerned as to how the sausages are made.

I agree that you can end up with cases where switching between the different wrappers leads to unexpected effects, e.g., files whose names contained : suddenly get deleted and replaced with some encoded version. But I think that is OK, as there is no actual data loss, just a change in names, and switching the encoding back on would unwind this.

In the majority of cases the codec should be a no-op, and that's fine. Switching it back and forth should have no effect, and will only matter where you do have a genuine ":" in the paths, which should be very few cases.

I guess the more interesting case that I don't see handled is where our encoding scheme clashes with files that already exist.
Namely I replace : with unicorn, want to sync a:, but already have a file aunicorn.
I guess perhaps this was covered by the inception of re-encodings, but I guess it lacks clarity and examples for me to digest what is being said there.

Agreed, we can have a helper CLI utility that helps "decode" or "encode" things in place to allow you to convert.

@acolomb
Member

acolomb commented May 14, 2024

I think taking a step back and defining the assumed invariants would be good before diving into details and an action plan.

  1. Is "encoding" always a well-defined, reversible process? Do encode and decode schemes have perfect symmetry and lossless round-trips?

  2. Do we acknowledge that there is no way to deduce an encoding solely based on looking at encoded names? As long as we don't have any reserved escape character(s) that are otherwise forbidden except in names encoded by Syncthing, we must assume that detecting "already encoded" names vs. "happens to use one of our replacement characters" is a best-effort heuristic.

  3. Can the encoding / decoding process be made foolproof if we assume the encoding scheme is known? Imagine a more radical encoder which, e.g. translates all names to their base64 equivalent. Then detecting a single file that uses any other character would point to a misconfiguration or externally placed, non-encoded files.

  4. Is the encoding a strictly local matter, or is it announced to other peers?

Regarding point 3, I really like the idea of storing the encoding scheme with the data, under .stfolder. It survives database resets and messing with the configuration. I would even argue that once put there, Syncthing should not support changing the encoding, but rather require the folder to be set up again from scratch. Putting an encoding marker there after the fact should be safe, as any file name found locally that cannot be a product of that encoding, will be renamed (encoded) at that time. Assuming the choice of encoder is sensible (e.g. FAT on a FAT filesystem), there cannot even be unencoded existing names, as the filesystem would not allow them.

As to point 1, we do have some kind of encoders already in Syncthing: encrypted names on untrusted devices (not easily reversible) and the Unicode normalization code (also not reversible if the previous name was not normalized). Looking at those might give some hints regarding the invariance questions. Integrating that functionality with the proposed encoding stuff is probably too far fetched though.

Thinking one step further, I could imagine even more radical encoders emerging, such as the mentioned base64 encoding. That might prove useful to implement further filesystem types in Syncthing, e.g. to add object stores. But then it needs to be clear whether this encoding machinery works with only a (non-reversible) hash function. Again, laying down these invariants / requirements for encoding schemes will help set the boundaries for designing the basic encoders we actually need in the first step.
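Questions 1 to 3 above can be probed with a small round-trip check. The Python sketch below is illustrative only: it assumes the Private Use mapping of U+F000 plus the code point, uses URL-safe base64 as the "radical" encoder, and all names are hypothetical. It shows that the PUA scheme round-trips ordinary names but not names that already contain one of its replacement characters, while base64 round-trips both.

```python
import base64
from typing import Callable

RESERVED = '\\"*:<>?|'
_ENCODE = {ord(c): 0xF000 + ord(c) for c in RESERVED}
_DECODE = {pua: plain for plain, pua in _ENCODE.items()}


def pua_encode(name: str) -> str:
    return name.translate(_ENCODE)


def pua_decode(name: str) -> str:
    return name.translate(_DECODE)


def b64_encode(name: str) -> str:
    """Encode the whole name as URL-safe base64 text."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def b64_decode(name: str) -> str:
    """Decode a URL-safe base64 name back to the original text."""
    return base64.urlsafe_b64decode(name.encode("ascii")).decode("utf-8")


def round_trips(name: str, encode: Callable[[str], str], decode: Callable[[str], str]) -> bool:
    """Return True if decoding the encoded name gives back the original."""
    return decode(encode(name)) == name
```

With these definitions, `round_trips("acolon\uf03a", pua_encode, pua_decode)` is false, because the name passes through encoding unchanged and then decodes to `acolon:`. That is the ambiguity question 2 describes.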

@JanKanis

> > But the filesystem specifically using UTF-8 is not required for any encoder.
>
> Sorry, I don't follow. Can you clarify?

s/UTF-8/unicode/. It doesn't matter if a filesystem uses UTF8, they need to support unicode, any unicode encoding will do.

@JanKanis

> > FAT12/16/32 do support unicode, as utf-16 I suppose, so they should be able to handle the PUA encoder just fine.
>
> You may be right. I was going off of here, but it may be wrong. I will remove that claim.

The base filesystems only support 8.3 length non-unicode filenames, but Windows uses an extension to also store longer unicode filenames as an add-on.

@calmh
Member

calmh commented May 15, 2024

Thank you for writing this up, it's an excellent summary of the problem, your proposed solutions, and the potential issues. ❤️

For me, however, it also illustrates quite clearly why I'm disinclined to accept the proposal (and the corresponding PR). In a nutshell, the problem ("I want to sync filenames containing reserved/unsupported/special characters on any filesystem") is fairly easily avoided and/or corrected when it surfaces. The proposed solutions, however, are complicated and error prone, and the result of mistakes and misconfigurations much harder to reason about and fix than the original problem. In my mind this makes the cost higher than the benefit.

@rdebath

rdebath commented May 15, 2024

Some small points to start with ...

  • There seems to be a requirement for a machine with a "Default" encoder within the swarm. It should not be assumed that there will be a Unix host available; a swarm where every host runs some version of Windows seems rather likely. This would mean that if a particular host switches to a particular non-"Default" encoder, you need to be able to fix the "mess" from that host.
  • Please use vFAT rather than FAT for your encoder name. This is because, pedantically, FAT is a filesystem that only supports uppercase ASCII 8+3 filenames (with some extras depending on localisation) that can (IMO) only be upgraded by a full overlay filesystem like umsdos.
  • Don't forget that the valid characters on a Windows filesystem depend on the version of Windows (sometimes even build).

@rdebath

rdebath commented May 15, 2024

Oh and as a counterpoint.

The requirement seems to be that a particular host has all the files created on every other peer irrespective of the name it might be given here to overcome any local limitation. This is presumably useful for things like backup servers.

In that case a translation like the previously mentioned base64 would be acceptable, BUT might still hit a file length limitation. Taking a secure hash (MD5, SHA1 etc) of the pathname would give a name with four or five 8 character sections for any original filename which is (basically) guaranteed to be unique.

A small database containing a list of all the paths would be required to know what filenames are stored on the local FS. Working with the filesystem would be mostly trivial but there would be no method of migrating to or from this scheme except for adding another peer to the swarm. Though individual pathnames can be translated using simple tools like sha1sum so restoring particular files would be quite feasible.

Personally I'm more likely to make the backup server a Linux box.
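The hashed-name idea above can be sketched in a few lines of Python. This is only an illustration: the function name and the dash separator between the 8-character sections are assumptions. A SHA-1 digest gives five sections, and a 32-character digest such as MD5 would give four.

```python
import hashlib


def hashed_name(path: str, algorithm: str = "sha1") -> str:
    """Return the hex digest of the path, split into 8-character sections."""
    digest = hashlib.new(algorithm, path.encode("utf-8")).hexdigest()
    return "-".join(digest[i:i + 8] for i in range(0, len(digest), 8))
```

Because the hash cannot be reversed, the original paths have to be kept in a separate list or database, as described above.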


6 participants