minimize ambiguity in algorithm
- Add definition of "file/folder extension"
- Remove needless examples for extensions
- Give proper names to test cases in test plan and add explanations
- Reword some unclear statements
- Correct a handful of misspellings

Signed-off-by: fenn-cs <fenn25.fn@gmail.com>
Fenn-CS committed May 3, 2022
1 parent c57fbf8 commit 60a0f90
Showing 1 changed file with 17 additions and 57 deletions.
74 changes: 17 additions & 57 deletions docs/collisions.md
@@ -8,10 +8,10 @@ Permanent’s file system allows users to upload **files and folders** with exac

**In traditional file systems, name collisions are not allowed**. Typically such file systems handle name collisions by appending a number that uniquely identifies the incoming file (or folder) within parentheses. Sometimes there’s a possibility to override or replace duplicates but this is not a consideration for Permanent and rclone.

To achieve Permanent’s feature goal of letting it synchronize with other filesystems, a way needs to be found to ensure **name collisions which are allowed within permanent can be safely transferred and mirrored on external file systems that do not allow name collisions.**
To achieve Permanent’s feature goal of letting it synchronize with other file systems, a way needs to be found to ensure that **name collisions, which are allowed within Permanent, can be safely transferred and mirrored on external file systems that do not allow name collisions.**


## Name collision handling algorithm for external filesystems
## Name collision handling algorithm for external file systems

Since Permanent already knows how to handle collisions, the deduplication mechanism is designed to ensure that mapping files from Permanent to other file systems first of all works and then preserves, to the greatest extent possible, the structure the files had on Permanent. This makes it easy for users to recognize their file structure outside of Permanent.

@@ -48,8 +48,8 @@ For example if a folder has five 5 with the same name say **_A.txt _** the five


- Hence, the standard convention employed by most file systems is used: “space, open parenthesis, number, close parenthesis”, for example ` (5)`.
- The number within the parenthesis is a sequential calculated by increamenting until it constitutes a unique `downloadName` within the destination namespace or folder.
- If an incoming file of folder has a dedube string that can lead to recursive contecationation, the existing dedupe string is updated to until it is unique. For example say an incoming file has name `A (1).txt` if there were already two files both named `A.txt` then one would have the download name `A (1).txt` implying that the unique download name for the incomding file should predictively be `A (1) (1).txt`. However, multiple deduplication strings would be avoided and the number in the last pathensis closest to the extension would be updated. Conclusively, the incoming file `A (1).txt` in the stated example would assume a download name such as `A (1+n).txt` instead of `A (1)(1).txt` and subsequently or arrangements like `A (1)(1)(1).txt | A (2)(3)(1).txt` ....
- The number within the parentheses is a sequential number, calculated by incrementing until the result constitutes a unique `downloadName` within the destination namespace or folder.
- If an incoming file or folder already has a dedupe string that could lead to recursive concatenation, the existing dedupe string is updated until the name is unique. For example, say an incoming file is named `A (1).txt`. If there were already two files both named `A.txt`, then one of them would have the download name `A (1).txt`, implying that the unique download name for the incoming file would predictably be `A (1) (1).txt`. However, multiple deduplication strings are avoided; instead, the number in the last parentheses, closest to the extension, is updated. So the incoming file `A (1).txt` in this example would get a download name such as `A (1+n).txt` rather than `A (1)(1).txt`, which also avoids arrangements like `A (1)(1)(1).txt` or `A (2)(3)(1).txt`.

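The dedupe rules in the bullets above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's implementation: the function names `split_extension` and `unique_download_name` and the `taken` set (the download names already used in the destination folder) are assumptions made for the example, and the sketch follows only the bullets, not the full algorithm described in the next section.

```python
import re

# A trailing dedupe string such as " (3)" at the end of a base name.
DEDUPE_SUFFIX = re.compile(r"(?P<stem>.*) \((?P<n>\d+)\)")


def split_extension(name):
    """Split at the last dot; returns (base, suffix) where suffix keeps the dot."""
    base, dot, ext = name.rpartition(".")
    if not dot:
        return name, ""
    return base, dot + ext


def unique_download_name(name, taken):
    """Return a name not in `taken` (hypothetical helper, see lead-in)."""
    if name not in taken:
        return name
    base, suffix = split_extension(name)
    match = DEDUPE_SUFFIX.fullmatch(base)
    if match:
        # Reuse the existing dedupe string instead of appending another one.
        stem, n = match.group("stem"), int(match.group("n"))
    else:
        stem, n = base, 0
    while True:
        n += 1  # increment until the candidate is unique in this folder
        candidate = f"{stem} ({n}){suffix}"
        if candidate not in taken:
            return candidate


taken = {"A.txt", "A (1).txt"}
print(unique_download_name("A (1).txt", taken))  # A (2).txt
```

Here the incoming `A (1).txt` collides with an existing download name, so the number in its last parentheses is bumped rather than a second ` (1)` being appended.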
### Algorithm

@@ -66,58 +66,18 @@ For example if a folder has five 5 with the same name say **_A.txt _** the five

#### Extensions

For both files and folders the extension is any string that follows after the last dot.

For example:
<table>
<thead>
<th>File/Folder Name</th>
<th>Extension</th>
<th>Remark</th>
</thead>
<tbody>
<tr>
<td>My File.txt</td>
<td>txt</td>
<td>Regular</td>
</tr>
<tr>
<td>My File.txt(1)</td>
<td>txt(1)</td>
<td>Extension contains non-alphanumeric chars</td>
</tr>
<tr>
<td>My File.app</td>
<td>app</td>
<td>Regular</td>
</tr>
<tr>
<td>something.something</td>
<td>something</td>
<td>Name same as extension</td>
</tr>
<tr>
<td>2022.03.02</td>
<td>02</td>
<td>Extension is number</td>
</tr>
<tr>
<td>Folder.Work Files</td>
<td>Work Files</td>
<td>Extension contains spaces</td>
</tr>
</tbody>
</table>

As seen, extension can take infinite number of "shapes" BUT as long as all characters in the file name including the extension are valid file name chacters, then, **the extension is the string that follows after the last dot** (Emphasis).
For both files and folders, the extension is the string that follows the last dot in the `displayName` or `uploadFileName` (see the *Source of extensions* section below).

This means that an extension might contain spaces, parentheses, and other characters not traditionally thought of as extension characters. It also means that the last dot could also be the first dot, in which case the extension is the remaining string after that first dot.

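As a small illustration of this definition, the sketch below extracts the extension with a plain split on the last dot. The helper name `extension_of` is made up for the example. It also shows one place where the rule differs from Python's `os.path.splitext`, which treats a leading dot as part of the name rather than as the start of an extension.

```python
import os.path


def extension_of(name):
    """Everything after the last dot, or "" when the name has no dot."""
    _, dot, ext = name.rpartition(".")
    return ext if dot else ""


for name in ["My File.txt", "My File.txt(1)", "2022.03.02",
             "Folder.Work Files", ".bashrc", "README"]:
    print(repr(name), "->", repr(extension_of(name)))

# os.path.splitext ignores a leading dot, unlike the rule described here.
print(os.path.splitext(".bashrc"))  # ('.bashrc', '')
```

Under the definition used in this document, `.bashrc` has the extension `bashrc`, and `Folder.Work Files` has the extension `Work Files`.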
#### Source of extensions

a) The **source string for file extensions ON FILES** could be the **`displayName`** OR the **`uploadFileName`**. When a file is uploaded to Permanent if the extension is recognized, then the `displayName` is calculated as `uploadFileName - File Extension`. So, if `My File.txt` is uploaded the `displayName` is `My File` while the `uploadFileName` would be `My File.txt`. This means when Permanent recognizes the extension, it removes it (the extension) from the `displayName` and keeps track of it(the extension) via the `uploadFileName`. When Permanent does not recognize the extension, the `uploadFileName` is thesame as `displayName` **at the point of creation**. For exampl, in the case of `File My.abcxyz` both the `uploadFileName` and `displayName` would be `My File.abcxyz` and so in this case the source of the extension is the `displayName` even though the extension can be obtained from the `uploadFileName` it risk being wrong if the user ever updates the `displayName`. This is the case because an update to the `displayName` does not affect the `uploadFileName` correctly so.
a) The **source string for file extensions ON FILES** could be the **`displayName`** OR the **`uploadFileName`**. When a file is uploaded to Permanent and its extension is recognized, the `displayName` is calculated as `uploadFileName minus File Extension`. So, if `My File.txt` is uploaded, the `displayName` is `My File` while the `uploadFileName` is `My File.txt`. This means that when Permanent recognizes the extension, the `displayName` is stored without the extension, so the extension is tracked via the `uploadFileName`.

When Permanent does not recognize the extension, the `uploadFileName` is the same as the `displayName` **at the point of creation**. For example, for `My File.abcxyz`, both the `uploadFileName` and the `displayName` would be `My File.abcxyz`. In this case the source of the extension is the `displayName`: although the extension could also be obtained from the `uploadFileName`, it risks being wrong if the user ever updates the `displayName`, because an update to the `displayName` does not affect the `uploadFileName`, and rightly so.


b) The **source string for file extensions ON FOLDERS** is the `displayName` because Permanent does not (or at least did not ) do any kind of extension proccessing on folders, hence folder extensions are always visible in their `displayName`.
b) The **source string for file extensions ON FOLDERS** is the `displayName`, because Permanent does not (or at least did not) do any kind of extension processing on folders; hence folder extensions are always visible in their `displayName`.

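To make the two file cases concrete, here is a rough sketch of how the names could relate at upload time and after a rename. Everything in it is illustrative: the `FileRecord` class, the `create_record` function, and the `RECOGNIZED` set of extensions are assumptions for the example and do not describe Permanent's actual data model or its list of recognized extensions.

```python
from dataclasses import dataclass

# Hypothetical set of extensions that the service "recognizes".
RECOGNIZED = frozenset({"txt", "png", "jpg"})


@dataclass
class FileRecord:
    upload_file_name: str  # fixed at upload time
    display_name: str      # may be edited later by the user


def create_record(upload_file_name, recognized=RECOGNIZED):
    """Derive the display name as described above (illustrative only)."""
    base, dot, ext = upload_file_name.rpartition(".")
    if dot and ext in recognized:
        # Recognized: displayName is uploadFileName minus the extension.
        return FileRecord(upload_file_name, base)
    # Not recognized: both names start out identical.
    return FileRecord(upload_file_name, upload_file_name)


known = create_record("My File.txt")
unknown = create_record("My File.abcxyz")
print(known.display_name)    # My File
print(unknown.display_name)  # My File.abcxyz

# After a rename, the upload name still carries the old extension.
unknown.display_name = "My File.v2"
print(unknown.upload_file_name.rpartition(".")[2])  # abcxyz
print(unknown.display_name.rpartition(".")[2])      # v2
```

The last two lines show why the `displayName` is the safer source when the extension was not recognized: the `uploadFileName` no longer reflects what the user sees after the rename.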

### Reserved/unsupported characters
@@ -129,7 +89,7 @@ Characters that do not map to various file systems would be encoded. For example
* [Path Naming Conventions](https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file)
* [Reserved Characters](https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words)

_What each unsupported character encodes to has to ultimately be decided and the reference table developed if neccesary._
_What each unsupported character encodes to has to ultimately be decided and the reference table developed if necessary._

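Since the actual encoding is still undecided, the following is only a placeholder sketch of what encoding unsupported characters might look like. It assumes a percent-style scheme and uses the characters reserved in Windows file names as the unsupported set. Both choices, and the name `encode_name`, are assumptions for illustration rather than decisions made in this document.

```python
# Characters reserved in Windows file names (see the links above).
WINDOWS_RESERVED = '<>:"/\\|?*'


def encode_name(name, reserved=WINDOWS_RESERVED):
    """Replace each reserved character with %XX (placeholder scheme)."""
    out = []
    for ch in name:
        # "%" is escaped as well so that the encoding stays reversible.
        if ch == "%" or ch in reserved:
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


print(encode_name("a:b?.txt"))  # a%3Ab%3F.txt
```

Whatever scheme is eventually chosen, the encoded name would still need to pass through the same uniqueness check as any other `downloadName`, since two different source names could in principle encode to the same string under a lossy scheme.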

## Example
@@ -435,13 +395,13 @@ NB: downloadFileName & downloadFolderName **should be recalculated after file or

## Testing Plan

- Test generates directory-unique & correctly formatted `downloadName` for n colliding files in the same namespace (parent folder).
- Test generates directory-unique & correctly formatted `downloadName` for n colliding folders in the same namespace.
- Test generates directory-unique & correctly formatted `downloadName` for n colliding files & folders in the same namespace.
- Test generates directory-unique & correctly formatted `downloadName` for incoming files and folders holding colliding deduplication strings.
- Test generates directory-unique & correctly formatted `downloadName` for files and folders with uncommon naming styles.
- **File to file collision tests**: Test that colliding files get a directory-unique & correctly formatted `downloadName`, for instance where 2 or more files in the same folder have the same name.
- **Folder to folder collision tests**: Test that colliding folders get a directory-unique & correctly formatted `downloadName`, for instance where 2 or more folders in the same folder have the same name.
- **File to folder collision tests**: Test that files colliding with folders, or vice versa, get a directory-unique & correctly formatted `downloadName`, for instance where 2 or more *files AND folders* in the same folder have the same name.
- **File/Folder with dedupe string tests**: Test that files and folders with existing deduplication strings in their names get a directory-unique & correctly formatted `downloadName`, for instance files whose original name contains a dedupe string, such as `My File (1).png`.
- **Edge case tests**: Test that files and folders with uncommon naming styles (unusual extensions, uncommon characters in the name, names with many dots, etc.) get a directory-unique & correctly formatted `downloadName`.

*Directory-unique: Download names need be unique only for files and folders in thesame directory*
*Directory-unique: Download names need be unique only for files and folders in the same directory*

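One way the "directory-unique" property could be checked in such tests is sketched below. The helper `find_collisions` and the shape of its input (pairs of parent folder and download name) are assumptions for the example. The sketch compares names exactly and does not model case-insensitive file systems.

```python
from collections import Counter


def find_collisions(entries):
    """Return (parent_folder, download_name) pairs that occur more than once."""
    counts = Counter(entries)
    return sorted(pair for pair, count in counts.items() if count > 1)


entries = [
    ("/docs", "A.txt"),
    ("/docs", "A (1).txt"),
    ("/pics", "A.txt"),  # same name, different folder: allowed
]
print(find_collisions(entries))  # []
```

A test in any of the categories above could then assert that `find_collisions` returns an empty list for the generated download names.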

# Synchronization