Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple files created on multiple upload #108

Open
jvasile opened this issue Mar 16, 2023 · 9 comments
Open

Multiple files created on multiple upload #108

jvasile opened this issue Mar 16, 2023 · 9 comments

Comments

@jvasile
Copy link

jvasile commented Mar 16, 2023

rclone copy of the same file multiple times results in multiple copies.

Expected behavior: rclone and sftp-service notice that the file is already there and doesn't make a second copy

@jvasile jvasile closed this as completed Mar 17, 2023
@jvasile
Copy link
Author

jvasile commented Mar 17, 2023

And I just now got it consistently:

Grab https://github.com/OpenTechStrategies/permanent-rclone-qa/ and run generate-tree.py, then (assuming you have an rclone remote named permanent-prod):

./upload-test.py test-tree/challenging-names --only=414 --remote-dir=test-414 --log-file=414.txt --remote=prod
./upload-test.py test-tree/challenging-names --only=414 --remote-dir=test-414 --log-file=414.txt --remote=prod

Then look in the test-414 dir and you'll see two copies.

Interestingly, I made a test specifically to check for dupes. And when it runs, only one file results on the server. The following command upload twice without creating a dupe in the misc folder:

./upload-test.py test-tree/misc --only=002 --remote=prod --remote-dir=misc

@jvasile jvasile reopened this Mar 17, 2023
@slifty
Copy link
Contributor

slifty commented Mar 22, 2023

This may have been an issue related to the deployed state of the cache, which had a bug where it would trust the cache a little too much when checking for the existence of files. If that's the case, it will be resolved by #117

I'll give this a try as well though, after production deployment, and if it seems to be working I'd ask that you re-run again to confirm that it's working for you too.

@jvasile
Copy link
Author

jvasile commented Mar 28, 2023

@Fenn-CS Would you please re-run this and report back here?

@Fenn-CS
Copy link
Contributor

Fenn-CS commented Mar 28, 2023

Unfortunately, this bug still exists.

Running

./upload-test.py test-tree/challenging-names --only=414 --remote-dir=test-414 --log-file=414.txt --remote=prod --archive-path="/archives/QA (0a21-0000)/My Files/"

each time would result in the sftp-service reporting

2023/03/28 23:31:08 DEBUG : front_414_back.*txt: Need to transfer - File not found at Destination
2023/03/28 23:31:17 DEBUG : front_414_back.*txt: Not found after upload with set_modtime=false so returning best guess
2023/03/28 23:31:17 INFO  : front_414_back.*txt: Copied (new)

The resulting effect is that a new file would be created on Permanent which can be seen in the UI.

@Fenn-CS
Copy link
Contributor

Fenn-CS commented Apr 2, 2023

Update.

This bug seems to only affect files with asterisks (*) in their name.

Debug path

  • Verified that it's not just weird characters in extensions that cause the problem. For example test with 499 (with a .=png extension) but just asterisk
    • Consistent with asterisk hypothesis
  • Test with files that have weird patterns but are not affected by this bug examples (408, 409, 410)
    • Consistent with asterisk hypothesis
  • Confirm that the bug is not related to the content of the file, copied 409 that is front_409_back.)png to front_409_back.*png (that now includes *) and run upload.
    • Consistent with asterisk hypothesis (same file which was not affected, renamed to contain an asterisk is now affected)
  • Upload a range of files containing asterisks (*, file*.txt, file.*txt, *.txt)
├── *
├── file*.txt
├── file.*txt
└── *.txt
- Shows that not just extensions containing asterisk are the problem 
- Shows that a file containing an asterisk at any position would be affected

Other Observations

Looks like our normalization of names is too wild? or maybe not BUT here is what display names look like on permanent

    $fsNameToPermanentDisplayName = [
        '*' => '-',
        'file*.txt' => 'file-.txt',
        'file.*txt' => 'file.*txt',
        '*txt' => '-.txt'
    ];
  • The loop of the map indicates the normalization that should end at the level of downloadName is equally extended to the displayName this is the cause of this bug but could be a related bug "fileSystemCompatibleName treated as displayName and downloadName when uploading".

    • In other words we should not normalize names when uploading, because they are coming from a regular file system and would only contain supported characters. Permanent would handle normalization but just of downloadName (or is this a backend bug where the backend is normalizing the displayName ? )
  • Obviously and on a more lighter note, the display names discussed above would appear as many times as the upload command is run on the same set of files, hence establishing the consistency of the bug with this set of files.

@slifty
Copy link
Contributor

slifty commented Apr 3, 2023

Interesting -- is there a reason that our normalizer is replacing * with -? Maybe we need to adjust that?

@slifty
Copy link
Contributor

slifty commented Apr 19, 2023

@Fenn-CS what's the current status of this? Is there action that needs to happen on the backend / downloadName logic to prevent it?

@Fenn-CS
Copy link
Contributor

Fenn-CS commented Apr 19, 2023

I and @jvasile deprioritized this. To focus on testing if regular file names that are more common work correctly.

As for this, If I should continue looking at this I think it should be from the backend.

cc @slifty

@jvasile
Copy link
Author

jvasile commented Apr 20, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants