Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Methods that move/remove files on the cloud don't remove the cached version #388

Open
parviste-fortum opened this issue Dec 20, 2023 · 3 comments
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@parviste-fortum
Copy link

The following code

import cloudpathlib
s3_path = cloudpathlib.S3Path(f"s3://my-bucket/foo.txt")
s3_path.open("x").close() # Create empty file
s3_path.unlink()
s3_path.open("x").close() # Create empty file

does not work as expected. I would expect it to create a file, remove it, and then create it again. Instead, I get

Traceback (most recent call last):
  File "/workspaces/scoutingtool/backend/../cpltest.py", line 9, in <module>
    s3_path.open("x").close() # Create empty file
    ^^^^^^^^^^^^^^^^^
  File "/home/vscode/.local/lib/python3.11/site-packages/cloudpathlib/cloudpath.py", line 503, in open
    buffer = self._local.open(
             ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/pathlib.py", line 1044, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileExistsError: [Errno 17] File exists: '/tmp/tmpg5cu85sn/my-bucket/foo.txt'

This seems to happen because the file is kept in cache, even when it was removed from S3. A simple solution would probably to use the "w" mode, even when "x" was supplied.

@pjbull
Copy link
Member

pjbull commented Dec 20, 2023

Thanks for the bug report @parviste. Another workaround is to use a different caching mode that more aggressively clears the cache.

I think this is a good bug, and we don't currently update the cache on any of the cloud file move/delete scenarios. It seems worth a fix here to remove files from the cache for the following methods:

  • unlink
  • rename
  • replace
  • rmdir
  • rmtree

@pjbull pjbull added bug Something isn't working good first issue Good for newcomers labels Dec 20, 2023
@pjbull pjbull changed the title Can't recreate removed file in S3 using mode "x" Methods that move/remove files on the cloud don't remove the cached version Dec 20, 2023
@parviste-fortum
Copy link
Author

parviste-fortum commented Dec 21, 2023

It's probably true that the cache should be cleared after moving/removing files, but at the same time, it also doesn't feel fully correct to necessarily write the cache files using the same mode as the files written to S3.

In particular, it would make sense to me to either:

  1. Always write cache files using mode "x" (even when the "w" mode was specified for opening the remote file), as this would allow catching bugs in cloudpathlib where a cached file was written twice (which shouldn't ever really happen?)
  2. Always write cache files using mode "w" (even when "x" mode was specified for opening the remote file), as this would allow us to be able to survive bugs like the one I encountered by simply overwriting the "stale" file, making the application more robust.

I can't really see a situation where it makes sense to write the cache files using the exact mode specified for the remote file, sometimes "x" sometimes "w".

@pjbull
Copy link
Member

pjbull commented Dec 21, 2023

It keeps things simpler to not have special cases for different types of writes, so I'd rather have the cache state well managed than make assumptions with the writing mode. For example, if your original example was done with a instead of x and actually added some text to the file, the output would be wrong unless we both mirrored the state of the remote and opened the cache file in the same mode.

We currently do the following on writing to the cloud: (1) check if file is on the remote before writing with x and (2) refresh the cache before doing any writing,

I think the other change we want here is in _refresh_cache to delete the cache file if the remote file does not exist.

@drivendataorg drivendataorg deleted a comment from Sharadgup Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants