Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some libraries generate confusing errors when opening CloudPath objects #315

Open
pjbull opened this issue Jan 24, 2023 · 6 comments
Open

Comments

@pjbull
Copy link
Member

pjbull commented Jan 24, 2023

The main problem is the libs based on opening files with fsspec won't work properly with CloudPaths.
They will call fspath to resolve it (instead of str()) and therefore the file called will be cached (this is not exaclty what we want, and this will be done silently)

For single files like tif this will be not optimal but will work (your example)
But for nested files likes VRT, this will crash as only the header is cached and not the relative files. This will break the paths inside the VRT.
Note that GDAL (and therefore rasterio) can read VRT stored on the cloud, and this is the expected behavior.

A little example:

vrt = AnyPath("https://s3.unistra.fr/merge_32-31.vrt")
rasterio.open(str(vrt)) # will work
rasterio.open(vrt) # will fail for the wrong reason, saying that the files linked inside the VRT don't exist

Originally posted by @remi-braun in #96 (comment)

@pjbull
Copy link
Member Author

pjbull commented Jan 24, 2023

There aren't great options here. This is what I see. Open to other options if people have them. Maybe some combination of these are good mitigations.

Stop being os.PathLike

A lot of these libraries do something like:

def open(f):
    is instance(f, os.PathLike):
        str_path = os.fspath(f)
    ...

They then have special-case logic to handle str_path and work with the local filesystem path. This only works with CloudPaths in read scenarios for single files. Other scenarios (writing, multi-file formats, directory walking) are all broken with code that works like this.

When the code eventually goes looking on the filesystem relative to str_path it errors in potentially confusing ways.

If we are no longer os.PathLike then code will likely error earlier with clearer error messages.

Pros:

  • Possibly clearer error messages

Cons:

  • Breaks single-file read scenarios, which are very common.

Cache entire directory on fspath call

We could do something like the following (and make this an option)

def __fspath__(self):
    if self.is_dir():
        self.download_to(self._local)  # cache entire directory recursively on `fspath` call      

Pros:

  • No missing files when relative ones are referred to.
  • Could be optional based on setting/envvar

Cons:

  • Likely a lot of wasted network traffic and disk space usage (depending on what is in the directory)

Warn on __fspath__ being called

We could just always warn about writing/multi-file whenever __fspath__ is called.

def __fspath__(self):
    warnings.warn(f"File {self} has been converted to a local path; writing and uploading or opening multi-file formats will not work as expected.")
    ...

Pros:

  • Raise awareness of what might not work as expected earlier

Cons:

  • Pollutes logs/warnings stream for the many scenarios where things do work as expected

Try to detect who is trying to do what and how and raise errors in those cases

We may be able to provide more useful errors if we can detect who is calling .str, .__fspath__ or the built-in open function.

Pros:

  • We can raise Exceptions for scenarios that will not work

Cons:

Provide documentation for other library developers to support CloudPaths

The best way to support CloudPaths is for developers to just use the pathlib API throughout their code base. Since we support the majority of the pathlib API, then most code should work reasonably.

We could add a documentation page with recommendations on how to do this effectively. We could provide code snippet examples. Here are the key points:

  • Move from using os. methods for path manipulation to the pathlib API
  • Convert passed strings into Path or CloudPath objects immediately
  • Keep paths as *Path* objects as long as possible. Don't do calls to str or fspath until you absolutely need to open a single file.

Pros:

  • We could do this today.

Cons:

  • Requires implementation changes in other libraries before any end-user results.

Coordinate a sentinel convention to trigger fsspec code

We could potentially have a sentinel _fsspec property that indicates to a library they should open with fsspec. For example:

# in cloudpathlib

@property
def _fsspec(self):
    return str(self)
# in fsspec 
def open(f):
    if hasattr(f, '_fsspec'):
        f = f._fsspec
# in other libraries
def open(f):
    is if hasattr(f, '_fsspec'):
        handle = fsspec.open(f)
    ...

Alternatively, fsspec write a shim to handle CloudPath objects:

# in fsspec 
def open(f):
    if isinstance(f, CloudPath):
        str_cloud_path = str(f)
        ... # dispatch with normal `fsspec`

Either would require that third parties pass the object to fsspec without manipulating it first (e.g., they should not call str or fspath).

Pros:

  • CloudPath objects play nicely in the fsspec ecosystem

Cons:

  • Requires coordination across libraries/maintainers
  • Still requires implementation updates for other libraries

@remi-braun
Copy link

Just to add that the workaroud is for the user to pass str(cloudpath) to the problematic libraries, so this is not urgent (I think).
But it would be very comfortable to have a universal way of handling cloud paths (I cannot help you for that, this is way beyond my skills haha)

@pjbull
Copy link
Member Author

pjbull commented Jan 25, 2023

Good point, thanks for adding!

@barneygale
Copy link

FWIW I'm working on adding a pathlib.AbstractPath class in the standard library. If that's introduced, and libraries like cloudpathlib start subclassing it, then other code can do:

def do_thing(path):
    if not isinstance(path, pathlib.AbstractPath):
        path = pathlib.Path(path)
    with path.open('rb') as f:
        ...

In this way, functions can accept string paths, pathlib.Path objects, and any other objects that inherit from pathlib.AbstractPath and therefore provide an open() method.

FWIW, I'm also planning to make __fspath__() a Path-only feature (not present in PurePath or AbstractPath) so that open(ZipFile('README.txt', archive=ZipFile('foo.zip'))) doesn't try to open README.txt in the current directory, which would be massively confusing.

@pjbull
Copy link
Member Author

pjbull commented Jan 25, 2023

Thanks @barneygale! Yes, we'd support having a way to signal pathlib API compatibility in the standard library so that we can signal to third part libraries to treat us like a path (which really helps with the problems listed above about people calling os.fspath too early.

Will os.PathLike still be dependent on __fspath__ being implemented? I guess in that case we would want to do is not be os.PathLike so that we don't get fspath'd, but be an AbstractPath so we signal to libraries to use pathlib functions.

Let me know if we can help testing or reviewing anything there.

(That said, this still puts us in the "get other library maintainers to rewrite their code to use/support pathlib APIs" solution bucket.)

@barneygale
Copy link

I'm not planning to change anything about os.PathLike, os.fspath() or __fspath__(), except that I'll move PurePath.__fspath__() to Path.__fspath__(). The proposed AbstractPath would sit between those classes, so it wouldn't have __fspath__(), and therefore wouldn't be path-like. This will require a proper deprecation process in CPython (2 or more minor releases) in which we raise warnings from PurePath.__fspath__().

I think it's a losing game trying to provide __fspath__() in other path implementations - it's impossible to know exactly how the function that calls os.fspath() intends to use the resulting string path. You can do clever things with caching of course, and unfortunately that's going to be necessary for at least another couple years.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants