Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joining a cloudpath and string representing an absolute path removes the bucket name #344

Open
chopeen opened this issue Jun 23, 2023 · 6 comments

Comments

@chopeen
Copy link

chopeen commented Jun 23, 2023

from cloudpathlib import CloudPath

# absolute local path
local_path = '/tmp/some-file.txt'

# the local path is used directly to construct the cloud path
cloud_path = CloudPath('s3://some-bucket') / local_path

# the upload will fail
cloud_path.upload_from(local_path)

There are two errors in the output:

  • An error occurred (403) when calling the HeadObject operation: Forbidden
  • An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied

but neither of them explains the actual cause.

Traceback (most recent call last):
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 186, in _s3_file_query
    self.client.head_object(
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/cloudpath.py", line 842, in upload_from
    if self.exists() and self.is_dir():
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/cloudpath.py", line 389, in exists
    return self.client._exists(self)
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 179, in _exists
    return self._s3_file_query(cloud_path) is not None
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 197, in _s3_file_query
    return next(
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/cloudpathlib/s3/s3client.py", line 198, in <genexpr>
    (
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/boto3/resources/collection.py", line 81, in __iter__
    for page in self.pages():
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/boto3/resources/collection.py", line 171, in pages
    for page in pages:
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/users/foo/.pyenv/versions/bar-env/lib/python3.10/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
@pjbull pjbull changed the title When an absolute path is specified as S3 key, the upload fails with misleading error "An error occurred (403) when calling the HeadObject operation: Forbidden" Joining a cloudpath and string representing an absolute path removes the bucket name Jun 23, 2023
@chopeen
Copy link
Author

chopeen commented Jun 23, 2023

Thanks @pjbull for correcting the title! You are right and that's exactly what the issue is:

>>> from cloudpathlib import CloudPath
>>> local_path = '/tmp/some-file.txt'
>>> cloud_path = CloudPath('s3://some-bucket') / local_path

>>> print(cloud_path)
s3://tmp/some-file.txt
    ⬆️
  bucket name is missing in the path

@pjbull
Copy link
Member

pjbull commented Jul 5, 2023

Thanks @chopeen for this issue and the PR with a proposed fix. Apologies for the holdup, but I'm not 100% confident that this is the "right" fix, though I understand your use case.

We need to answer the question of whether a bucket or a cloud-provider is the equivalent of a drive.

As things work now, usually the root level / is just the cloud provider. This lets you work across buckets simply, like the following:

from cloudpathlib import CloudPath
cp = CloudPath("s3://bucket1/dir/file.txt")
cp
#> S3Path('s3://bucket1/dir/file.txt')

# reference a file in another bucket with the same client
cp2 = (cp / "/bucket2/dir/file.txt")
cp2
#> S3Path('s3://bucket2/dir/file.txt')

It is a useful convention since, for example, it let us do things like list all the available buckets on a provider, which was a requested feature. I do think with either option I present below we'll want to keep at least special casing s3:// to be able to list buckets like this.

from cloudpathlib import CloudPath

list(CloudPath("s3://").iterdir())
#> [ S3Path('s3://bucket1'),
#>   S3Path('s3://bucket2'),
#>   ... ]

That said, we're not consistent everywhere. For example, .parents does not include the provider root and .drive returns the bucket. Here's a comparison with WindowsPath behavior since posix paths don't have the same distinction:

from pathlib import PureWindowsPath
wp = PureWindowsPath("C:/bucket1/dir/file.txt")

wp.parts
#> ('C:\\', 'bucket1', 'dir', 'file.txt')

list(wp.parents)
#> [ PureWindowsPath('C:/bucket1/dir'),
#>   PureWindowsPath('C:/bucket1'),
#>   PureWindowsPath('C:/')]


from cloudpathlib import CloudPath
cp = CloudPath("s3://bucket1/dir/file.txt")

cp.parts
#> ('s3://', 'bucket1', 'dir', 'file.txt')

list(cp.parents)
#> [S3Path('s3://bucket1/dir'), S3Path('s3://bucket1')]

cp.anchor
#> 's3://'

wp.anchor
#> 'C:\\'

cp.drive
#> 'bucket1'

wp.drive
#> 'C:'

This example may also be a little misleading since the .drive property is a string, not a Path. I do think we should be consistent, and I'd like to get some community input on which interpretation is most natural.

Please add a 👍 if you think that the bucket should be the drive and absolute paths are relative to the bucket and a 👎 if you think the provider should be the drive and absolute paths are relative to the provider.

  • 👍 : (CloudPath("s3://b1") / "/") == CloudPath("s3://b1")
  • 👎 : (CloudPath("s3://b1") / "/") == CloudPath("s3://")

It is worth noting, that this will also affect a number of additional properties and methods like relative_to, glob, rmdir, and likely others.

Finally, either way we go this change would need a deprecation cycle and warnings since folks may rely on the existing inconsistent behavior.

@chopeen
Copy link
Author

chopeen commented Jul 5, 2023

Thank you for a detailed response!

For me, (CloudPath("s3://b1") / "/") == CloudPath("s3://") is misleading, because / for paths is an additive operation and therefore it makes no sense for b1 to suddenly disappear.


s3:// is URI scheme and the equivalent for local paths is file:// https://en.m.wikipedia.org/wiki/File_URI_scheme.

@pjbull
Copy link
Member

pjbull commented Jul 5, 2023

For me, (CloudPath("s3://b1") / "/") == CloudPath("s3://") is misleading, because / for paths is an additive operation and therefore it makes no sense for b1 to suddenly disappear.

I understand what you're saying, but I expect the opposite will be more confusing to other developers. Paths that start with / are absolute paths, so in pathlib Path('/dir') / '/dir2/file.txt' will result in Path('/dir2/file.txt'):

>>> from pathlib import Path
>>> Path('/dir1') / '/dir2/file.txt'
PosixPath('/dir2/file.txt')

I think that because of this any relative path (whether local or cloud) when represented as a string should not start with a /.

@chopeen
Copy link
Author

chopeen commented Jul 6, 2023

in pathlib Path('/dir') / '/dir2/file.txt' will result in Path('/dir2/file.txt')

Wow, that's surprising! I have just checked and it is documented, but I was not aware of it anyway.

Now I understand why you'd like to keep pathlib and cloudpathlib behave consistently when concatenating an absolute path.

Should I change my PR to only log a warning in files#diff-7fa444066bc9ac67a45a3ccd86ec9a23f16b3ce4a0c48752578fe51db3d80346R695, to explicitly communicate something like f'Segment {other} is an absolute path, so operator "/" will ignore all previous segments; that may include the bucket name'? Or abandon it altogether?

@pjbull
Copy link
Member

pjbull commented Jul 22, 2023

Thanks @chopeen, I think that we'll close that PR for now since we generally avoid emitting noisy warnings, and for the most part if someone hits this an exception will bubble up immediately.

I'm going to leave this issue open though to focus in getting consistent on what we want the "drive" of a CloudPath to be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants