
s3fs.copy fails when using endpoint_url=endpoint_url for s3fs.S3FileSystem(...) #824

Open
RobinHolzingerQC opened this issue Nov 15, 2023 · 7 comments

@RobinHolzingerQC

Working with the move and copy functions of s3fs, I encountered a problem: the current implementation seems to cause issues when an endpoint_url is specified in the constructor of s3fs.S3FileSystem.

(V1) Without endpoint_url, all functions (read, write, move, copy) work as expected when paths are specified with a <bucket_name>/ prefix.

(V2) With endpoint_url, read, write, remove, etc. still work without issues; however, copy (and therefore move) fails with FileNotFoundError: The specified bucket does not exist.

Examples (setup)

import s3fs

base_path = "channel1/subdir/"
dummyfile_name = "dummy_file"

bucket_name = "rh-devbox"
region_name = "eu-central-1"
endpoint_url = f"https://{bucket_name}.s3.{region_name}.amazonaws.com/"

aws_access_key_id = "..."
aws_secret_key = "..."

Example 1 - without endpoint_url (fully working)

## Variant 1: don't use endpoint_url (works with fs.copy, fs.move)
fs = s3fs.S3FileSystem(
    key=aws_access_key_id,
    secret=aws_secret_key,
    client_kwargs={
        "region_name": "eu-central-1",
    }
)

# bucket information needs to be encoded in paths
src_path = f'{bucket_name}/{base_path}{dummyfile_name}'
dst_path = f'{bucket_name}/{base_path}{dummyfile_name}_dst'

print(fs.read_text(src_path, encoding='utf-8'))
fs.copy(src_path, dst_path)
print(fs.read_text(dst_path, encoding='utf-8'))
fs.rm_file(dst_path)

Example 2 - with endpoint_url (copy not working)

fs = s3fs.S3FileSystem(
    key=aws_access_key_id,
    secret=aws_secret_key,
    endpoint_url=endpoint_url,
    client_kwargs={
        # "endpoint_url": endpoint_url,
        "region_name": "eu-central-1",
    }
)

# no bucket information in paths
src_path = f'{base_path}{dummyfile_name}'
dst_path = f'{base_path}{dummyfile_name}_dst'

print(fs.read_text(src_path, encoding='utf-8'))
fs.copy(src_path, dst_path)
print(fs.read_text(dst_path, encoding='utf-8'))
fs.rm_file(dst_path)

Error:

---------------------------------------------------------------------------
NoSuchBucket                              Traceback (most recent call last)

<...>

NoSuchBucket: An error occurred (NoSuchBucket) when calling the CopyObject operation: The specified bucket does not exist

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)

<...>

FileNotFoundError: The specified bucket does not exist
@martindurant
Member

Can you please include the traceback, so I can see which branch within cp_file is failing and where?

@RobinHolzingerQC
Author

Thanks @martindurant for taking a look at this! Here is my traceback and some insights from debugging:

traceback:

---------------------------------------------------------------------------
NoSuchBucket                              Traceback (most recent call last)
File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py:112, in _error_wrapper(func, args, kwargs, retries)
    111 try:
--> 112     return await func(*args, **kwargs)
    113 except S3_RETRYABLE_ERRORS as e:

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/aiobotocore/client.py:383, in AioBaseClient._make_api_call(self, operation_name, api_params)
    382     error_class = self.exceptions.from_code(error_code)
--> 383     raise error_class(parsed_response, operation_name)
    384 else:

NoSuchBucket: An error occurred (NoSuchBucket) when calling the CopyObject operation: The specified bucket does not exist

The above exception was the direct cause of the following exception:

FileNotFoundError                         Traceback (most recent call last)
/Users/robinholzinger/robin/test/s3fs_test/debug.ipynb Cell 2 line 1
     12 dst_path = f'{base_path}{dummyfile_name}_dst'
     14 print(fs.read_text(src_path, encoding='utf-8'))
---> 15 fs.copy(src_path, dst_path)
     16 # print(fs.read_text(dst_path, encoding='utf-8'))
     17 # fs.rm_file(dst_path)

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py:118, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    115 @functools.wraps(func)
    116 def wrapper(*args, **kwargs):
    117     self = obj or args[0]
--> 118     return sync(self.loop, func, *args, **kwargs)

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py:103, in sync(loop, func, timeout, *args, **kwargs)
    101     raise FSTimeoutError from return_result
    102 elif isinstance(return_result, BaseException):
--> 103     raise return_result
    104 else:
    105     return return_result

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py:56, in _runner(event, coro, result, timeout)
     54     coro = asyncio.wait_for(coro, timeout=timeout)
     55 try:
---> 56     result[0] = await coro
     57 except Exception as ex:
     58     result[0] = ex

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/fsspec/asyn.py:390, in AsyncFileSystem._copy(self, path1, path2, recursive, on_error, maxdepth, batch_size, **kwargs)
    388 if on_error == "ignore" and isinstance(ex, FileNotFoundError):
    389     continue
--> 390 raise ex

File ~/micromamba/envs/s3fs-test/lib/python3.11/asyncio/tasks.py:452, in wait_for(fut, timeout)
    449 loop = events.get_running_loop()
    451 if timeout is None:
--> 452     return await fut
    454 if timeout <= 0:
    455     fut = ensure_future(fut, loop=loop)

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py:1745, in S3FileSystem._cp_file(self, path1, path2, preserve_etag, **kwargs)
   1740     await self._copy_etag_preserved(
   1741         path1, path2, size, total_parts=int(parts_suffix)
   1742     )
   1743 elif size <= MANAGED_COPY_THRESHOLD:
   1744     # simple copy allowed for <5GB
-> 1745     await self._copy_basic(path1, path2, **kwargs)
   1746 else:
   1747     # if the preserve_etag is true, either the file is uploaded
   1748     # on multiple parts or the size is lower than 5GB
   1749     assert not preserve_etag

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py:1626, in S3FileSystem._copy_basic(self, path1, path2, **kwargs)
   1624     if ver1:
   1625         copy_src["VersionId"] = ver1
-> 1626     await self._call_s3(
   1627         "copy_object", kwargs, Bucket=buc2, Key=key2, CopySource=copy_src
   1628     )
   1629 except ClientError as e:
   1630     raise translate_boto_error(e)

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py:339, in S3FileSystem._call_s3(self, method, *akwarglist, **kwargs)
    337 logger.debug("CALL: %s - %s - %s", method.__name__, akwarglist, kw2)
    338 additional_kwargs = self._get_s3_method_kwargs(method, *akwarglist, **kwargs)
--> 339 return await _error_wrapper(
    340     method, kwargs=additional_kwargs, retries=self.retries
    341 )

File ~/micromamba/envs/s3fs-test/lib/python3.11/site-packages/s3fs/core.py:139, in _error_wrapper(func, args, kwargs, retries)
    137         err = e
    138 err = translate_boto_error(err)
--> 139 raise err

FileNotFoundError: The specified bucket does not exist

Here are some highlighted snippets from the stack trace, annotated with variable values.

# core.py:1742
async def _cp_file(...):
    # 'channel1/subdir/dummy_file' = self._strip_protocol('channel1/subdir/dummy_file')
    path1 = self._strip_protocol(path1)

    # `channel1`, `subdir/dummy_file`, None = self.split_path('channel1/subdir/dummy_file')
    # NOTE: the split is already semantically wrong, because the path does not contain the bucket_name;
    # it does not cause the error at this point, though
    bucket, key, vers = self.split_path(path1)

    ...

    # await self._copy_basic('channel1/subdir/dummy_file', 'channel1/subdir/dummy_file_dst', **{})
    # NOTE: issue has not propagated to this point
    await self._copy_basic(path1, path2, **kwargs)
# core.py:1613
async def _copy_basic(self, path1, path2, **kwargs):
    # `channel1`, `subdir/dummy_file`, None = self.split_path('channel1/subdir/dummy_file')
    buc1, key1, ver1 = self.split_path(path1)

    # `channel1`, `subdir/dummy_file_dst`, None = self.split_path('channel1/subdir/dummy_file_dst')
    buc2, key2, ver2 = self.split_path(path2)

    # NOTE: here the wrong splits seem to have more severe consequences

    ...

    # {"Bucket": `channel1`, "Key": `subdir/dummy_file`}
    copy_src = {"Bucket": buc1, "Key": key1}

    # await self._call_s3("copy_object", kwargs, Bucket=`channel1`, Key=`subdir/dummy_file_dst`, CopySource={"Bucket": `channel1`, "Key": `subdir/dummy_file`})
    await self._call_s3("copy_object", kwargs, Bucket=buc2, Key=key2, CopySource=copy_src)

    ...

Other methods like s3fs.rm seem to suffer from the same bad path splits, but there they seem to have no effect, even though the bucket name is specified wrongly.

fs.rm_file('channel1/subdir/testfile')

# core.py:1803 
async def _rm_file(self, path, **kwargs):
    # `channel1`, `subdir/testfile`, None = self.split_path('channel1/subdir/testfile')
    bucket, key, _ = self.split_path(path)

    # await self._call_s3("delete_object", Bucket=`channel1`, Key=`subdir/testfile`)
    await self._call_s3("delete_object", Bucket=bucket, Key=key)

After several iterations, I found the following:

Once we patch CopySource to use the correct Bucket and Key names, the API call succeeds:

# await self._call_s3("copy_object", kwargs, Bucket=`channel1`, Key=`subdir/dummy_file_dst`, CopySource={"Bucket": `rh-devbox`, "Key": `channel1/subdir/dummy_file`})
await self._call_s3("copy_object", kwargs, Bucket=buc2, Key=key2, CopySource=copy_src)

@martindurant
Member

Sorry, I'm not immediately seeing what you mean by "bad splits"

@RobinHolzingerQC
Author

When calling self.split_path(...) on channel1/subdir/testfile, it yields channel1 (a root-level directory) as the bucket name, even though the bucket name isn't encoded in the input string at all. I think the implementation assumes a path always includes the bucket_name? Strangely, using the root-level directory name as the bucket name does not cause issues with e.g. s3fs.rm.
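
For illustration, a minimal sketch of the split I mean (anon=True is only used here so the snippet runs without credentials; it is not part of my actual setup):

import s3fs

fs = s3fs.S3FileSystem(anon=True)

# split_path treats the first path component as the bucket name,
# even when the path does not actually start with a bucket
bucket, key, version = fs.split_path("channel1/subdir/testfile")
print(bucket)   # 'channel1'  <- actually a root-level directory, not a bucket
print(key)      # 'subdir/testfile'
print(version)  # None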

@martindurant
Member

I suppose we expect non-regionalised endpoints in this config; cache_regions=True makes specialised per-region clients for you on demand anyway. Is there any reason to specify the endpoint like this? We wouldn't be able to detect whether it's a per-bucket endpoint in the config and try to correct for whether the user provided a bucket-including path or not.

@RobinHolzingerQC
Author

@martindurant Thanks for the input, your remark about the non-regionalised endpoints led me to inspect my endpoint_url again (previously endpoint_url = f"https://{bucket_name}.s3.{region_name}.amazonaws.com/").

It seems that not including the bucket name (endpoint_url = f"https://s3.{region_name}.amazonaws.com/") while still using paths in the <bucket_name>/<path> format works! The regionalization of the URL is still needed, though.
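
Concretely, something like the following sketch now works for me (reusing the names from the setup above; endpoint_url is passed through client_kwargs, see the note on the docs below):

fs = s3fs.S3FileSystem(
    key=aws_access_key_id,
    secret=aws_secret_key,
    client_kwargs={
        # regionalised endpoint, but without the bucket name
        "endpoint_url": f"https://s3.{region_name}.amazonaws.com/",
        "region_name": region_name,
    },
)

# bucket information is encoded in the paths again, as in Example 1
src_path = f"{bucket_name}/{base_path}{dummyfile_name}"
dst_path = f"{bucket_name}/{base_path}{dummyfile_name}_dst"

fs.copy(src_path, dst_path)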

Overall I was a bit unlucky that the previous endpoint_url format (including the bucket_name), together with no bucket specification in the paths, happened to work for most of the endpoints without being officially supported.

The docs seem to be a bit outdated concerning endpoint_url, as I had to feed it in through client_kwargs. Apart from that, I'd be fine with closing this issue.

@martindurant
Member

The docs are talking about endpoints to non-AWS services, which don't seem to have this complication. If you want to add some clarifying text there about your situation, that would be appreciated!
