Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

s3fs 2024.3.0 fails reading glob patterns through pandas #862

Open
calloc opened this issue Mar 17, 2024 · 12 comments
Open

s3fs 2024.3.0 fails reading glob patterns through pandas #862

calloc opened this issue Mar 17, 2024 · 12 comments

Comments

@calloc
Copy link

calloc commented Mar 17, 2024

Intention is to read all CSV files under an s3 directory using Pandas via s3fs.
We specify path as s3://some-workspace/tmp/jr_1d02d47beccb8f7d2c2e8429bbaa3e143f90fcb7837373fb565abdade4a8c845_audit/*.csv

Reverting back to old version s3fs==2024.2.0 works. Note pandas==2.1.4.

[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venv-callsn9i4utg/script.py", line 48, in generate_partition_sql
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     df = pd.read_csv(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -          ^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     return _read(filepath_or_buffer, kwds)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     parser = TextFileReader(filepath_or_buffer, **kwds)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     self._engine = self._make_engine(f, self.engine)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     self.handles = get_handle(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -                    ^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/common.py", line 718, in get_handle
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     ioargs = _get_filepath_or_buffer(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -              ^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/home/***/.local/lib/python3.11/site-packages/pandas/io/common.py", line 420, in _get_filepath_or_buffer
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     ).open()
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -       ^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/core.py", line 135, in open
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     return self.__enter__()
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -            ^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/core.py", line 103, in __enter__
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     f = self.fs.open(self.path, mode=mode)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/spec.py", line 1293, in open
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     f = self._open(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -         ^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/s3fs/core.py", line 685, in _open
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     return S3File(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -            ^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/s3fs/core.py", line 2179, in __init__
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     super().__init__(
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/spec.py", line 1651, in __init__
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     self.size = self.details["size"]
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -                 ^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/spec.py", line 1664, in details
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     self._details = self.fs.info(self.path)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -                     ^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     return sync(self.loop, func, *args, **kwargs)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     raise return_result
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     result[0] = await coro
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -                 ^^^^^^^^^^
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -   File "/tmp/venvogubdfca/lib/python3.11/site-packages/s3fs/core.py", line 1418, in _info
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO -     raise FileNotFoundError(path)
[2024-03-17, 02:39:41 +07] {process_utils.py:190} INFO - FileNotFoundError: some-workspace/tmp/jr_1d02d47beccb8f7d2c2e8429bbaa3e143f90fcb7837373fb565abdade4a8c845_audit/*.csv
[2024-03-17, 02:39:41 +07] {taskinstance.py:2698} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
    result = execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/decorators/base.py", line 241, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 400, in execute
    return super().execute(context=serializable_context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 716, in execute_callable
    result = self._execute_python_callable_in_subprocess(python_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 471, in _execute_python_callable_in_subprocess
    raise AirflowException(error_msg) from None
airflow.exceptions.AirflowException: Process returned non-zero exit status 1.
some-workspace/tmp/jr_1d02d47beccb8f7d2c2e8429bbaa3e143f90fcb7837373fb565abdade4a8c845_audit/*.csv
@martindurant
Copy link
Member

This is a consequence of fsspec/filesystem_spec#1536 (cc @Skylion007 ), and the logic is: pandas only reads single files, but glob can return many. So with the previous version, we would simply pick the first match, but now we assume single paths are not to be expanded, as that would cause ambiguity. The glob() methods still functions as before, and open_files will expand (e.g., as used by dask's read_csv ).

@martindurant
Copy link
Member

@Skylion007 , do you think we should fall back to the old behaviour in the case that no file is found with the exact given path, and it contains special glob characters?

@Skylion007
Copy link

The old behavior can be manually specified by passing expand=True if that is actually the behavior you desire.

@martindurant
Copy link
Member

The old behavior can be manually specified by passing expand=True if that is actually the behavior you desire.

> pd.read_csv("gcs://mdtemp/*.csv", storage_options=dict(expand=True))
TypeError: fsspec.core.open_files() got multiple values for keyword argument 'expand'

@Skylion007
Copy link

Ugh... maybe allowing that arg to be overridden better would be a nicer place to start?

@martindurant
Copy link
Member

I don't think most users will want to add an extra kwarg to their calls (right, @calloc ?)

@Skylion007
Copy link

I am bit confused about the old behavior, wouldn't the previous glob pattern only return the first element if it went through fsspec.open(). fsspec.open shouldn't return a list.

@martindurant
Copy link
Member

fsspec.open returned fsspec.open_files(..)[0]

@martindurant
Copy link
Member

The following fixes passing expand=False

--- a/fsspec/core.py
+++ b/fsspec/core.py
@@ -456,6 +456,8 @@ def open(
     - For implementations in separate packages see
       https://filesystem-spec.readthedocs.io/en/latest/api.html#other-known-implementations
     """
+    kw = {"expand": False}
+    kw.update(kwargs)
     out = open_files(
         urlpath=[urlpath],
         mode=mode,
@@ -464,8 +466,7 @@ def open(
         errors=errors,
         protocol=protocol,
         newline=newline,
-        expand=False,
-        **kwargs,
+        **kw,
     )
     if not out:

but simply falling back if open(expand=False) fails does not work, because in the current implementation, open() always succeeds whether or not the target file exists

In [2]: fsspec.open("s3://completely/unknown")
Out[2]: <OpenFile 'completely/unknown'>

and only fails when this OpenFile is actually used. OTOH, we don't really want to check for every concrete filepath passed whether it exists or not, and assume the caller have reasonable parameters. So in the case that the path can be interpreted in multiple ways (i.e., has globp characters), we need special handling.

Thoughts, @Skylion007 ?

@martindurant
Copy link
Member

I fixed expand= in fsspec/filesystem_spec#1551 , which made release 2024.3.1, but I am still considering the right default value, depending on how many people were relying on the previous (accidental) behaviour.

@Skylion007
Copy link

@martindurant This does feel like a trap since it pandas csv will only read 1 glob file in this case, but i suppose supporting this use case is better than not doing so. i was thinking that we may want to refactor it a bit so expand could only be overriden in fsspec.open() if we want to keep the old erroring behavior moving forward (by using kwargs.setdefault or something similar),

@martindurant
Copy link
Member

Following my PR, changing the default is easy, although I suppose it could be a config option rather than/as well as a kwarg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants