
Ambiguous S3FileSystem.isfile behavior when an object exists with a "/" suffix #800

Open
ianliu opened this issue Sep 25, 2023 · 5 comments

Comments

@ianliu

ianliu commented Sep 25, 2023

I was debugging a problem with my pipeline, and I've reduced it to the following code snippet:

import boto3
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b"], "bar": [1,2]})
df.to_parquet("s3://my-bucket/dataset/", partition_cols=["foo"])
pd.read_parquet("s3://my-bucket/dataset/") # Works

# Now create an empty object with the dataset name:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="dataset/", Body=b"")
pd.read_parquet("s3://my-bucket/dataset/") # Throws FileNotFoundError
"""
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
    421 from pyarrow import PythonFile
    423 if not self.fs.isfile(path):
--> 424     raise FileNotFoundError(path)
    426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my-bucket/dataset/
"""

This fails because pyarrow calls S3FileSystem.find("my-bucket/dataset", withdirs=True, detail=True) to list all files in a partitioned parquet dataset, and s3fs lists s3://my-bucket/dataset/ as both a file and a directory, while S3FileSystem.isfile("s3://my-bucket/dataset/") returns False. So I think there is an ambiguity here, since S3 has no real concept of a directory.

Another example of the problem:

import boto3
import s3fs

# put_object is a client method, so create a client first:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="fake_dir/", Body=b"")
s3.put_object(Bucket="my-bucket", Key="fake_dir/file", Body=b"foo")

fs = s3fs.S3FileSystem()
fs.find("my-bucket/fake_dir", withdirs=True, detail=True)
result = {'my-bucket/fake_dir': {'Key': 'tmp/deleteme5',
  'Size': 0,
  'name': 'my-bucket/fake_dir',
  'StorageClass': 'DIRECTORY',
  'type': 'directory',
  'size': 0},
 'my-bucket/fake_dir/': {'Key': 'my-bucket/fake_dir/',
  'LastModified': datetime.datetime(2023, 9, 25, 13, 54, 48, tzinfo=tzutc()),
  'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
  'Size': 0,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 0,
  'name': 'my-bucket/fake_dir/'},
 'my-bucket/fake_dir/file': {'Key': 'my-bucket/fake_dir/file',
  'LastModified': datetime.datetime(2023, 9, 25, 13, 54, 54, tzinfo=tzutc()),
  'ETag': '"acbd18db4cc2f85cedef654fccc4a4d8"',
  'Size': 3,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 3,
  'name': 'my-bucket/fake_dir/file'}}

fs.isfile("my-bucket/fake_dir")  # False
fs.isfile("my-bucket/fake_dir/") # False
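The ambiguity can be modeled without touching AWS at all. With the keys {"fake_dir/", "fake_dir/file"} in a bucket, the path "fake_dir/" is simultaneously an object (a "file") and a prefix of other keys (a "directory"). A minimal sketch, with a hypothetical helper name that is not part of s3fs:

```python
def classify(path, keys):
    """Classify a path against a set of S3 object keys.

    Returns the set of roles the path plays: "file" if an object with
    exactly that key exists, and "directory" if any other object key
    uses it as a prefix (S3 itself has no real directories).
    """
    roles = set()
    if path in keys:
        roles.add("file")
    prefix = path.rstrip("/") + "/"
    if any(k.startswith(prefix) and k != path for k in keys):
        roles.add("directory")
    return roles

keys = {"fake_dir/", "fake_dir/file"}
print(classify("fake_dir", keys))   # {'directory'}
print(classify("fake_dir/", keys))  # {'file', 'directory'}
```

Once a library normalizes "fake_dir/" to "fake_dir" before doing this kind of lookup, the "file" role is lost, which is consistent with both isfile() calls above returning False.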
@martindurant
Member

I agree that isfile() should return True in this case. @ianthomas23 , yet another edge...

@ianthomas23
Contributor

If you spend your whole life using s3fs and never go near boto3 directly, this problem doesn't occur, because you cannot have both a "something/" key and a "something" key: the former is treated just like the latter. And our test suite mostly checks internal consistency within s3fs, so it doesn't cover cases like this.

Supporting a pre-existing "something/" as a file alongside "something" as a directory is, quite frankly, going to be a huge amount of work. Pretty much every fsspec or s3fs function calls _strip_protocol, which drops trailing slashes, and in the few places where the trailing slash matters it is used to indicate a directory rather than a file. Here we are looking at the opposite: without a trailing slash it is a directory, and with a trailing slash it is a file.
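To make the normalization point above concrete, here is a simplified model of the *effect* described (an assumption for illustration, not the actual fsspec implementation of _strip_protocol):

```python
def strip_protocol(path):
    # Simplified model of the normalization described above:
    # drop the protocol prefix and any trailing slashes.
    # (Assumption: this mirrors the effect, not the real fsspec code.)
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    return path.rstrip("/")

# Both spellings collapse to the same path before any lookup happens,
# so the zero-byte "dataset/" object and the "dataset" prefix become
# indistinguishable:
assert strip_protocol("s3://my-bucket/dataset/") == "my-bucket/dataset"
assert strip_protocol("s3://my-bucket/dataset") == "my-bucket/dataset"
```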

@ianliu
Author

ianliu commented Sep 26, 2023

I already "fixed" my pipeline by deleting all the fake directory objects. I'm now investigating how they were created in the first place; I couldn't find any code doing anything like this in my system. All we use is Pandas + PyArrow + s3fs. Maybe some combination of versions of those created the directory objects; I don't know. I will investigate further.

@ianthomas23 Feel free to close the issue if you think it's not worth fixing.
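For anyone hitting the same problem, the workaround above (deleting the placeholder objects) can be sketched with boto3. Bucket and prefix names here are illustrative, and zero-byte keys ending in "/" are treated as placeholders:

```python
def placeholder_keys(objects):
    """Return keys of zero-byte objects whose key ends in "/" — the
    "fake directory" placeholders discussed in this issue.  `objects`
    is a list of dicts with "Key" and "Size", as found under
    "Contents" in list_objects_v2 responses."""
    return [o["Key"] for o in objects
            if o["Key"].endswith("/") and o["Size"] == 0]

def delete_placeholders(bucket, prefix=""):
    """Delete every placeholder object under the given prefix."""
    import boto3  # deferred so placeholder_keys() is usable without AWS
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for key in placeholder_keys(page.get("Contents", [])):
            s3.delete_object(Bucket=bucket, Key=key)

# Example (illustrative names):
# delete_placeholders("my-bucket", "dataset/")
```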

@martindurant
Member

The main place I've seen these "placeholders" created is the AWS S3 console, via its button for creating a folder.

@ianliu
Author

ianliu commented Sep 26, 2023

In my case, there was a placeholder object for every directory in my partitioned parquet dataset, and the dataset root had one as well. This makes me think it was done programmatically in some way.
