
Ambiguous S3FileSystem.isfile behavior when an object exists with a "/" suffix #800

Open
ianliu opened this issue Sep 25, 2023 · 5 comments

Comments

@ianliu

ianliu commented Sep 25, 2023

I was debugging a problem with my pipeline, and I've reduced it to the following code snippet:

import boto3
import pandas as pd

df = pd.DataFrame({"foo": ["a", "b"], "bar": [1,2]})
df.to_parquet("s3://my-bucket/dataset/", partition_cols=["foo"])
pd.read_parquet("s3://my-bucket/dataset/") # Works

# Now create an empty object with the dataset name:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="dataset/", Body=b"")
pd.read_parquet("s3://my-bucket/dataset/") # Throws FileNotFoundError
"""
File /nix/store/avq9131sdfjzn14fsilqgb0x3b76s6k7-python3-3.11.4-env/lib/python3.11/site-packages/pyarrow/fs.py:424, in FSSpecHandler.open_input_file(self, path)
    421 from pyarrow import PythonFile
    423 if not self.fs.isfile(path):
--> 424     raise FileNotFoundError(path)
    426 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: my-bucket/dataset/
"""

This fails because pyarrow calls S3FileSystem.find("my-bucket/dataset", withdirs=True, detail=True) to list all files in a partitioned parquet dataset, and s3fs lists s3://my-bucket/dataset/ as both a file and a directory, while S3FileSystem.isfile("s3://my-bucket/dataset/") returns False. So I think there is an ambiguity here, since S3 has no real concept of a directory.

Another example of the problem:

import boto3
import s3fs

# put_object is a client method, so create a client first:
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="fake_dir/", Body=b"")
s3.put_object(Bucket="my-bucket", Key="fake_dir/file", Body=b"foo")

fs = s3fs.S3FileSystem()
fs.find("my-bucket/fake_dir", withdirs=True, detail=True)
result = {'my-bucket/fake_dir': {'Key': 'tmp/deleteme5',
  'Size': 0,
  'name': 'my-bucket/fake_dir',
  'StorageClass': 'DIRECTORY',
  'type': 'directory',
  'size': 0},
 'my-bucket/fake_dir/': {'Key': 'my-bucket/fake_dir/',
  'LastModified': datetime.datetime(2023, 9, 25, 13, 54, 48, tzinfo=tzutc()),
  'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
  'Size': 0,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 0,
  'name': 'my-bucket/fake_dir/'},
 'my-bucket/fake_dir/file': {'Key': 'my-bucket/fake_dir/file',
  'LastModified': datetime.datetime(2023, 9, 25, 13, 54, 54, tzinfo=tzutc()),
  'ETag': '"acbd18db4cc2f85cedef654fccc4a4d8"',
  'Size': 3,
  'StorageClass': 'STANDARD',
  'type': 'file',
  'size': 3,
  'name': 'my-bucket/fake_dir/file'}}

fs.isfile("my-bucket/fake_dir")  # False
fs.isfile("my-bucket/fake_dir/") # False
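The ambiguity can be modeled without touching AWS at all. With the keys {"fake_dir/", "fake_dir/file"} in a bucket, the path "fake_dir/" is simultaneously an object (a "file") and a prefix of other keys (a "directory"). A minimal sketch, with a hypothetical helper name that is not part of s3fs:

```python
def classify(path, keys):
    """Classify a path against a set of S3 object keys.

    Returns the set of roles the path plays: "file" if an object with
    exactly that key exists, and "directory" if any other object key
    uses it as a prefix (S3 itself has no real directories).
    """
    roles = set()
    if path in keys:
        roles.add("file")
    prefix = path.rstrip("/") + "/"
    if any(k.startswith(prefix) and k != path for k in keys):
        roles.add("directory")
    return roles

keys = {"fake_dir/", "fake_dir/file"}
print(classify("fake_dir", keys))   # {'directory'}
print(classify("fake_dir/", keys))  # {'file', 'directory'}
```

Once a library normalizes "fake_dir/" to "fake_dir" before doing this kind of lookup, the "file" role is lost, which is consistent with both isfile() calls above returning False.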
@martindurant
Member

I agree that isfile() should return True in this case. @ianthomas23 , yet another edge...

@ianthomas23
Contributor

If you spend your whole life using s3fs and never go near boto3 directly, this problem doesn't occur, because you cannot have both a "something/" key and a "something" key: the former is treated just like the latter. And our test suite mostly checks internal consistency within s3fs, so it doesn't cover cases like this.

Supporting a pre-existing "something/" as a file alongside "something" as a directory is, quite frankly, going to be a huge amount of work. Pretty much every fsspec or s3fs function calls _strip_protocol, which drops trailing slashes, and in the few places where the trailing slash matters it is used to indicate a directory rather than a file. Here we are looking at the opposite: without a trailing slash it is a directory, and with a trailing slash it is a file.
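To make the normalization point above concrete, here is a simplified model of the *effect* described (an assumption for illustration, not the actual fsspec implementation of _strip_protocol):

```python
def strip_protocol(path):
    # Simplified model of the normalization described above:
    # drop the protocol prefix and any trailing slashes.
    # (Assumption: this mirrors the effect, not the real fsspec code.)
    if path.startswith("s3://"):
        path = path[len("s3://"):]
    return path.rstrip("/")

# Both spellings collapse to the same path before any lookup happens,
# so the zero-byte "dataset/" object and the "dataset" prefix become
# indistinguishable:
assert strip_protocol("s3://my-bucket/dataset/") == "my-bucket/dataset"
assert strip_protocol("s3://my-bucket/dataset") == "my-bucket/dataset"
```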

@ianliu
Author

ianliu commented Sep 26, 2023

I already "fixed" my pipeline by deleting all the fake directory objects. I'm now investigating how they were created in the first place; I couldn't find any code doing anything like this in my system. All we use is Pandas + PyArrow + s3fs. Maybe some combination of versions of those created the directory objects; I don't know. I will investigate further.

@ianthomas23 Feel free to close the issue if you think it's not worth fixing.
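For anyone hitting the same problem, the workaround above (deleting the placeholder objects) can be sketched with boto3. Bucket and prefix names here are illustrative, and zero-byte keys ending in "/" are treated as placeholders:

```python
def placeholder_keys(objects):
    """Return keys of zero-byte objects whose key ends in "/" — the
    "fake directory" placeholders discussed in this issue.  `objects`
    is a list of dicts with "Key" and "Size", as found under
    "Contents" in list_objects_v2 responses."""
    return [o["Key"] for o in objects
            if o["Key"].endswith("/") and o["Size"] == 0]

def delete_placeholders(bucket, prefix=""):
    """Delete every placeholder object under the given prefix."""
    import boto3  # deferred so placeholder_keys() is usable without AWS
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for key in placeholder_keys(page.get("Contents", [])):
            s3.delete_object(Bucket=bucket, Key=key)

# Example (illustrative names):
# delete_placeholders("my-bucket", "dataset/")
```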

@martindurant
Member

The main place I've seen these "placeholders" created is the AWS S3 console, via its button for creating a folder.

@ianliu
Author

ianliu commented Sep 26, 2023

In my case, there was a placeholder object for every directory in my partitioned parquet dataset, and the dataset root had one as well. This makes me think it was done programmatically in some way.
