Force nightly pyarrow in the upstream build #8993
Conversation
It seems the test is hanging after …

That is a pyarrow-related test, so it might actually be identifying an issue with the latest dask / pyarrow combo.
So I can reproduce this locally:
The test hangs with the above output, and I can't even interrupt it (with Ctrl-C) but have to close the terminal.
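As an aside, when a process hangs and ignores Ctrl-C (typically because it is stuck inside non-interruptible C extension code), the stdlib `faulthandler` module can dump every thread's traceback after a timeout. This is a general debugging sketch, not what was actually used here:

```python
import faulthandler
import tempfile
import time

# Arm a watchdog that dumps all thread tracebacks to `f` if the process
# is still running after 1 second, then simulate a "hang" that outlives
# the timeout. In a real debugging session you would point this at
# sys.stderr and put the hanging call where the sleep is.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback_later(1, file=f)
    time.sleep(1.5)  # stand-in for the hanging call
    faulthandler.cancel_dump_traceback_later()
    f.seek(0)
    report = f.read()

print("Timeout" in report)  # the dump header starts with "Timeout (...)"
```

Because the dump happens from a separate watchdog thread, it works even when the main thread can no longer respond to signals.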
From some debugging locally, it seems that inspecting a parquet file (with `pyarrow.dataset`) is what hangs.

A reproducible test for dask (using the moto-server-based fixtures):

```python
def test_parquet_hangs(s3, s3so):
    import s3fs

    dd = pytest.importorskip("dask.dataframe")
    pd = pytest.importorskip("pandas")
    np = pytest.importorskip("numpy")
    pytest.importorskip("pyarrow")

    url = "s3://%s/test.parquet" % test_bucket_name
    data = pd.DataFrame({"col": np.arange(1000, dtype=np.int64)})
    df = dd.from_pandas(data, chunksize=500)
    df.to_parquet(url, engine="pyarrow", storage_options=s3so)

    # get the fsspec filesystem
    from fsspec.core import get_fs_token_paths

    fs, _, paths = get_fs_token_paths(url, mode="rb", storage_options=s3so)

    # inspecting the file with pyarrow.dataset hangs
    import pyarrow.dataset as ds
    from pyarrow.fs import _ensure_filesystem

    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fs)
    format.inspect(paths[0] + "/part.0.parquet", filesystem)
```

A reproducible test for pyarrow (using the MinIO-server-based fixtures):

```python
@pytest.mark.parquet
@pytest.mark.s3
def test_parquet_inspect_hangs_s3(s3_server):
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq
    from pyarrow.fs import S3FileSystem, _ensure_filesystem

    host, port, access_key, secret_key = s3_server['connection']

    # create bucket + file with pyarrow
    fs = S3FileSystem(
        access_key=access_key,
        secret_key=secret_key,
        endpoint_override='{}:{}'.format(host, port),
        scheme='http'
    )
    fs.create_dir("mybucket")
    table = pa.table({'a': [1, 2, 3]})
    path = "mybucket/data.parquet"
    with fs.open_output_stream(path) as out:
        pq.write_table(table, out)

    # read using the fsspec filesystem
    import s3fs

    fsspec_fs = s3fs.S3FileSystem(
        key=access_key,
        secret=secret_key,
        client_kwargs={"endpoint_url": f"http://{host}:{port}"},
    )
    assert fsspec_fs.ls("mybucket") == ['mybucket/data.parquet']

    # inspecting via the dataset file format hangs
    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fsspec_fs)
    schema = format.inspect(path, filesystem)
    assert schema.equals(table.schema)
```
This seems to be a bug on the pyarrow side; I opened https://issues.apache.org/jira/browse/ARROW-16413
Thanks @jorisvandenbossche for updating the CI environment and debugging this issue. Should we temporarily skip the hanging test and merge this PR in?
I have a PR open to fix this (apache/arrow#13033), so we can probably wait with merging this PR until it is fixed. But in the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.
OK, the test build is now finishing. There are still some failures because of a deprecation warning. So that means that the pyarrow.dataset engine is still using the legacy ParquetDataset API in some place (xref #8243). cc @rjzamora
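A single deprecation warning can fail a build because CI configurations commonly promote warnings to errors. A stdlib sketch of that mechanism (illustrative only; `legacy_api` is a made-up stand-in, and this is not dask's actual pytest configuration):

```python
import warnings


def legacy_api():
    # Hypothetical deprecated function, standing in for the legacy
    # ParquetDataset code path.
    warnings.warn("this API is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


# By default (or with an "always" filter) the warning is merely recorded...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_api()
print(len(caught))

# ...but with an "error" filter, as a strict CI setup applies, it raises.
try:
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        legacy_api()
    raised = False
except DeprecationWarning:
    raised = True
print(raised)
```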
Ah, it seems this is only done in a helper function defined in the tests itself: dask/dask/dataframe/io/tests/test_parquet.py, lines 1735 to 1746 (at 4d6a5f0)
That should be possible to rewrite to use …
Nice catch, I'll push a fix up for this.
Thanks Joris! We might have had a user run into this issue last week (we never determined whether it was pyarrow's or fsspec's fault). Hopefully this fixes their problem too 🤞.
Hopefully you didn't start on that yet, as I already included a commit here as well
I didn't, thanks for letting me know :) |
LGTM!
Thanks @jorisvandenbossche!
> we can probably wait with merging this PR until it is fixed
So after removing the skips and rerunning CI (say tomorrow, once there is a new nightly), this should be good to go.
It's still picking up yesterday's nightly package. We had a failure that caused a few packages not to be uploaded, among which the Linux one for py3.9. So we will have to retry tomorrow.
This is finally passing now!
Hooray -- thanks @jorisvandenbossche!
Similar to #8281, it's still not fully clear why it is not automatically picking up the most recent version.