
Force nightly pyarrow in the upstream build #8993

Merged
8 commits merged into dask:main on May 10, 2022

Conversation

@jorisvandenbossche (Member)

Similar to #8281, it's still not fully clear why the build is not automatically picking up the most recent version.
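
For context, the Arrow project publishes nightly pyarrow wheels to a separate package index. Forcing the nightly via pip looks roughly like the command below (from the Arrow install docs); whether this PR pins it this way or through the conda nightly channel isn't shown in this excerpt:

$ pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --prefer-binary --pre pyarrow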

@jorisvandenbossche (Member, Author)

It seems the test suite is hanging after test_modification_time_read_bytes. If I look at the log output on main, where it succeeds, the next test is:

dask/bytes/tests/test_s3.py::test_modification_time_read_bytes PASSED    [ 41%]
dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow] PASSED           [ 41%]

That next test is pyarrow-related, so this might actually be identifying an issue with the latest dask / pyarrow combination.

@jorisvandenbossche (Member, Author)

So I can reproduce this locally:

$ pytest dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow] -vvv -s
...
dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow]  * Running on http://127.0.0.1:5555 (Press CTRL+C to quit)
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test/accounts.1.json HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test/accounts.2.json HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/part.0.parquet HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/part.1.parquet HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/_common_metadata HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/_metadata HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET /test?list-type=2&prefix=test.parquet%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET /test?list-type=2&prefix=test.parquet%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 -

The test hangs with the above output, and I can't even interrupt it with Ctrl-C; I have to close the terminal.

@jorisvandenbossche (Member, Author)

From some local debugging, it seems that inspecting a parquet file (pyarrow.dataset.ParquetFileFormat.inspect()) is what hangs when it uses an s3fs filesystem.

A reproducible test for dask (using the moto server based fixtures in dask/bytes/tests/test_s3.py):

def test_parquet_hangs(s3, s3so):
    pytest.importorskip("s3fs")

    dd = pytest.importorskip("dask.dataframe")
    pd = pytest.importorskip("pandas")
    np = pytest.importorskip("numpy")
    pytest.importorskip("pyarrow")

    url = "s3://%s/test.parquet" % test_bucket_name

    data = pd.DataFrame({"col": np.arange(1000, dtype=np.int64)})
    df = dd.from_pandas(data, chunksize=500)
    df.to_parquet(url, engine="pyarrow", storage_options=s3so)

    # get fsspec filesystem
    from fsspec.core import get_fs_token_paths
    fs, _, paths = get_fs_token_paths(url, mode="rb", storage_options=s3so)

    # inspecting file with pyarrow.dataset hangs
    import pyarrow.dataset as ds
    format = ds.ParquetFileFormat()
    from pyarrow.fs import _ensure_filesystem
    filesystem = _ensure_filesystem(fs)
    format.inspect(paths[0] + "/part.0.parquet", filesystem)  # this call hangs
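
As an aside, _ensure_filesystem is a private pyarrow helper; for an fsspec filesystem it is roughly equivalent to the public wrapping shown below, which may be useful when reproducing this outside the dask test suite:

# a sketch of the public equivalent of _ensure_filesystem(fs), where
# `fs` is the s3fs instance returned by get_fs_token_paths above
from pyarrow.fs import PyFileSystem, FSSpecHandler

filesystem = PyFileSystem(FSSpecHandler(fs))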

A reproducible test for pyarrow (using the MinIO server based fixtures in pyarrow/tests/test_dataset.py):

@pytest.mark.parquet
@pytest.mark.s3
def test_parquet_inspect_hangs_s3(s3_server):
    from pyarrow.fs import S3FileSystem, _ensure_filesystem
    import pyarrow as pa  # module-level imports in pyarrow's test suite; repeated here for self-containment
    import pyarrow.parquet as pq
    import pyarrow.dataset as ds

    host, port, access_key, secret_key = s3_server['connection']
    
    # create bucket + file with pyarrow
    fs = S3FileSystem(
        access_key=access_key,
        secret_key=secret_key,
        endpoint_override='{}:{}'.format(host, port),
        scheme='http'
    )
    fs.create_dir("mybucket")
    table = pa.table({'a': [1, 2, 3]})
    path = "mybucket/data.parquet"
    with fs.open_output_stream(path) as out:
        pq.write_table(table, out)

    # read using fsspec filesystem
    import s3fs
    fsspec_fs = s3fs.S3FileSystem(
        key=access_key, secret=secret_key, client_kwargs={"endpoint_url": f"http://{host}:{port}"}
    )
    assert fsspec_fs.ls("mybucket") == ['mybucket/data.parquet']

    # using dataset file format
    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fsspec_fs)
    schema = format.inspect(path, filesystem)
    assert schema.equals(table.schema)
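
For reference, running this repro from a pyarrow checkout requires the S3 test group to be enabled and a minio binary available for the s3_server fixture; if I remember pyarrow's conftest options correctly, that is:

$ pytest pyarrow/tests/test_dataset.py -k test_parquet_inspect_hangs_s3 --enable-s3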

@jorisvandenbossche (Member, Author)

This seems to be a bug on the pyarrow side; I opened https://issues.apache.org/jira/browse/ARROW-16413.

@jrbourbeau (Member) left a comment:

Thanks @jorisvandenbossche for updating the CI environment and debugging this issue. Should we temporarily skip the hanging test and merge this PR in?

@jorisvandenbossche (Member, Author)

I have a PR open to fix this (apache/arrow#13033), so we can probably hold off on merging this PR until that is fixed. But in the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.
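
For illustration, a temporary skip of that sort could look like the sketch below; the prerelease guard and test name are hypothetical, not the actual diff in this PR:

import pytest
import pyarrow as pa
from packaging.version import parse as parse_version

# hypothetical guard: skip on pyarrow nightlies until ARROW-16413 is fixed
PYARROW_NIGHTLY = parse_version(pa.__version__).is_prerelease

@pytest.mark.skipif(
    PYARROW_NIGHTLY,
    reason="pyarrow nightly hangs inspecting parquet over s3fs (ARROW-16413)",
)
def test_parquet_roundtrip_s3():
    ...  # body elided; the real skips live in dask/bytes/tests/test_s3.py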

@jorisvandenbossche (Member, Author)

OK, the test build is now finishing. There are still some failures, caused by a deprecation warning for the ParquetDataset.metadata attribute that gets turned into an error when running the tests.

So that means the pyarrow.dataset engine is still using the legacy ParquetDataset API in some places (xref #8243). cc @rjzamora
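
(The deprecation only fails the build because the upstream CI job escalates warnings to errors; a minimal illustration of that mechanism, not dask's actual configuration:)

import warnings

# with warnings escalated to errors, a DeprecationWarning is raised as an
# exception instead of merely being printed
warnings.simplefilter("error", DeprecationWarning)
warnings.warn("ParquetDataset.metadata is deprecated", DeprecationWarning)  # raises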

@jorisvandenbossche (Member, Author)

Ah, it seems this is only done in a helper function defined in the tests themselves:

def check_compression(engine, filename, compression):
    if engine == "fastparquet":
        pf = fastparquet.ParquetFile(filename)
        md = pf.fmd.row_groups[0].columns[0].meta_data
        if compression is None:
            assert md.total_compressed_size == md.total_uncompressed_size
        else:
            assert md.total_compressed_size != md.total_uncompressed_size
    else:
        metadata = pa.parquet.ParquetDataset(filename).metadata
        names = metadata.schema.names
        for i in range(metadata.num_row_groups):

That should be possible to rewrite to use pq.read_metadata instead.
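
A sketch of that rewrite, assuming (as dask's to_parquet writes by default) that the dataset directory contains a _metadata footer file; the loop body mirrors the fastparquet branch above:

import os

import pyarrow.parquet as pq


def check_compression_pyarrow(filename, compression):
    # read the dataset-level footer directly instead of constructing a
    # legacy ParquetDataset just to access its deprecated .metadata
    metadata = pq.read_metadata(os.path.join(filename, "_metadata"))
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            if compression is None:
                assert column.total_compressed_size == column.total_uncompressed_size
            else:
                assert column.total_compressed_size != column.total_uncompressed_size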

@jcrist (Member) commented on May 3, 2022

> Ah, it seems this is only done in a helper function defined in the tests themselves:

Nice catch, I'll push a fix up for this.

> I have a PR open to fix this (apache/arrow#13033), so we can probably hold off on merging this PR until that is fixed. But in the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.

Thanks Joris! We might have had a user run into this issue last week (we never determined whether it was pyarrow's or fsspec's fault). Hopefully this fixes their problem too 🤞.

@jorisvandenbossche (Member, Author)

> Nice catch, I'll push a fix up for this.

Hopefully you didn't start on that yet, as I already included a commit here as well.

@jcrist (Member) commented on May 3, 2022

I didn't, thanks for letting me know :)

@jcrist (Member) left a comment:

LGTM!

@jrbourbeau (Member) left a comment:

Thanks @jorisvandenbossche!

> we can probably hold off on merging this PR until that is fixed

So after removing the skips and rerunning CI once there is a new nightly (say, tomorrow), this should be good to go.

(A review comment on dask/dataframe/io/tests/test_parquet.py was marked outdated and resolved.)
@jorisvandenbossche (Member, Author)

It's still picking up yesterday's nightly package. We had a failure that caused a few packages not to be uploaded, among them the Linux one for py3.9, so I will have to retry tomorrow.

@jorisvandenbossche (Member, Author)

This is finally passing now!

@jrbourbeau changed the title from "CI: force nightly pyarrow in the upstream build" to "Force nightly pyarrow in the upstream build" on May 10, 2022.
@jrbourbeau (Member) left a comment:

Hooray -- thanks @jorisvandenbossche!

@jrbourbeau merged commit d652c53 into dask:main on May 10, 2022.