
Force nightly pyarrow in the upstream build #8993

Merged
8 commits merged into dask:main on May 10, 2022

Conversation

@jorisvandenbossche (Member)

Similar to #8281, it's still not fully clear why the build is not automatically picking up the most recent version.
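
For context, the Arrow project publishes nightly pyarrow wheels to a separate package index. Forcing the nightly via pip looks roughly like the command below (from the Arrow install docs); whether this PR pins it this way or through the conda nightly channel isn't shown in this excerpt:

$ pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ --prefer-binary --pre pyarrow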

@jorisvandenbossche (Member, Author)

It seems the test suite is hanging after test_modification_time_read_bytes. If I look at the log output on main, where it succeeds, the next test is:

dask/bytes/tests/test_s3.py::test_modification_time_read_bytes PASSED    [ 41%]
dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow] PASSED           [ 41%]

That next test is pyarrow-related, so this might actually be identifying an issue with the latest dask / pyarrow combination.

@jorisvandenbossche (Member, Author)

So I can reproduce this locally:

$ pytest dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow] -vvv -s
...
dask/bytes/tests/test_s3.py::test_parquet[True-pyarrow]  * Running on http://127.0.0.1:5555 (Press CTRL+C to quit)
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test/accounts.1.json HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test/accounts.2.json HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/part.0.parquet HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/part.1.parquet HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/_common_metadata HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "PUT /test/test.parquet/_metadata HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET /test?list-type=2&prefix=test.parquet%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 -
127.0.0.1 - - [29/Apr/2022 10:42:19] "GET /test?list-type=2&prefix=test.parquet%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 -

The test hangs with the above output, and I can't even interrupt it with Ctrl-C; I have to close the terminal.

@jorisvandenbossche (Member, Author)

From some local debugging, it seems that inspecting a parquet file (pyarrow.dataset.ParquetFileFormat.inspect()) is what hangs when it uses an s3fs filesystem.

A reproducible test for dask (using the moto server based fixtures in dask/bytes/tests/test_s3.py):

def test_parquet_hangs(s3, s3so):
    pytest.importorskip("s3fs")

    dd = pytest.importorskip("dask.dataframe")
    pd = pytest.importorskip("pandas")
    np = pytest.importorskip("numpy")
    pytest.importorskip("pyarrow")

    url = "s3://%s/test.parquet" % test_bucket_name

    data = pd.DataFrame({"col": np.arange(1000, dtype=np.int64)})
    df = dd.from_pandas(data, chunksize=500)
    df.to_parquet(url, engine="pyarrow", storage_options=s3so)

    # get fsspec filesystem
    from fsspec.core import get_fs_token_paths
    fs, _, paths = get_fs_token_paths(url, mode="rb", storage_options=s3so)

    # inspecting file with pyarrow.dataset hangs
    import pyarrow.dataset as ds
    format = ds.ParquetFileFormat()
    from pyarrow.fs import _ensure_filesystem
    filesystem = _ensure_filesystem(fs)
    format.inspect(paths[0] + "/part.0.parquet", filesystem)  # this call hangs
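
As an aside, _ensure_filesystem is a private pyarrow helper; for an fsspec filesystem it is roughly equivalent to the public wrapping shown below, which may be useful when reproducing this outside the dask test suite:

# a sketch of the public equivalent of _ensure_filesystem(fs), where
# `fs` is the s3fs instance returned by get_fs_token_paths above
from pyarrow.fs import PyFileSystem, FSSpecHandler

filesystem = PyFileSystem(FSSpecHandler(fs))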

A reproducible test for pyarrow (using the MinIO server based fixtures in pyarrow/tests/test_dataset.py):

@pytest.mark.parquet
@pytest.mark.s3
def test_parquet_inspect_hangs_s3(s3_server):
    from pyarrow.fs import S3FileSystem, _ensure_filesystem
    import pyarrow as pa  # module-level imports in pyarrow's test suite; repeated here for self-containment
    import pyarrow.parquet as pq
    import pyarrow.dataset as ds

    host, port, access_key, secret_key = s3_server['connection']
    
    # create bucket + file with pyarrow
    fs = S3FileSystem(
        access_key=access_key,
        secret_key=secret_key,
        endpoint_override='{}:{}'.format(host, port),
        scheme='http'
    )
    fs.create_dir("mybucket")
    table = pa.table({'a': [1, 2, 3]})
    path = "mybucket/data.parquet"
    with fs.open_output_stream(path) as out:
        pq.write_table(table, out)

    # read using fsspec filesystem
    import s3fs
    fsspec_fs = s3fs.S3FileSystem(
        key=access_key, secret=secret_key, client_kwargs={"endpoint_url": f"http://{host}:{port}"}
    )
    assert fsspec_fs.ls("mybucket") == ['mybucket/data.parquet']

    # using dataset file format
    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fsspec_fs)
    schema = format.inspect(path, filesystem)
    assert schema.equals(table.schema)
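
For reference, running this repro from a pyarrow checkout requires the S3 test group to be enabled and a minio binary available for the s3_server fixture; if I remember pyarrow's conftest options correctly, that is:

$ pytest pyarrow/tests/test_dataset.py -k test_parquet_inspect_hangs_s3 --enable-s3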

@jorisvandenbossche (Member, Author)

This seems to be a bug on the pyarrow side; I opened https://issues.apache.org/jira/browse/ARROW-16413.

@jrbourbeau (Member) left a comment:

Thanks @jorisvandenbossche for updating the CI environment and debugging this issue. Should we temporarily skip the hanging test and merge this PR in?

@jorisvandenbossche (Member, Author)

I have a PR open to fix this (apache/arrow#13033), so we can probably hold off on merging this PR until that is fixed. But in the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.
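
For illustration, a temporary skip of that sort could look like the sketch below; the prerelease guard and test name are hypothetical, not the actual diff in this PR:

import pytest
import pyarrow as pa
from packaging.version import parse as parse_version

# hypothetical guard: skip on pyarrow nightlies until ARROW-16413 is fixed
PYARROW_NIGHTLY = parse_version(pa.__version__).is_prerelease

@pytest.mark.skipif(
    PYARROW_NIGHTLY,
    reason="pyarrow nightly hangs inspecting parquet over s3fs (ARROW-16413)",
)
def test_parquet_roundtrip_s3():
    ...  # body elided; the real skips live in dask/bytes/tests/test_s3.py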

@jorisvandenbossche (Member, Author)

OK, the test build is now finishing. There are still some failures, caused by a deprecation warning for the ParquetDataset.metadata attribute that gets turned into an error when running the tests.

So that means the pyarrow.dataset engine is still using the legacy ParquetDataset API in some places (xref #8243). cc @rjzamora
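
(The deprecation only fails the build because the upstream CI job escalates warnings to errors; a minimal illustration of that mechanism, not dask's actual configuration:)

import warnings

# with warnings escalated to errors, a DeprecationWarning is raised as an
# exception instead of merely being printed
warnings.simplefilter("error", DeprecationWarning)
warnings.warn("ParquetDataset.metadata is deprecated", DeprecationWarning)  # raises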

@jorisvandenbossche (Member, Author)

Ah, it seems this is only done in a helper function defined in the tests themselves:

def check_compression(engine, filename, compression):
    if engine == "fastparquet":
        pf = fastparquet.ParquetFile(filename)
        md = pf.fmd.row_groups[0].columns[0].meta_data
        if compression is None:
            assert md.total_compressed_size == md.total_uncompressed_size
        else:
            assert md.total_compressed_size != md.total_uncompressed_size
    else:
        metadata = pa.parquet.ParquetDataset(filename).metadata
        names = metadata.schema.names
        for i in range(metadata.num_row_groups):

That should be possible to rewrite to use pq.read_metadata instead.
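
A sketch of that rewrite, assuming (as dask's to_parquet writes by default) that the dataset directory contains a _metadata footer file; the loop body mirrors the fastparquet branch above:

import os

import pyarrow.parquet as pq


def check_compression_pyarrow(filename, compression):
    # read the dataset-level footer directly instead of constructing a
    # legacy ParquetDataset just to access its deprecated .metadata
    metadata = pq.read_metadata(os.path.join(filename, "_metadata"))
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            if compression is None:
                assert column.total_compressed_size == column.total_uncompressed_size
            else:
                assert column.total_compressed_size != column.total_uncompressed_size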

@jcrist (Member) commented on May 3, 2022

> Ah, it seems this is only done in a helper function defined in the tests themselves:

Nice catch, I'll push a fix up for this.

> I have a PR open to fix this (apache/arrow#13033), so we can probably hold off on merging this PR until that is fixed. But in the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.

Thanks Joris! We might have had a user run into this issue last week (we never determined whether it was pyarrow's or fsspec's fault). Hopefully this fixes their problem too 🤞.

@jorisvandenbossche (Member, Author)

> Nice catch, I'll push a fix up for this.

Hopefully you didn't start on that yet, as I already included a commit here as well.

@jcrist (Member) commented on May 3, 2022

I didn't, thanks for letting me know :)

@jcrist (Member) left a comment:

LGTM!

@jrbourbeau (Member) left a comment:

Thanks @jorisvandenbossche!

> we can probably hold off on merging this PR until that is fixed

So after removing the skips and rerunning CI once there is a new nightly (say, tomorrow), this should be good to go.

(A review comment on dask/dataframe/io/tests/test_parquet.py was marked outdated and resolved.)
@jorisvandenbossche (Member, Author)

It's still picking up yesterday's nightly package. We had a failure that caused a few packages not to be uploaded, among them the Linux one for py3.9, so I will have to retry tomorrow.

@jorisvandenbossche (Member, Author)

This is finally passing now!

@jrbourbeau changed the title from "CI: force nightly pyarrow in the upstream build" to "Force nightly pyarrow in the upstream build" on May 10, 2022.
@jrbourbeau (Member) left a comment:

Hooray -- thanks @jorisvandenbossche!

@jrbourbeau merged commit d652c53 into dask:main on May 10, 2022.