Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Dask Arrow-based strings cause test failures #335

Open
jakirkham opened this issue Aug 3, 2023 · 13 comments
Open

New Dask Arrow-based strings cause test failures #335

jakirkham opened this issue Aug 3, 2023 · 13 comments

Comments

@jakirkham
Copy link
Member

jakirkham commented Aug 3, 2023

(Edited by @m-albert)

In the presence of pyarrow, dask by default assumes dataframes of type object to be pyarrow strings (see dask/dask#10139 (comment)).

This creates problems revealed by failing tests (e.g. test_dask_image/test_ndmeasure/test_find_objects.py::test_3d_find_objects)

meta = dd.utils.make_meta([(i, object) for i in range(ndim)])
if isinstance(df1, Delayed):
df1 = dd.from_delayed(df1, meta=meta)

dd.from_delayed(df1, meta=meta).compute().dtypes

Working install:

0 object
1 object
2 object
dtype: object

Failing install:

0 string[pyarrow]
1 string[pyarrow]
2 string[pyarrow]
dtype: object

The failing test had come up when releasing v2023.08.0 in conda-forge/dask-image-feedstock#14.

@jakirkham found that pyarrow is installed with the conda distribution of dask, but not when installing over pip, where it just part of the [complete] target.

Also @jakirkham found that the above described conflicting behaviour can be turned off using the dask configuration.

He did this for the tests performed by the dask-image conda feedstock on v2023.08.0.

@jakirkham
Copy link
Member Author

Please fill free to edit and fill this issue out Marvin 🙂

Just needed a placeholder for tracking

@jakirkham jakirkham changed the title New Arrow strings cause test failures New Dask Arrow-based strings cause test failures Aug 3, 2023
@jakirkham
Copy link
Member Author

Would also be good to make a note in issue (where feedback is being collected): dask/dask#10139

Ideally with a simple reproducer

@m-albert
Copy link
Collaborator

m-albert commented Aug 3, 2023

Despite the passing tests, potentially users who installed dask-image over conda would still experience the above described problem when using ndmeasure.find_objects (need to check whether there's more affected).

Perhaps a suitable fix for this on the dask-image side would be to set dask.config.set({"dataframe.convert-string": False}) using a context manager around the affected functionality? See this recommendation.

Edit:
Confirmed that using with dask_config.set({'dataframe.convert-string': False}): around

bag = db.from_sequence(arrays)
result = bag.fold(functools.partial(_find_objects, label_image.ndim), split_every=2).to_delayed()
meta = dd.utils.make_meta([(i, object) for i in range(label_image.ndim)])
result = delayed(compute)(result)[0] # avoid the user having to call compute twice on result
result = dd.from_delayed(result, meta=meta, prefix="find-objects-", verify_meta=False)

and
meta = dd.utils.make_meta([(i, object) for i in range(ndim)])
if isinstance(df1, Delayed):
df1 = dd.from_delayed(df1, meta=meta)
if isinstance(df2, Delayed):
df2 = dd.from_delayed(df2, meta=meta)
ddf = dd.merge(df1, df2, how="outer", left_index=True, right_index=True)
result = ddf.apply(_merge_bounding_boxes, ndim=ndim, axis=1, meta=meta)

fixes the errors.

@GenevieveBuckley
Copy link
Collaborator

Would also be good to make a note in issue (where feedback is being collected): dask/dask#10139

Ideally with a simple reproducer

I've made a comment here, but no reproducer (I'm not planning to do more work on this, it's open for anyone who wants it)

@jakirkham
Copy link
Member Author

Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

Tbh I've not looked deeply into the Dask Arrow work. Have heard about it mainly in passing. So not sure how dask.config handles it

Should add this pain point is not unique to us. We had to disable this feature in Dask-SQL recently as well ( dask-contrib/dask-sql#1206 ). Unclear whether this is due to upstream bugs or if we need to make changes

@GenevieveBuckley
Copy link
Collaborator

Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

We could add an "upstream" CI environment, that just uses whatever the latest (or even pre-release?) versions are, maybe?

@jakirkham
Copy link
Member Author

There are Dask nightly packages. So that would be easy to add

@m-albert
Copy link
Collaborator

m-albert commented Aug 9, 2023

As far as I understand, for reproducing the test failure in CI, next to a recent dask version we'd need arrow>=7 present in the environment. This is a regular dependency of the conda , but not the pypi dask distribution.

@jakirkham
Copy link
Member Author

Is this still an issue with recent Dask releases? Asking as they may have fixed something upstream since this occurred

@GenevieveBuckley
Copy link
Collaborator

I'm not seeing any flaky/failing tests, so I don't think this is still happening currently. I'll close the issue, and if it pops up again we can re-open.

@jakirkham
Copy link
Member Author

Yeah I think part of the issue before was CI doesn't capture this edge case. Though maybe it should

@m-albert
Copy link
Collaborator

m-albert commented Mar 8, 2024

Reopening as this issue just came up in #355.

@jakirkham
Copy link
Member Author

Think this may have been fixed in the intervening time. The test suite no longer fails for me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants