
MAINT: implement median #3819

Closed
wants to merge 5 commits into from

Conversation

@stsievert (Member) commented Jul 26, 2018

This PR implements da.median and dask.array.Array.median. It is a simple wrapper around da.percentile.
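
For context, a minimal sketch of what such a wrapper might look like (illustrative only, not necessarily the PR's exact code; da.percentile operates on one-dimensional data, so the input is flattened here):

import dask.array as da

def median(a):
    # The median is the 50th percentile; da.percentile returns one value per
    # requested percentile, so take the single element of the result.
    return da.percentile(a.ravel(), [50])[0]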

This would close #46

  • Tests added and pass
  • Passes flake8 dask

TODO

@jakirkham (Member)

The documentation of percentile says it is approximate, which makes sense. How important is it to have the exact median? Should we add a similar note in da.median's docs?

@stsievert (Member, Author)

I've modified the docstring, and copied it to both functions.

@shoyer (Member) commented Jul 30, 2018

I don't think it's a good idea to implement even an approximate version of median() without any accuracy guarantees. See here for discussion about dask's percentile: #1225

A version based on dask.array.apply_gufunc would be fine, but it won't work in cases where the aggregated dimension is split across multiple chunks.

@jakirkham (Member)

Should we drop percentile then? It seems weird for us to provide one and not the other.

Could use da.apply_along_axis for this sort of thing, though that only works along a single axis; it definitely wouldn't work for the whole array.
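
For illustration, a rough sketch of that idea (the chunking here is arbitrary):

import dask.array as da
import numpy as np

x = da.random.random((100, 100), chunks=(10, 100))

# Apply NumPy's exact median along a single axis of the dask array.
med = da.apply_along_axis(np.median, 1, x)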

@shoyer (Member) commented Aug 2, 2018 via email

@jakirkham (Member)

How does that work with median?

@shoyer (Member) commented Aug 2, 2018

Something like:

import dask.array as da
import numpy as np

# Chunk only along the first axis, so the reduced (last) axis is unchunked.
x = da.random.random((100, 100), chunks=(10, 100))
# The '(x)->()' signature reduces the last axis of each block with NumPy's exact median.
res = da.apply_gufunc(lambda x: np.median(x, axis=-1), '(x)->()', x, output_dtypes=x.dtype)

Applying this along axes other than the last would require a bit of dimension shuffling but would not be too difficult. We could/should probably add this into the apply_gufunc() interface (#3843).
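
For illustration, one possible version of that dimension shuffling, using da.moveaxis (a sketch, not a committed design):

import dask.array as da
import numpy as np

x = da.random.random((100, 100), chunks=(100, 10))

# Move the axis we want to reduce (here axis 0) to the end, then apply the
# same gufunc; the moved axis must still be unchunked for this to work.
x_moved = da.moveaxis(x, 0, -1)
res = da.apply_gufunc(lambda v: np.median(v, axis=-1), '(i)->()', x_moved,
                      output_dtypes=x.dtype)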

@jakirkham (Member)

Though there would need to be an explicit rechunking step for the axis (or axes) in question, right?
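
For concreteness, a small sketch of such an explicit rechunking step using the existing Array.rechunk API (the chunk sizes are just an example):

import dask.array as da

x = da.random.random((100, 100), chunks=(10, 10))

# Merge the axis being reduced (here the last one) into a single chunk so an
# exact blockwise reduction can see all of it at once.
x = x.rechunk({1: -1})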

@mrocklin (Member) commented Aug 2, 2018 via email

@shoyer (Member) commented Aug 2, 2018

Yes, automatic rechunking would be something to consider, perhaps based on whatever we use in general for apply_gufunc() (the same issues apply).

@jakirkham (Member)

To return to the question of percentile and median, it sounds like we are ok with one of the following:

  1. Drop percentile and exclude median.
  2. Fix percentile to work correctly and allow median.

Does that sound like an accurate summary? Anything missing above?

@mrocklin (Member) commented Aug 2, 2018 via email

@shoyer (Member) commented Aug 2, 2018

So perhaps it would make more sense to fix percentile first to work correctly (e.g., by using apply_gufunc()). This would be a good idea even if the new algorithm doesn't work in every case (which it won't, since it requires that the aggregated dimension is unchunked).

For sufficiently randomly ordered inputs, the current percentile algorithm would work fine. But for some unknown fraction of cases, it returns silently incorrect results with no error bound.

I would be OK imposing some pain on existing users to eliminate silent bugs, e.g., by introducing a new keyword argument assume_randomly_ordered_input=False and requiring it to be explicitly flipped to True to use the existing algorithm.
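
To make the concern concrete, here is a comparison one could run; no particular output is claimed, the point is simply that the chunked approximation and NumPy's exact answer can silently differ on unfavourably ordered data:

import dask.array as da
import numpy as np

data = np.sort(np.random.random(10_000))     # far from randomly ordered
x = da.from_array(data, chunks=1_000)

approx = float(da.percentile(x, [50])[0])    # dask's chunk-and-merge estimate
exact = np.percentile(data, 50)              # NumPy's exact value
print(approx, exact)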

@jakirkham (Member)

Agreed, fixing it sounds like the right move. There have already been several issues logged against the current implementation (#731, #1212, #1225, #3099, #3115). A few of these have been closed, but more out of acceptance of the known limitations than because of a real fix.

That said, this is of course a hard problem, and solving it exactly on arbitrarily large N-D data is probably not worth the effort (if it is even possible). It would be good to collect some user feedback about how people apply percentile. Do they actually want percentiles over the entire data, or over some piece of it? Is it primarily 1-D data (e.g. time series) or higher-dimensional? What compromises are users willing to make (there will certainly be a few)?

Agreed, an exact algorithm that only works on a portion of the data could still be pretty useful.

Agreed that mandatory shuffling could be pretty useful, though I don't know the existing algorithm well, so I'm not sure whether there are more specific requirements on the shuffle.

Also, @jcrist has crick, which we might consider adding as a dependency or (with his OK) moving over.
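
For reference, a rough illustration of how crick's t-digest might be used for an approximate median (method names based on my reading of crick's TDigest API; treat this as a sketch):

import numpy as np
from crick import TDigest

# Stream values into a t-digest, then query an approximate quantile.
t = TDigest()
t.update(np.random.random(100_000))
approx_median = t.quantile(0.5)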

@stsievert (Member, Author)

I think this PR is ready for merge now, given @shoyer's comment at pydata/xarray#2999.

@mrocklin (Member)

There are also lots of cases where an exact median is doable, specifically when we're computing the median only along some axis. My guess is that this is the more common case for dask array users (particularly in image processing use cases), but I'm not sure. Given that, I'm not sure how best to handle the approximate situation.

@mrocklin (Member)

See also #5575

@jakirkham (Member)

I also wonder whether something like a histogram could be a good way of approximating the median in cases where an exact value is hard to come by. For small integral types, this could even compute the median exactly on large data.
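
A sketch of that histogram idea for a small integer dtype (illustrative only):

import dask.array as da
import numpy as np

# Count how often each of the 256 possible uint8-like values occurs.
x = da.random.randint(0, 256, size=(10_000_000,), chunks=1_000_000)
counts = da.bincount(x, minlength=256).compute()

# The (lower) median is the first value whose cumulative count reaches half
# of the total number of elements.
median = np.searchsorted(np.cumsum(counts), counts.sum() / 2)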

@jsignell (Member)

Another issue dealing with medians came up recently (#4362). I think people would benefit from even an approximate median.

@jakirkham (Member)

@shoyer, does this implementation seem better?

@jangorecki

The optimal approach seems to be providing a median function with an extra argument that decides whether to compute an approximation or the exact median, ideally defaulting to the exact median to avoid surprises.
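
A rough sketch of what such an exact path might look like (names and approach are illustrative, not the eventual dask implementation):

import dask.array as da
import numpy as np

def exact_median(a, axis):
    # Make the reduced axis a single chunk so every block sees the whole
    # axis, then apply NumPy's exact median block-by-block.
    axis = axis % a.ndim
    rechunked = a.rechunk({axis: -1})
    return rechunked.map_blocks(np.median, axis=axis, drop_axis=axis,
                                dtype=a.dtype)

x = da.random.random((1_000, 500), chunks=(100, 100))
m = exact_median(x, axis=1)    # exact median along axis 1, shape (1_000,)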

@jrbourbeau (Member)

Closing in favor of #9483 -- thanks all for engaging here

@jrbourbeau closed this Sep 13, 2022