MAINT: implement median #3819
The documentation of |
I've modified the docstring, and copied it to both functions. |
I don't think it's a good idea to implement even an approximate version of median() without any accuracy guarantees. See here for discussion about dask's percentile: #1225 A version based on |
Should we drop percentile then? It seems weird for us to provide one and not the other. Could use da.apply_along_axis for this sort of thing. Though that only works for a single axis. Definitely wouldn't work for the whole array. |
Yes, I'd support dropping percentile. It would be better to use apply_gufunc than apply_along_axis because it does vectorised calculations over each chunk. |
How does that work with median? |
Something like:
Applying this along axes other than the last would require a bit of dimension shuffling but would not be too difficult. We could/should probably add this into the |
Though there would need to be an explicit rechunking step for the axis (or axes) in question, right? |
It might be interesting to investigate auto-rechunking for apply_gufunc
|
Yes, automatic rechunking would be something to consider, perhaps based on whatever we use in general for |
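The apply_gufunc approach discussed above, including the explicit rechunking step, might be sketched roughly as follows. This is a minimal illustration and not necessarily the code proposed in the thread: the helper name median_last_axis is hypothetical, and it assumes the last axis is small enough to fit in a single chunk.

```python
import numpy as np
import dask.array as da


def median_last_axis(x):
    """Exact median along the last axis (hypothetical helper, not dask API)."""
    # apply_gufunc needs each core dimension in a single chunk,
    # so rechunk the last axis explicitly (-1 means "one chunk").
    x = x.rechunk({x.ndim - 1: -1})
    return da.apply_gufunc(
        lambda block: np.median(block, axis=-1),  # vectorised over each chunk
        "(i)->()",                                # reduce the core dimension i
        x,
        output_dtypes=float,
    )


data = np.arange(24, dtype=float).reshape(4, 6)
x = da.from_array(data, chunks=(2, 3))
result = median_last_axis(x).compute()
```

Other axes could be handled by moving the target axis to the end first, at the cost of the dimension shuffling mentioned above.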
To return to the question of percentile and median, it sounds like we are ok with one of the following:
1. Drop percentile and exclude median.
2. Fix percentile to work correctly and allow median.
Does that sound like an accurate summary? Anything missing above? |
I'm not entirely comfortable with dropping percentile. It sees active use, despite the lack of theory around it. |
So perhaps it would make more sense to fix percentile first to work correctly (e.g., by using For sufficiently randomly ordered inputs, the current percentile algorithm would work fine. But for some unknown fraction of cases, it returns silently incorrect results with no error bound. I would be OK imposing some pain on existing users to eliminate silent bugs, e.g., by introducing a new keyword argument |
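To make the concern about silently incorrect results concrete, here is a small sketch using only the standard library. It assumes a deliberately simplified chunk-wise scheme (the median of per-chunk medians), which is not dask's actual percentile algorithm; the function name is hypothetical. It only shows that combining per-chunk summaries can differ from the exact answer without raising any error.

```python
from statistics import median


def median_of_chunk_medians(chunks):
    """Simplified approximate median (an assumption, not dask's algorithm)."""
    return median([median(chunk) for chunk in chunks])


# Each chunk holds two small values and one large outlier.
chunks = [[1, 2, 100], [3, 4, 200], [5, 6, 300]]

approx = median_of_chunk_medians(chunks)
exact = median([value for chunk in chunks for value in chunk])

print(approx, exact)  # the two values differ, and nothing signals the error
```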
Agreed, fixing it sounds like the right move. Seems there have already been several issues logged with the current implementation (#731, #1212, #1225, #3099, #3115). A few of these have been closed, but more out of understanding of the known limitations than a real fix.
That said, of course this is a hard problem, and solving it exactly on arbitrarily large N-D data is probably not worth the effort (if it is even possible). It would be good if we could collect some user feedback about when people apply percentile. Do they actually want percentiles on the entire data or are they looking for some piece of it? Is it primarily 1-D data (e.g. time series) or is it higher dimensionality? What compromises are users willing to make (as there certainly will be a few ;)?
Agree an exact algorithm that can only work on a portion of the data might be pretty useful. Agree that mandatory shuffling could be pretty useful, though I don't know the existing algorithm well, so I don't know if there are more specific requirements about the shuffle.
Also @jcrist has crick, which we may consider adding as a dependency or (with his ok) moving over. |
I think this PR is ready for merge now given @shoyer's at pydata/xarray#2999 (comment) |
There are also lots of cases where an exact median is doable, specifically when we're computing the median only along some axis. My guess is that this is the more common case for dask array users (particularly among image processing use cases), but I'm not sure. Given this, I'm not sure how best to handle the approximate situation. |
See also #5575 |
I wonder also if doing something like |
Another issue came up recently dealing with medians (#4362). I think people would derive benefit from even an approximate median. |
@shoyer, does this implementation seem better? |
The optimal approach seems to be a median function with an extra argument that selects between an approximate and an exact median, ideally defaulting to the exact median to avoid surprises. |
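A rough sketch of what such a signature could look like, written against plain Python lists rather than dask arrays. The function name chunked_median and the approximate keyword are hypothetical, and the approximate branch uses a simple median of per-chunk medians purely as a stand-in for whatever approximation would actually be chosen.

```python
from statistics import median as _median


def chunked_median(chunks, approximate=False):
    """Median over a list of chunks (hypothetical API sketch).

    Defaults to the exact result; the cheaper approximation is opt-in.
    """
    if approximate:
        # Stand-in approximation: combine per-chunk medians only.
        return _median([_median(chunk) for chunk in chunks])
    # Exact path: needs all values together, which is the expensive part.
    return _median([value for chunk in chunks for value in chunk])


chunks = [[1, 2, 100], [3, 4, 200], [5, 6, 300]]
exact_result = chunked_median(chunks)
approx_result = chunked_median(chunks, approximate=True)
```

Making the exact path the default means a caller only gets an approximate answer after explicitly asking for one.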
Closing in favor of #9483 -- thanks all for engaging here |
This PR implements da.median and dask.array.Array.median. It is a simple wrapper around da.percentile. This would close #46
flake8 dask
TODO
dask.dataframe._Frame.median