Slice by dask array of ints #3407

Merged
merged 24 commits into from Jul 5, 2018

crusaderky
Collaborator

@crusaderky crusaderky commented Apr 15, 2018

Closes #3396 (together with #3405)

Strengths:

  • Supports chunking on the index
  • Very economical with RAM
  • Distributed-friendly: you'll get an output chunk for every chunk of the index, and the only data that travels across the network is the contribution of every input chunk to every output chunk.
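
The "contribution of every input chunk to every output chunk" idea can be sketched in plain Python. This is only an illustration under simplifying assumptions (1D lists instead of arrays, and hypothetical helper names `chunk_contribution` and `slice_by_int_chunks`); it is not the code in this PR:

```python
from itertools import accumulate


def chunk_contribution(x_chunk, offset, idx_chunk):
    """Return the (position, value) pairs that one input chunk contributes
    to one output chunk. `offset` is the global index of x_chunk[0]."""
    out = []
    for pos, i in enumerate(idx_chunk):
        if offset <= i < offset + len(x_chunk):
            out.append((pos, x_chunk[i - offset]))
    return out


def slice_by_int_chunks(x_chunks, idx_chunks):
    """Emulate x[idx] for chunked 1D data: one output chunk per index chunk."""
    # Global start offset of every input chunk.
    offsets = [0] + list(accumulate(len(c) for c in x_chunks))[:-1]
    out_chunks = []
    for idx_chunk in idx_chunks:
        out = [None] * len(idx_chunk)
        # Only the selected elements of each input chunk are passed along.
        for x_chunk, offset in zip(x_chunks, offsets):
            for pos, value in chunk_contribution(x_chunk, offset, idx_chunk):
                out[pos] = value
        out_chunks.append(out)
    return out_chunks


x_chunks = [[10, 11, 12], [13, 14, 15], [16, 17]]
idx_chunks = [[7, 0], [3, 3, 5]]
result = slice_by_int_chunks(x_chunks, idx_chunks)
print(result)
```

Each output chunk is assembled from small per-input-chunk pieces, so no worker ever needs the whole input array or the whole index in memory at once.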

Limitations:

  • 0D and 1D indexes only
  • no fancy indexing with multiple >=1D arrays (just like there isn't any with numpy indexes)
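
As a usage sketch of the supported case (a single 1D index), assuming a dask release that includes this change; the array contents and chunk sizes are made up for illustration:

```python
import numpy as np
import dask.array as da

x_np = np.arange(10, 20)
x = da.from_array(x_np, chunks=4)

# A 1D dask array of ints, itself split into chunks.
idx_np = np.array([7, 0, 3, 3, 9])
idx = da.from_array(idx_np, chunks=2)

# The index stays lazy until compute() is called.
result = x[idx].compute()

# The result should match plain NumPy integer indexing.
np.testing.assert_array_equal(result, x_np[idx_np])
```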

* Slicing one ``dask.array`` with another ``x[x > 0]``
* Slicing one `~dask.array.Array` with a `~dask.array.Array` of bools ``x[x > 0]``
* Slicing one `~dask.array.Array` with a zero or one-dimensional `~dask.array.Array`
of ints ``x[x.map_blocks(np.argsort)]``
Collaborator Author

This is a problematic example, as argsort won't behave as expected if x is chunked along the axis it operates on. It should be replaced with argtopk as soon as that is merged (#3405).

# Unsupported: 0d numpy array slicers (#3406)
# np.testing.assert_array_equal(x[idx0, idx0], d[idx0, idx0_d])
# np.testing.assert_array_equal(x[idx0, idx], d[idx0, idx_d])

Collaborator Author

Uncomment after fixing #3406


if np.isnan(x.chunks[axis]).any():
    raise NotImplementedError("Slicing an array with unknown chunks with a "
                              "dask.array of ints is not supported")
Collaborator Author

This is feasible, but it would greatly complicate everything because the graph would have to work out the offset and the chunks dynamically (see below). I could not think of a strong reason to spend the effort on it, as I think it's a very marginal use case.

token = tokenize(x, idx, axis)
name1 = 'slice_with_int_dask_array_chunk-' + token
name2 = 'slice_with_int_dask_array_aggregate-' + token

Collaborator Author

Suggestions are welcome on how to use some kind of wrapper function instead of building the graph manually.
I tried with atop but didn't have much success.

@jakirkham
Member

This looks pretty similar to PR ( #3210 ). Might be worth taking a look.

@jakirkham
Member

Is it possible to do something like x[x.nonzero()] with this implementation?

@crusaderky
Collaborator Author

crusaderky commented Apr 24, 2018

@jakirkham yes.

>>> import dask.array as da
>>> a = da.arange(-3, 3, chunks=3)
>>> a[a.nonzero()].compute()
array([-3, -2, -1,  1,  2])

[EDIT] to be more precise: the index can have NaN chunks, but the sliced array can't.
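
A short sketch of that distinction, assuming a dask release that includes this change. It reuses the array from the example above and inspects the chunks with the same `np.isnan` test that the guard in this PR applies to the sliced array; the variable names are illustrative:

```python
import numpy as np
import dask.array as da

a = da.arange(-3, 3, chunks=3)

# nonzero() returns a tuple of index arrays whose chunk sizes are unknown
# (NaN), because they depend on the data.
(idx,) = a.nonzero()
index_has_nan_chunks = bool(np.isnan(idx.chunks[0]).any())

# The array being sliced has fully known chunks, so the guard does not trigger.
sliced_has_nan_chunks = bool(np.isnan(a.chunks[0]).any())

result = a[a.nonzero()].compute()
```

Here the unknown chunk sizes live only in the index, which is the supported direction.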

@jakirkham
Member

Cool, thanks for checking. Could you please add a test along those lines?

@crusaderky
Collaborator Author

@jakirkham I added it before you even had time to ask :)

@crusaderky
Collaborator Author

@jakirkham any update on this?

@jakirkham
Member

Sorry, haven't had time to give it another look. Am pretty busy for the next month. Though this isn't necessarily a bad thing. Would be a great time to get some other thoughts on this. Expect @shoyer's got some good ideas.

@crusaderky crusaderky changed the title Slice by dask array of ints WIP: Slice by dask array of ints May 16, 2018
@crusaderky
Collaborator Author

Found a bug; I'll revert the title once it's ironed out.

@crusaderky crusaderky closed this May 19, 2018
@crusaderky crusaderky reopened this May 19, 2018
@crusaderky crusaderky changed the title WIP: Slice by dask array of ints Slice by dask array of ints May 19, 2018
@crusaderky
Collaborator Author

@jakirkham fixed bug and enhanced tests - ready for merge...

@jakirkham
Member

Sounds good. If you have any thoughts on take_along_axis, feel free to share them in issue ( #3663 ).

@crusaderky
Collaborator Author

@djhoese fixed uint index

Member

@shoyer shoyer left a comment

It looks like the only major thing to add here is handling negative indices, or at least being sure that we give a graceful error message?
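
For illustration, NumPy-style handling of negative indices can be sketched as below. The helper `normalize_int_index` is hypothetical and works on a plain list; with a lazy dask index the values are not known when the graph is built, so a check like this would presumably have to run per chunk at compute time:

```python
def normalize_int_index(idx, size):
    """Wrap negative indices the way NumPy does and fail loudly when an
    index is out of bounds for an axis of length `size`."""
    out = []
    for i in idx:
        j = i + size if i < 0 else i
        if not 0 <= j < size:
            raise IndexError(
                f"index {i} is out of bounds for axis with size {size}"
            )
        out.append(j)
    return out


normalized = normalize_int_index([-1, 0, -3], 5)
print(normalized)
```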

@@ -7,7 +7,7 @@ Changelog
Array
+++++

- Allow slicing a Dask Array by another one-dimensional Dask Array of integers
-
Member

you can probably add this back? :)

Collaborator Author

I removed it to reduce merge conflicts, in accordance with the new policy of the dask project.

Member

@shoyer shoyer left a comment

OK, this looks good to me. I'll merge this in a day or two unless anyone else has comments.

@jakirkham
Member

jakirkham commented Jun 27, 2018

Does this work with vindex now? If so, can we have a test (assuming it is not already there and I've overlooked it)?

@crusaderky
Collaborator Author

@jakirkham no test, and I didn't check. I'll look into it ASAP.

@crusaderky
Collaborator Author

@jakirkham vindex doesn't work, in the sense that the index is silently computed at definition time.
I gave it a brief look (I had never seen vindex before) and it looks very much NOT trivial. I would humbly suggest avoiding scope creep and opening a new PR after this one is merged.

I wrote initial unit tests for it; feel free to grab them: crusaderky@9cf947a

@jakirkham
Member

So I think you can steal the code out of PR ( #3210 ) to accomplish this. It already has the correct dispatch logic in vindex when Dask Arrays are present. You would just drop the atop call and the code following it, which can be replaced with a simple getitem call.

@crusaderky
Collaborator Author

@jakirkham @shoyer to me the functionality in vindex is a separate, albeit related, feature. Can we please merge this one and open a new PR if there aren't specific critiques?!?

@shoyer shoyer merged commit 8bc7003 into dask:master Jul 5, 2018
@shoyer
Member

shoyer commented Jul 5, 2018

Agreed, vindex is something separate.

Thanks!

@mrocklin
Member

mrocklin commented Jul 5, 2018 via email

@crusaderky
Collaborator Author

Thanks everybody for your help!
