Array API support with lazy evaluation #26724
I ran a quick experiment with dask. As you can see in the last commit (a8ab0d3), I did some hackish monkey-patches to make it work, and in the end I could successfully run scikit-learn code on dask arrays. Note that there is a meta-issue in dask to track Array API compliance.
I reported some of the problems I saw during my experiment and linked them to that meta-issue.
I don't know what the status of dask/dask#8750 is, but we can look at adding dask to the compat library if that helps things move faster. On the other hand, it may be unnecessary to do that if they are going to add array_api as a separate namespace, because there won't be any backwards compatibility concerns.
Any opinion on the above @tomwhite / @jrbourbeau? Having temporary support for dask in array-api-compat while the official namespace (dask/dask#8750) is still pending could be useful. But maybe exposing two competing ways to consume dask arrays from array-api-agnostic code might also introduce unnecessary complexity in the long run.
I would favour the array-api-compat route, given that NumPy is going to incorporate the Array API standard into its standard namespace (see dask/dask#8750 (comment), and the NumPy 2 work). This would mean that dask/dask#8750 is not needed at all, avoiding another namespace to support. The issue for adding Dask support is data-apis/array-api-compat#17, and it should be fairly straightforward to use some of the work in dask/dask#8750 as a guide at least.
I think explaining to users that a |
This was discussed a little in #28588. Here's the writeup detailing our conversation there.

Summary

The major issue with lazy Array API implementations in scikit-learn is that after certain operations (e.g. boolean masking, unique()), the lazy array ends up with a NaN shape, because the shape of an array produced by a data-dependent operation is unknown until computation is performed. This is allowed by the Array API spec, which permits dimensions to be None. However, because scikit-learn depends on the shape being a valid numerical value, this can lead to silent correctness problems or outright errors in scikit-learn, depending on what the shape is used for. As of right now, scikit-learn uses .shape in many places.
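To make the problem concrete, here is a minimal eager NumPy sketch of the two data-dependent operations mentioned above. Eager NumPy knows the result shapes immediately; a lazy backend such as dask reports these same shapes as NaN until the graph is actually computed:

```python
import numpy as np

x = np.array([1, 2, 2, 3, 3, 3])

# Output shape depends on how many distinct values are in the data:
u = np.unique(x)   # shape (3,) -- unknowable without inspecting x

# Output shape depends on how many elements satisfy the predicate:
m = x[x > 1]       # shape (5,) -- again data-dependent

print(u.shape, m.shape)
```

With dask, both `u.shape` and `m.shape` would contain NaN dimensions at graph-construction time, which is exactly what trips up shape-consuming code downstream.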
(Relevant GitHub code search: https://github.com/search?q=repo%3Ascikit-learn%2Fscikit-learn+.shape+path%3A%2F%5Esklearn%5C%2F%2F&type=code)

Normally, when writing code for e.g. dask, or any other lazy implementation, users can call compute() to materialize the array and obtain a concrete shape. This is not possible with the Array API, though, as there is no method/function to materialize lazy arrays in the standard. Furthermore, figuring out when materialization is actually needed is itself non-trivial.

Current failures
Motivation

Lazy Array API implementations can provide significant speedups by optimizing computation. (Note: dask implements a limited subset of array optimizations.)

Potential Solutions
To be upstreamed: a helper that calls compute if necessary on the array before returning a non-NaN shape. (Note: this doesn't work in all cases.)
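A sketch of what such a helper could look like. The name `materialized_shape` and the dask-style `.compute()` call are assumptions for illustration, not an agreed API:

```python
import math


def materialized_shape(x):
    """Hypothetical helper: return a fully concrete shape, triggering
    computation only when a dimension is unknown (lazy backends report
    data-dependent dimensions as None or NaN)."""
    if any(d is None or (isinstance(d, float) and math.isnan(d))
           for d in x.shape):
        # `.compute()` is dask-specific and not part of the Array API
        # standard -- this is exactly the portability gap noted above.
        x = x.compute()
    return x.shape
```

For eager arrays the shape is returned without any extra work; only arrays reporting an unknown dimension pay the cost of a computation, which is why such a helper would need to live in a compat layer rather than in standard-only code.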
Which estimators depend on data-dependent shapes?
LDA does at the very least: there's a call to unique. This was removed from #28588, since there's another dask bug that would prevent it from working.
How does it result in silent correctness problems? Is it because dask uses NaN instead of None?
Yeah, e.g. a validation check comparing against a NaN shape value, or something like that.
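As a hedged sketch (not the exact scikit-learn code) of how a NaN shape makes a validation check silently pass:

```python
# A lazy backend may report unique(y).shape[0] as NaN before computation:
n_classes = float("nan")

# Every comparison with NaN evaluates to False, so this guard never fires;
# the error is silently skipped instead of being raised:
if n_classes < 2:
    raise ValueError("This classifier needs at least 2 classes")

print("check silently passed with n_classes =", n_classes)
```

This is the "silent correctness problem": with None the comparison would raise a TypeError loudly, while with NaN the invalid input slips through.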
All classifiers use unique (to determine the set of classes).
Note that dask also exposes an alternative to compute: persist. Anyway, before making a decision, we should have a look at how other candidate Array API libraries with lazy evaluation semantics behave on those patterns (shape of unique, and masked assignment with a data-dependent boolean mask). Here are libraries to investigate:
At the moment, our Array API integration implicitly assumes eager evaluation semantics.
Furthermore, we have not tested our code to see how it would behave with candidate Array API implementations that have lazy evaluation semantics (e.g. dask, jax, ...).
The purpose of this issue is to track what would be needed to make our Array API estimator work with lazy evaluation.
I think we should first investigate where this breaks, then decide whether or not we would like to support lazy evaluation semantics in scikit-learn via the Array API support. If so, we should open a meta-issue to add a common test with an estimator tag, and progressively fix the Array API estimators to deal with lazy evaluation.
Note that a particular point about lazy vs eager evaluation for Array API consuming libraries is being discussed upstream.