Update scikit-learn to 1.4 #5851

betatim · 2024-04-17T13:38:29Z

This is an attempt to update the scikit-learn dependency from 1.2 to 1.4. Most changes are related to constructor arguments that were deprecated in 1.2 and in 1.4 have changed/been removed.

A question I have is what cuml's deprecation policy is? I've gone with "two releases" for parameters where we can easily do so (deprecated in 24.06 and then remove them in 24.10). However that is only about 4 months of deprecation which could be a bit short.

Some of the changes would be hard to do as a deprecation (with 1.4 there is no way to provide the "old way"), we'd have to stick with 1.3 for now. I think this is a bit of a bummer but maybe the price to pay for not keeping on top of deprecations. And it seems like there is no deprecation policy in the docs/towards users? So maybe we can play this card once now, to catch up and at the same time introduce a deprecation policy.

The SHAP test needed its reference updating. I am not sure why, at least I couldn't quickly find a reason for why you'd have to do this.

I am not sure how possible it would be to support a range of scikit-learn versions (say 1.2 - 1.4). Would be cool but maybe not worth the added complexity?

Todo:

xref #5799

copy-pr-bot · 2024-04-17T13:38:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

bdice · 2024-04-17T14:02:27Z

cuML closely mirrors scikit-learn, just as cuDF closely mirrors pandas. This alignment led us to a one-release deprecation policy in cuDF. I think it is totally fine to favor alignment with the target library over the stability of slower deprecation/removal. I would vote to deprecate anything you can in 24.06 with removal in 24.08. For anything you can’t safely deprecate, I would consider just making breaking changes to align with the latest scikit-learn.

https://docs.rapids.ai/api/cudf/stable/developer_guide/contributing_guide/#deprecating-and-removing-code

betatim · 2024-04-19T06:45:25Z

If there is prior art within RAPIDS then it makes some amount of sense to follow that (aka copy cudf). For me the important factor in deciding is how much time (in months) do we need to/want to give users to deal with deprecations. scikit-learn is quite conservative and gives you ~1 year to deal with things (1.2 deprecates something with a warning, six months later 1.3 continues to warn, another six months later 1.4 removes the warning and old behaviour). This means you have to deal with deprecations for quite a while (as a scikit-learn developer).

How did cudf decide on the mark as deprecated and warn in 22.08, remove in 22.10 cycle? That seems much quicker than pandas. From the doc you linked it sounds like they deprecate something with a warning in v1.x and will continue to warn until they release 2.0. My assumption is that a lot more than a few months pass between v1.x and v2.0. Potentially even longer than scikit-learn's ~1 year policy.

Overall I'd vote for using the same deprecation cycle and policy for cudf and cuml (and other RAPIDS packages). What ever the pros and cons of each policy might be, they are all outweighed by having something consistent across RAPIDS IMHO.

vyasr · 2024-04-20T00:22:26Z

tl;dr I think a 1-release deprecation cycle for cuml-specific behavior that is not matched by scikit-learn is fine, but if you have a deprecated code path that provides support for behavior that is deprecated in scikit-learn too, it's probably worth the effort of maintaining it as long as they do unless it starts posing a heavy undue burden.

We should differentiate between cudf's own deprecation policy and the policy cudf follows for matching pandas deprecations. If pandas deprecates something, we deprecate it as well in cudf, but we won't remove it until pandas does. That can be a very long time, e.g. DataFrame.append was deprecated in cudf for over a year. OTOH, if cudf has behavior that doesn't currently match pandas exactly and we decide to make some change in that realm, we follow the deprecation policy that Bradley linked to above. IOW, if you have to break current users of some cuml API to better align with scikit-learn and help future users migrate seamlessly from scikit-learn to cuml, that is a good tradeoff to make on a short (one-release) timescale.

While I do think we should try and align cudf and cuml (and other RAPIDS libraries) as much as possible in this regard, it is worth keeping in mind that the situations are a bit different. In general the scikit-learn API is far less expansive and has fewer sharp edges than the pandas API. Many pandas functions have very large API surfaces, both in terms of the number of parameters and the degree of flexibility in the ways that the corresponding arguments may be specified. Furthermore, there are a lot more behavioral differences that often cause headaches, usually around cases like null handling, nulls vs nans, empty columns, all-null columns, etc. We have historically had many cases where either specific combinations of parameters were intentionally doing something that we considered more sensible than the default pandas choice (possibly because we supported truly nullable data while pandas historically conflated nulls and nans), or some combination of arguments hits a data-specific edge case that we simply hadn't considered before. Especially now with cudf.pandas, more of those are things we would classify as bugs, but in the past some of these are cases that we classified as deprecating behavior in cudf to better align with pandas. I don't think you'll have nearly as much of an issue with scikit-learn, which in my experience has made very carefully designed API choices from the beginning and also has a somewhat simpler problem to deal with from an API perspective (data wrangling APIs are some of the hardest to design, whereas in ML much more of the complexity is algorithmic in nature).

betatim · 2024-04-22T12:57:53Z

Thinking out loud:

In vX of cuml it would depend on the latest released version of scikit-learn, say vY. The deprecations would mirror those of vY. Say some class' constructor argument is deprecated in vY, then cuml will also emit a warning. At some point in the future cuml vX+N will be released that depends on the then current version of scikit-learn which is vY+2 and the deprecations from vY will now be gone. This is because scikit-learn deprecations go for two releases.

This means work on cuml now, should attempt to match the deprecations of scikit-learn v1.5 which will probably be released in May, before cuml 24.06 comes out.

During the transition phase from the current state of cuml to this "in sync" state there will be a few breaking changes. This is because cuml currently depends on scikit-learn 1.2, so in order to switch to v1.5 we will have to make breaking changes. I think even changing to v1.4 requires a breaking change, but it is in LARS which is marked experimental.

I think matching the deprecations makes sense if you are in sync with scikit-learn. What do people think of this? I think it is probably desirable to be in sync.

A practical snag in the above, but solvable somehow, is that it is unknowable when a version of scikit-learn will be released. The target is to release roughly every six months, but this isn't a firm commitment. For example if it looks like a release would come out just before the end of December, it is likely to be moved to January. This means when we write the deprecation warning in cuml, we need to make an educated guess as to the version of cuml in which the deprecation expires.

vyasr · 2024-04-23T01:34:10Z

I think that strategy makes sense. One good way that you can verify this is by running your tests with warnings as errors. cudf already does this. That forces us to catch warnings from APIs that we expect to throw them, and when we write such tests we also catch the warnings from the corresponding cudf API. That allows us to verify that we're throwing the same warning as pandas. cuml might need to do more work to get its tests running without warnings before that is viable, though.

You're right that not knowing the exact schedule is a bit tricky. At that point I think it's mostly a social question of what you want to promise. You could avoid putting a specific removal timeline in the deprecation message, you could include a best-effort estimate, or you could say something like "will be removed by X if not sooner" if you can confidently upper bound. I think that's largely up to you.

betatim · 2024-05-03T12:39:21Z

I've updated the "will be removed by" version to 25.08. My best guess for when scikit-learn 1.7 will be released.

This PR updates cuml to scikit-learn 1.4, but 1.5 will probably be out in a week or two. Likely before 24.06 will be released, so maybe we merge this PR now and update cuml to scikit-learn 1.5 if it has been released in time.

dantegd · 2024-05-06T23:42:48Z

@betatim merged branch-24.06 into the PR to trigger CI

bdice · 2024-05-07T19:56:03Z

/ok to test

Mostly deals with parameters that were deprecated in scikit-learn 1.2 and are now no longer available. Update end version of deprecation Update version in which deprecation expires Change argument name to deal with deprecation

betatim · 2024-05-23T07:58:37Z

I started working on updating to 1.5 which was released yesterday

betatim · 2024-05-24T16:10:50Z

I'll take a look at the failing metrics comparison tests on Monday if no one else has an idea. A little unclear why suddenly the metrics should be different/need less precision to get the comparison to be equal.

dantegd · 2024-05-24T18:06:21Z

The distance metrics seem to only differ very slightly on ARM, I wonder if it's a change in sklearn or a dependency of it, but from a first looks seems too small to warrant worries

The arguments to r2_score were the wrong way around, this might explain the sudden decrease in the score which leads to failing tests.

betatim · 2024-05-27T12:23:20Z

I reduced the precision for the metrics tests. Looks like they pass now, except for the two ARM jobs that failed to start (is there a way to restart/re-run them?). Also fixed a few uses of deprecated constructor arguments and the usage of r2_score in the dask tests. You need to pass "the truth" as first argument and the model's predictions as second. I think something in the computation in scikit-learn changed that lead to this making a difference now (my guess based on the fact that changing the order seems to have fixed it).

dantegd · 2024-05-29T15:48:55Z

/merge

mmccarty · 2024-05-31T17:09:48Z

@betatim You may want to update the title of this PR to say 1.5. I didn't realize you went to the latest release until reading through the discussion.

betatim requested a review from a team as a code owner April 17, 2024 13:38

github-actions bot added the Cython / Python Cython or Python issue label Apr 17, 2024

dantegd added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 17, 2024

dantegd added the 2 - In Progress Currenty a work in progress label Apr 24, 2024

Update scikit-learn to 1.4

5deed40

Mostly deals with parameters that were deprecated in scikit-learn 1.2 and are now no longer available. Update end version of deprecation Update version in which deprecation expires Change argument name to deal with deprecation

betatim force-pushed the upgrade-sklearn-1.4 branch from 6db6835 to 5deed40 Compare May 8, 2024 08:17

betatim and others added 2 commits May 8, 2024 08:56

Add NVIDIA copyright

a7e043e

Merge branch 'branch-24.06' into upgrade-sklearn-1.4

d75f551

betatim added 3 commits May 24, 2024 06:40

Adjust to new quantile transformer behaviour

4a21a15

Update log_loss test to have normalised probabilities

404bd84

Update dependencies to scikit-learn 1.5

7eca428

betatim requested a review from a team as a code owner May 24, 2024 07:07

github-actions bot added the conda conda issue label May 24, 2024

betatim added 3 commits May 24, 2024 08:41

Add missing blank line

9ac5425

Merge branch 'branch-24.06' into upgrade-sklearn-1.4

04ac30c

Fix docstrings

9e2d841

Decrease precision to avoid spurious failures on ARM

372d226

Update tests

9d40ba3

The arguments to r2_score were the wrong way around, this might explain the sudden decrease in the score which leads to failing tests.

caryr35 assigned trxcllnt May 28, 2024

dantegd approved these changes May 29, 2024

View reviewed changes

AyodeAwe approved these changes May 29, 2024

View reviewed changes

rapids-bot bot merged commit 326b049 into rapidsai:branch-24.06 May 29, 2024
65 of 68 checks passed

betatim deleted the upgrade-sklearn-1.4 branch May 29, 2024 16:09

trxcllnt removed their assignment May 29, 2024

jarmak-nv mentioned this pull request Jun 3, 2024

RAPIDS 24.06 Databricks Deployment Docs Update rapidsai/deployment#373

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update scikit-learn to 1.4 #5851

Update scikit-learn to 1.4 #5851

betatim commented Apr 17, 2024 •

edited

copy-pr-bot bot commented Apr 17, 2024

bdice commented Apr 17, 2024

betatim commented Apr 19, 2024

vyasr commented Apr 20, 2024

betatim commented Apr 22, 2024

vyasr commented Apr 23, 2024

betatim commented May 3, 2024

dantegd commented May 6, 2024

bdice commented May 7, 2024

betatim commented May 23, 2024

betatim commented May 24, 2024

dantegd commented May 24, 2024

betatim commented May 27, 2024

dantegd commented May 29, 2024

mmccarty commented May 31, 2024

Update scikit-learn to 1.4 #5851

Update scikit-learn to 1.4 #5851

Conversation

betatim commented Apr 17, 2024 • edited

copy-pr-bot bot commented Apr 17, 2024

bdice commented Apr 17, 2024

betatim commented Apr 19, 2024

vyasr commented Apr 20, 2024

betatim commented Apr 22, 2024

vyasr commented Apr 23, 2024

betatim commented May 3, 2024

dantegd commented May 6, 2024

bdice commented May 7, 2024

betatim commented May 23, 2024

betatim commented May 24, 2024

dantegd commented May 24, 2024

betatim commented May 27, 2024

dantegd commented May 29, 2024

mmccarty commented May 31, 2024

betatim commented Apr 17, 2024 •

edited