
[MRG+1] Isotonic regression duplicate fixes #4302

Merged

Conversation

amueller
Member

Fixes #4184. With tests from #4185.
This is a naive pure-Python version; I'm not sure there is an easy way to vectorize it.
It now implements the "secondary" method, which replaces duplicate points with weighted averages. This is the only method that makes fit_transform() behave identically to fit().transform(). I used a naive implementation of fit_transform(). A rough sketch of the idea is below.
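
A rough NumPy sketch of that "secondary" tie handling (illustrative only; the code actually checked into this PR is Cython, and the names here are made up):

import numpy as np

def make_unique_sketch(X, y, sample_weight):
    # sort so that tied X values are adjacent
    order = np.argsort(X)
    X, y, w = X[order], y[order], sample_weight[order]
    X_out, y_out, w_out = [], [], []
    i = 0
    while i < len(X):
        # find the block of samples sharing the same X value
        j = i
        while j < len(X) and X[j] == X[i]:
            j += 1
        X_out.append(X[i])
        # duplicate y values are replaced by their weighted average,
        # and the weights are summed so no information is lost
        y_out.append(np.average(y[i:j], weights=w[i:j]))
        w_out.append(w[i:j].sum())
        i = j
    return np.array(X_out), np.array(y_out), np.array(w_out)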

@amueller
Member Author

Ok, so there is some fun here.
The test test_isotonic_regression checks that y is not changed when all x are equal (the comment in the test reads "check it doesn't change y when all x are equal").
That is only possible with the primary or the tertiary method. Obviously, if you don't change y when all x are equal, you cannot have transform and fit_transform give the same results.

What do we do?

@amueller
Member Author

To be clear, we have two choices (a sketch of the property at stake follows the list):

  1. Remove the test, breaking backward compatibility with the (somewhat broken) behavior.
  2. Implement the primary (or tertiary) method, making fit(X).transform(X) not equal to fit_transform(X).
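
A minimal sketch of the property that option 1 keeps, assuming the post-merge "secondary" tie handling described in this PR:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# ties at X == 1.0: the duplicate y values are averaged the same way in
# both code paths, so the two results below should agree
X = np.array([1.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 2.0, 3.0, 5.0])

via_fit_transform = IsotonicRegression().fit_transform(X, y)
via_fit_then_transform = IsotonicRegression().fit(X, y).transform(X)

print(via_fit_transform)
print(via_fit_then_transform)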

@amueller amueller added the Bug label Feb 27, 2015
@amueller amueller added this to the 0.16 milestone Feb 27, 2015
@agramfort
Member

Travis is not happy.

@agramfort
Member

agramfort commented Feb 27, 2015 via email

@amueller
Member Author

Well, we currently have tests that cannot be satisfied simultaneously: we test that fit_transform doesn't change y if all X are identical, and we have tests that ensure fit().transform() and fit_transform() are equivalent.
If we agree that 1) is the lesser evil, I'll remove the other test. I'm currently writing the Cython that can replace my naive Python.

@agramfort
Member

agramfort commented Feb 27, 2015 via email

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch 2 times, most recently from a15286e to 2c07413 Compare February 27, 2015 21:15
@mjbommar
Contributor

@amueller, regarding a slightly faster way to implement the code, how about the comprehension below for the unweighted case:

import numpy as np

# plain average of y over each group of tied X values
x_unique = np.unique(X)
y_unique = [np.average(y[X == x]) for x in x_unique]

And for the weighted case:

# weighted average, carrying the matching sample weights along
x_unique = np.unique(X)
y_unique = [np.average(y[X == x], weights=sample_weight[X == x]) for x in x_unique]

@amueller amueller changed the title WIP Isotonic regression duplicate fixes [MRG] Isotonic regression duplicate fixes Feb 27, 2015
@amueller
Member Author

@mjbommar I just checked in Cython code to do it. It might not be pretty, but it is pretty fast ;)

@amueller
Member Author

Yeah I think the missing order was a bug and needs a regression test.
Otherwise the PR seems good to go.

@amueller
Member Author

@mjbommar:

"don't we need to apply the _make_unique transform prior to the determination of increasing as well?"

Where? Do we?

@mjbommar
Contributor

When check_increasing runs in _build_y, we pass it data without running _make_unique first. Especially if someone were to choose "pearson", you could imagine outliers that would have been averaged out switching the detected direction.

On the flip side, the p-values of the "true" relationship are more informative than the p-values for the x-averaged data.

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 3d566eb to f4780c3 Compare February 27, 2015 21:36
@amueller
Member Author

I feel that if we include _make_unique in the check_increasing step, the behavior might be harder to understand. You are right that it might change behavior in edge cases. How strongly do you feel about this?

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 5883ceb to 0a85873 Compare February 27, 2015 21:47
@coveralls

Coverage Status

Coverage increased (+0.0%) to 95.1% when pulling 0a85873 on amueller:isotonic_regression_duplicate_fixes into 1655a04 on scikit-learn:master.

@mjbommar
Contributor

@amueller, not too strongly, to be honest. What about adding an "auto_unique" option to the increasing argument?

@coveralls

Coverage Status

Changes Unknown when pulling bfb287e on amueller:isotonic_regression_duplicate_fixes into * on scikit-learn:master*.

@cython.cdivision(True)
def _make_unique(np.ndarray[dtype=np.float64_t] X,
                 np.ndarray[dtype=np.float64_t] y,
                 np.ndarray[dtype=np.float64_t] sample_weights):
Member

Would be good to add a docstring to explain that the number of samples is reduced whenever there is a tie in X: the matching y values are averaged using the weights and the sample_weights are accumulated to conserve the original info.

It is my understanding that X should be sorted before calling this utility, this should also be made explicit in the docstring or in an inline comment to improve understandability of the code.
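
Roughly the shape of the docstring being asked for (hypothetical wording and signature, not the text that was actually committed):

def _make_unique(X, y, sample_weights):
    """Collapse duplicate X values into single points.

    X is assumed to be sorted before calling this utility. Whenever there
    is a tie in X, the matching y values are averaged using the weights,
    and the sample_weights are summed so the original information is
    conserved; the returned arrays are therefore shorter than the inputs
    whenever ties are present.
    """
    ...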

Member Author

done.

@ogrisel
Member

ogrisel commented Mar 2, 2015

This LGTM. All the calibration examples still look good. I have not checked the isotonic regression examples. I opened a PR against this PR amueller#26 to remove the calibration random jitter used for tie breaking.

@ogrisel
Member

ogrisel commented Mar 2, 2015

@amueller before merging, please rewrite the commit message to make it more descriptive of the actual content of this commit.

@amueller
Member Author

amueller commented Mar 2, 2015

merged your PR. We need another review, right?

@ogrisel
Member

ogrisel commented Mar 2, 2015

Maybe @agramfort is around for a final review. @mjbommar WDYT of the current state of this PR?

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 2465298 to 6ba63bd Compare March 2, 2015 22:33
@amueller
Member Author

amueller commented Mar 2, 2015

reworded the commit message (rebase -i ftw).

@mjbommar
Contributor

mjbommar commented Mar 2, 2015

Looked good to me earlier this week. Commits today were mostly cosmits, right?

I have a fix for issue #4297 ready to PR once this is merged too, which would result in passing tests.

@amueller
Member Author

amueller commented Mar 2, 2015

Yup, only cosmetics here.

@ogrisel ogrisel changed the title [MRG] Isotonic regression duplicate fixes [MRG+!] Isotonic regression duplicate fixes Mar 2, 2015
@ogrisel ogrisel changed the title [MRG+!] Isotonic regression duplicate fixes [MRG+1] Isotonic regression duplicate fixes Mar 2, 2015
@amueller
Member Author

amueller commented Mar 4, 2015

This is kind of an important bug fix for the release, so it would be great if any other dev had time for a look.

mjbommar and others added 3 commits March 4, 2015 12:04
…ion re: issue scikit-learn#4184

Expanding tests to include ties at both x_min and x_max

Updating unit test to include reference data against R's isotone gpava() with ties=primary

Adding R and isotone package versions for reproducibility/documentation

Removing double space in docstring

Combining tests for fit and transform with ties; fixing spelling error

This strategy allows us to make fit_transform(X) behave the same as fit(X).transform(X).

Remove test for not touching duplicate entries in fit_transform().

The isotonic regression routine now implements deterministic tie-breaking by default.
@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 6ba63bd to c1fa16f Compare March 4, 2015 17:04
@mjbommar
Contributor

mjbommar commented Mar 4, 2015

And after this is merged, we need to pull in my follow-up for the infinite loop too, so there will be another (brief) PR review after this one.

@ogrisel
Member

ogrisel commented Mar 5, 2015

@jnothman it would be great if you could have a look at this one.

@amueller
Member Author

amueller commented Mar 5, 2015

Or maybe @GaelVaroquaux ;)

@ogrisel
Member

ogrisel commented Mar 5, 2015

@agramfort if you have time tomorrow, I would really like to have this fix in for the 0.16 beta but we cannot delay the release of the beta further.

@mjbommar
Contributor

mjbommar commented Mar 6, 2015

@amueller, would swapping in my comprehensions make this many fewer LOC to review? %timeit gave me <50 ms timings for realistic sample sizes, though I didn't actually compare your Cython to it.

@amueller
Member Author

amueller commented Mar 6, 2015

@mjbommar really? It gave me 40 s, while mine was three or four orders of magnitude faster... Also, I'm not sure the Cython is the problem. I'll check how many samples I used when I'm on the other box, but I think it was around 1e6 or 1e7.

@amueller
Member Author

amueller commented Mar 6, 2015

By the way, I assumed that all samples but one are distinct. I think it is reasonable to assume that most samples are distinct, and in that case the list comprehension has quadratic complexity, right? (A sketch of the difference is below.)
I'm usually not one to jump to Cython quickly, but quadratic complexity for an algorithm that clearly should be linear is pretty bad.
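
To make the complexity argument concrete, a rough sketch (illustrative only, not the code in this PR): the comprehension rescans all of X once per unique value, while a single pass over the sorted data is linear apart from the sort.

import numpy as np

rng = np.random.RandomState(0)
n = 10**6
X = rng.rand(n)  # mostly distinct values
y = rng.rand(n)

# quadratic: one boolean scan of X per unique value, O(n_unique * n)
# y_unique = [np.average(y[X == x]) for x in np.unique(X)]

# linear after the O(n log n) sort: group equal neighbours in one pass
order = np.argsort(X, kind="mergesort")
X_s, y_s = X[order], y[order]
_, start = np.unique(X_s, return_index=True)  # start index of each tie group
sums = np.add.reduceat(y_s, start)
counts = np.diff(np.append(start, len(X_s)))
y_unique = sums / counts  # unweighted average per unique X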

@mjbommar
Contributor

mjbommar commented Mar 6, 2015

@amueller, I'm not arguing about whose solution is better; I'm just really hoping we can get this into the 0.16 release and wondering what would help the rest of the team review it.

@amueller
Member Author

amueller commented Mar 6, 2015

Yeah, I understand your motivation; I just wanted to say that I think there is a good reason for it. We will get this into 0.16. I suspect the reason it hasn't been reviewed yet is that it takes people a bit of time to understand what the actual problem was.

@ogrisel
Member

ogrisel commented Mar 6, 2015

Alright, let's merge this fix.

ogrisel added a commit that referenced this pull request Mar 6, 2015
…fixes

[MRG+1] Isotonic regression duplicate fixes
@ogrisel ogrisel merged commit ec84a00 into scikit-learn:master Mar 6, 2015
@amueller amueller deleted the isotonic_regression_duplicate_fixes branch March 6, 2015 16:30

Successfully merging this pull request may close these issues.

IsotonicRegression results differ between fit/transform and fit_transform with ties in X