
[MRG+1] Isotonic regression duplicate fixes #4302

Merged

Conversation

amueller
Member

Fixes #4184. With tests from #4185.
This is a naive pure-Python version; I'm not sure there is an easy way to vectorize it.
It now implements the "secondary" method, which replaces duplicate points with weighted averages. This is the only method that makes fit_transform() behave identically to fit().transform(). I used a naive implementation of fit_transform(). A rough sketch of the idea is below.
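
A rough NumPy sketch of that "secondary" tie handling (illustrative only; the code actually checked into this PR is Cython, and the names here are made up):

import numpy as np

def make_unique_sketch(X, y, sample_weight):
    # sort so that tied X values are adjacent
    order = np.argsort(X)
    X, y, w = X[order], y[order], sample_weight[order]
    X_out, y_out, w_out = [], [], []
    i = 0
    while i < len(X):
        # find the block of samples sharing the same X value
        j = i
        while j < len(X) and X[j] == X[i]:
            j += 1
        X_out.append(X[i])
        # duplicate y values are replaced by their weighted average,
        # and the weights are summed so no information is lost
        y_out.append(np.average(y[i:j], weights=w[i:j]))
        w_out.append(w[i:j].sum())
        i = j
    return np.array(X_out), np.array(y_out), np.array(w_out)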

@amueller
Member Author

Ok, so there is some fun here.
The test test_isotonic_regression checks that y is not changed when all x are equal (the comment in the test reads "check it doesn't change y when all x are equal").
That is only possible with the primary or the tertiary method. Obviously, if you don't change y when all x are equal, you cannot have transform and fit_transform give the same results.

What do we do?

@amueller
Member Author

To be clear, we have two choices (a sketch of the property at stake follows the list):

  1. Remove the test, breaking backward compatibility with the (somewhat broken) behavior.
  2. Implement the primary (or tertiary) method, making fit(X).transform(X) not equal to fit_transform(X).
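
A minimal sketch of the property that option 1 keeps, assuming the post-merge "secondary" tie handling described in this PR:

import numpy as np
from sklearn.isotonic import IsotonicRegression

# ties at X == 1.0: the duplicate y values are averaged the same way in
# both code paths, so the two results below should agree
X = np.array([1.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 2.0, 3.0, 5.0])

via_fit_transform = IsotonicRegression().fit_transform(X, y)
via_fit_then_transform = IsotonicRegression().fit(X, y).transform(X)

print(via_fit_transform)
print(via_fit_then_transform)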

@amueller amueller added the Bug label Feb 27, 2015
@amueller amueller added this to the 0.16 milestone Feb 27, 2015
@agramfort
Member

Travis is not happy.

@agramfort
Member

agramfort commented Feb 27, 2015 via email

@amueller
Member Author

Well, we currently have tests that cannot be satisfied simultaneously: we test that fit_transform doesn't change y if all X are identical, and we have tests that ensure fit().transform() and fit_transform() are equivalent.
If we agree that 1) is the lesser evil, I'll remove the other test. I'm currently writing the Cython that can replace my naive Python.

@agramfort
Member

agramfort commented Feb 27, 2015 via email

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch 2 times, most recently from a15286e to 2c07413 Compare February 27, 2015 21:15
@mjbommar
Contributor

@amueller, regarding a slightly faster way to implement the code, how about the comprehension below for the unweighted case:

import numpy as np

# plain average of y over each group of tied X values
x_unique = np.unique(X)
y_unique = [np.average(y[X == x]) for x in x_unique]

And for the weighted case:

# weighted average, carrying the matching sample weights along
x_unique = np.unique(X)
y_unique = [np.average(y[X == x], weights=sample_weight[X == x]) for x in x_unique]

@amueller amueller changed the title WIP Isotonic regression duplicate fixes [MRG] Isotonic regression duplicate fixes Feb 27, 2015
@amueller
Member Author

@mjbommar I just checked in Cython code to do it. It might not be pretty, but it is pretty fast ;)

@amueller
Member Author

Yeah I think the missing order was a bug and needs a regression test.
Otherwise the PR seems good to go.

@amueller
Member Author

@mjbommar:

"don't we need to apply the _make_unique transform prior to the determination of increasing as well?"

Where? Do we?

@mjbommar
Contributor

When check_increasing runs in _build_y, we pass it data without running _make_unique first. Especially if someone were to choose "pearson", you could imagine outliers that would have been averaged out switching the detected direction.

On the flip side, the p-values of the "true" relationship are more informative than the p-values for the x-averaged data.

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 3d566eb to f4780c3 Compare February 27, 2015 21:36
@amueller
Member Author

I feel that if we include _make_unique in the check_increasing step, the behavior might be harder to understand. You are right that it might change behavior in edge cases. How strongly do you feel about this?

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 5883ceb to 0a85873 Compare February 27, 2015 21:47
@coveralls

Coverage Status

Coverage increased (+0.0%) to 95.1% when pulling 0a85873 on amueller:isotonic_regression_duplicate_fixes into 1655a04 on scikit-learn:master.

@mjbommar
Contributor

@amueller, not too strongly, to be honest. What about adding an "auto_unique" option to the increasing argument?

@coveralls

Coverage Status

Changes Unknown when pulling bfb287e on amueller:isotonic_regression_duplicate_fixes into * on scikit-learn:master*.

@cython.cdivision(True)
def _make_unique(np.ndarray[dtype=np.float64_t] X,
                 np.ndarray[dtype=np.float64_t] y,
                 np.ndarray[dtype=np.float64_t] sample_weights):
Member

Would be good to add a docstring to explain that the number of samples is reduced whenever there is a tie in X: the matching y values are averaged using the weights and the sample_weights are accumulated to conserve the original info.

It is my understanding that X should be sorted before calling this utility, this should also be made explicit in the docstring or in an inline comment to improve understandability of the code.
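
Roughly the shape of the docstring being asked for (hypothetical wording and signature, not the text that was actually committed):

def _make_unique(X, y, sample_weights):
    """Collapse duplicate X values into single points.

    X is assumed to be sorted before calling this utility. Whenever there
    is a tie in X, the matching y values are averaged using the weights,
    and the sample_weights are summed so the original information is
    conserved; the returned arrays are therefore shorter than the inputs
    whenever ties are present.
    """
    ...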

Member Author

done.

@ogrisel
Member

ogrisel commented Mar 2, 2015

This LGTM. All the calibration examples still look good. I have not checked the isotonic regression examples. I opened a PR against this PR amueller#26 to remove the calibration random jitter used for tie breaking.

@ogrisel
Member

ogrisel commented Mar 2, 2015

@amueller before merging, please rewrite the commit message to make it more descriptive of the actual content of this commit.

@amueller
Member Author

amueller commented Mar 2, 2015

merged your PR. We need another review, right?

@ogrisel
Member

ogrisel commented Mar 2, 2015

Maybe @agramfort is around for a final review. @mjbommar WDYT of the current state of this PR?

@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 2465298 to 6ba63bd Compare March 2, 2015 22:33
@amueller
Member Author

amueller commented Mar 2, 2015

reworded the commit message (rebase -i ftw).

@mjbommar
Contributor

mjbommar commented Mar 2, 2015

Looked good to me earlier this week. Commits today were mostly cosmits, right?

I have a fix for issue #4297 ready to PR once this is merged too, which would result in passing tests.

@amueller
Member Author

amueller commented Mar 2, 2015

Yup, only cosmetics here.

@ogrisel ogrisel changed the title [MRG] Isotonic regression duplicate fixes [MRG+!] Isotonic regression duplicate fixes Mar 2, 2015
@ogrisel ogrisel changed the title [MRG+!] Isotonic regression duplicate fixes [MRG+1] Isotonic regression duplicate fixes Mar 2, 2015
@amueller
Member Author

amueller commented Mar 4, 2015

This is kind of an important bug fix for the release, so it would be great if any other dev had time for a look.

mjbommar and others added 3 commits March 4, 2015 12:04
…ion re: issue scikit-learn#4184

Expanding tests to include ties at both x_min and x_max

Updating unit test to include reference data against R's isotone gpava() with ties=primary

Adding R and isotone package versions for reproducibility/documentation

Removing double space in docstring

Combining tests for fit and transform with ties; fixing spelling error

This strategy allows us to make fit_transform(X) behave the same as fit(X).transform(X).

Remove test for not touching duplicate entries in fit_transform().

The isotonic regression routine now implements deterministic tie-breaking by default.
@amueller amueller force-pushed the isotonic_regression_duplicate_fixes branch from 6ba63bd to c1fa16f Compare March 4, 2015 17:04
@mjbommar
Contributor

mjbommar commented Mar 4, 2015

And after this is merged, we need to pull in my follow-up for the infinite loop too, so there will be another (brief) PR review after this one.

@ogrisel
Member

ogrisel commented Mar 5, 2015

@jnothman it would be great if you could have a look at this one.

@amueller
Member Author

amueller commented Mar 5, 2015

Or maybe @GaelVaroquaux ;)

@ogrisel
Member

ogrisel commented Mar 5, 2015

@agramfort if you have time tomorrow, I would really like to have this fix in for the 0.16 beta but we cannot delay the release of the beta further.

@mjbommar
Contributor

mjbommar commented Mar 6, 2015

@amueller, would swapping in my comprehensions make this many fewer LOC to review? %timeit gave me <50 ms timings for realistic sample sizes, though I didn't actually compare your Cython to it.

@amueller
Member Author

amueller commented Mar 6, 2015

@mjbommar really? It gave me 40 s, while mine was three or four orders of magnitude faster... Also, I'm not sure the Cython is the problem. I'll check how many samples I used when I'm on the other box, but I think it was around 1e6 or 1e7.

@amueller
Member Author

amueller commented Mar 6, 2015

By the way, I assumed that all samples but one are distinct. I think it is reasonable to assume that most samples are distinct, and in that case the list comprehension has quadratic complexity, right? (A sketch of the difference is below.)
I'm usually not one to jump to Cython quickly, but quadratic complexity for an algorithm that clearly should be linear is pretty bad.
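
To make the complexity argument concrete, a rough sketch (illustrative only, not the code in this PR): the comprehension rescans all of X once per unique value, while a single pass over the sorted data is linear apart from the sort.

import numpy as np

rng = np.random.RandomState(0)
n = 10**6
X = rng.rand(n)  # mostly distinct values
y = rng.rand(n)

# quadratic: one boolean scan of X per unique value, O(n_unique * n)
# y_unique = [np.average(y[X == x]) for x in np.unique(X)]

# linear after the O(n log n) sort: group equal neighbours in one pass
order = np.argsort(X, kind="mergesort")
X_s, y_s = X[order], y[order]
_, start = np.unique(X_s, return_index=True)  # start index of each tie group
sums = np.add.reduceat(y_s, start)
counts = np.diff(np.append(start, len(X_s)))
y_unique = sums / counts  # unweighted average per unique X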

@mjbommar
Contributor

mjbommar commented Mar 6, 2015

@amueller, I'm not arguing about whose solution is better; I'm just really hoping we can get this into the 0.16 release and wondering what would help the rest of the team review it.

@amueller
Member Author

amueller commented Mar 6, 2015

Yeah, I understand your motivation; I just wanted to say that I think there is a good reason for it. We will get this into 0.16. I suspect the reason it hasn't been reviewed yet is that it takes people a bit of time to understand what the actual problem was.

@ogrisel
Member

ogrisel commented Mar 6, 2015

Alright, let's merge this fix.

ogrisel added a commit that referenced this pull request Mar 6, 2015
…fixes

[MRG+1] Isotonic regression duplicate fixes
@ogrisel ogrisel merged commit ec84a00 into scikit-learn:master Mar 6, 2015
@amueller amueller deleted the isotonic_regression_duplicate_fixes branch March 6, 2015 16:30

Successfully merging this pull request may close these issues.

IsotonicRegression results differ between fit/transform and fit_transform with ties in X