[MRG+1] Isotonic regression duplicate fixes #4302
Conversation
Ok so there is some fun here. What do we do?
To be clear, we have two choices:
travis is not happy
2) is even more evil than 1)
Well currently we have tests that cannot be fulfilled simultaneously. We test that
1) sounds good to me
@amueller , don't we need to apply the
Force-pushed a15286e to 2c07413
@amueller , regarding a slightly faster way to implement the code, how about the comprehension below for non-weighted:
And for weighted:
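The comprehensions themselves did not survive in this transcript, so the snippet below is only a hypothetical reconstruction of the idea being discussed (array names and values are illustrative, not from the PR): collapse duplicate x values by plain or weighted averaging of the matching y values, accumulating the weights.

```python
# Hypothetical reconstruction -- the original snippets are not preserved here.
# Collapses duplicate x values by (weighted) averaging of y.
import numpy as np

x = np.array([1.0, 1.0, 2.0, 3.0, 3.0])
y = np.array([0.0, 2.0, 1.0, 4.0, 6.0])
w = np.array([1.0, 3.0, 1.0, 1.0, 1.0])

ux = np.unique(x)

# Non-weighted: plain mean of the y values at each unique x.
y_mean = np.array([y[x == v].mean() for v in ux])

# Weighted: weighted average of y, accumulating the weights.
y_wmean = np.array([np.average(y[x == v], weights=w[x == v]) for v in ux])
w_sum = np.array([w[x == v].sum() for v in ux])

print(ux)       # [1. 2. 3.]
print(y_wmean)  # [1.5 1.  5. ]
print(w_sum)    # [4. 1. 2.]
```

Note that each comprehension rescans the full array once per unique value, which is the quadratic behaviour discussed further down in the thread.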
@mjbommar just checked in Cython to do it. It might not be pretty, but it is pretty fast ;)
Yeah I think the missing order was a bug and needs a regression test.
Where? Do we?
On the flip side, the p-values of the "true" relationship are more informative than the p-values for the x-averaged data.
Force-pushed 3d566eb to f4780c3
I feel that if we include the
Force-pushed 5883ceb to 0a85873
@amueller , not too strong, to be honest. What about adding an
Force-pushed c727090 to bfb287e
Changes Unknown when pulling bfb287e on amueller:isotonic_regression_duplicate_fixes into scikit-learn:master.
@cython.cdivision(True)
def _make_unique(np.ndarray[dtype=np.float64_t] X,
                 np.ndarray[dtype=np.float64_t] y,
                 np.ndarray[dtype=np.float64_t] sample_weights):
Would be good to add a docstring to explain that the number of samples is reduced whenever there is a tie in X: the matching y values are averaged using the weights and the sample_weights are accumulated to conserve the original info.
It is my understanding that X should be sorted before calling this utility, this should also be made explicit in the docstring or in an inline comment to improve understandability of the code.
done.
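As a rough pure-Python sketch of the behaviour that docstring describes (this is not the actual Cython implementation, and the helper name `make_unique_sketch` is made up): X must be sorted, each run of tied X values is collapsed to a single sample whose y is the weighted average of the group's y values, and whose weight is the accumulated group weight.

```python
# Pure-Python sketch of the tie-collapsing behaviour described above
# (illustrative only -- the real _make_unique is written in Cython).
import itertools
import numpy as np

def make_unique_sketch(X, y, sample_weights):
    """Collapse ties in sorted X; average y by weight, accumulate weights."""
    out_x, out_y, out_w = [], [], []
    pairs = zip(X, zip(y, sample_weights))
    # groupby only merges consecutive equal keys, hence X must be sorted
    for x_val, group in itertools.groupby(pairs, key=lambda t: t[0]):
        ys, ws = zip(*(g[1] for g in group))
        w_total = sum(ws)
        out_x.append(x_val)
        out_y.append(sum(yi * wi for yi, wi in zip(ys, ws)) / w_total)
        out_w.append(w_total)
    return np.array(out_x), np.array(out_y), np.array(out_w)

X = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 2.0])
y = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
w = np.array([1.0, 1.0, 1.0, 2.0, 1.0, 1.0])
ux, uy, uw = make_unique_sketch(X, y, w)
print(ux)  # [0. 1. 2.]
print(uy)  # [2.  5.  3.5]
print(uw)  # [2. 1. 4.]
```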
This LGTM. All the calibration examples still look good. I have not checked the isotonic regression examples. I opened a PR against this PR amueller#26 to remove the calibration random jitter used for tie breaking.
@amueller before merging, please rewrite the commit message to make it more descriptive of the actual content of this commit.
merged your PR. We need another review, right?
Maybe @agramfort is around for a final review. @mjbommar WDYT of the current state of this PR?
Force-pushed 2465298 to 6ba63bd
reworded the commit message (rebase -i ftw).
Looked good to me earlier this week. Commits today were mostly cosmits, right? I have a fix for issue #4297 ready to PR once this is merged too, which would result in passing tests.
Yup, only cosmetics here.
This is kind of an important bug-fix for the release so it would be great if any other dev had time for a look.
…ion re: issue scikit-learn#4184
Expanding tests to include ties at both x_min and x_max
Updating unit test to include reference data against R's isotone gpava() with ties=primary
Adding R and isotone package versions for reproducibility/documentation
Removing double space in docstring
Combining tests for fit and transform with ties; fixing spelling error
This strategy allows us to make fit_transform(X) behave the same as fit(X).transform(X).
Remove test for not touching duplicate entries in fit_transform().
The isotonic regression routine now implements deterministic tie-breaking by default.
Force-pushed 6ba63bd to c1fa16f
And after this is merged, we need to pull my followup for the infinite loop too, so another PR review (though brief) after.
@jnothman it would be great if you could have a look at this one.
Or maybe @GaelVaroquaux ;)
@agramfort if you have time tomorrow, I would really like to have this fix in for the 0.16 beta but we cannot delay the release of the beta further.
@amueller , would swapping in my comprehensions make this many fewer LOC to review?
@mjbommar really? It gave me 40s while mine was three or four orders of magnitude faster... Also, I'm not sure the Cython is the problem. I'll check how many samples I used when I'm on the other box. But I think it was around 1e6 or 1e7.
Btw, I assumed that all samples but one are distinct. I think it is reasonable to assume that most samples are distinct. In this case, the list comprehension has quadratic complexity, right? |
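The complexity argument can be illustrated with a toy sketch (illustrative code, not from the PR): filtering the full array once per unique value costs roughly n·u comparisons, which is quadratic when samples are mostly distinct, while a single sweep over already-sorted data is linear.

```python
# Why the comprehension is quadratic when samples are mostly distinct:
# it rescans all n samples for each of the u unique values (~n*u work),
# whereas one sweep over sorted data is ~n.  Names are illustrative.
def collapse_comprehension(xs, ys):
    uniq = sorted(set(xs))
    # one full scan of xs per unique value -> O(n * u)
    return [(u,
             sum(yi for xi, yi in zip(xs, ys) if xi == u)
             / sum(1 for xi in xs if xi == u))
            for u in uniq]

def collapse_single_pass(xs, ys):
    # xs assumed sorted; one sweep -> O(n)
    out = []
    i, n = 0, len(xs)
    while i < n:
        j, total = i, 0.0
        while j < n and xs[j] == xs[i]:
            total += ys[j]
            j += 1
        out.append((xs[i], total / (j - i)))
        i = j
    return out

xs = [1.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 5.0, 7.0]
print(collapse_comprehension(xs, ys))  # [(1.0, 1.0), (2.0, 5.0), (3.0, 7.0)]
print(collapse_single_pass(xs, ys))    # same result, one pass
```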
@amueller , not arguing your solution was better, just really hoping we can get this into the 0.16 release and wondering what would help the rest of the team review
Yeah I understand your motivation, I just wanted to say that I think there is good reason. We will get this into 0.16. I suspect the reason that it is not reviewed yet is that it takes people a bit of time to understand what the actual problem was.
Alright, let's merge this fix.
…fixes [MRG+1] Isotonic regression duplicate fixes
Fixes #4184. With tests from #4185.
This is a stupid pure-python version. I'm not sure if there is an easy way to vectorize.
It now implements the "secondary" method, which basically replaces duplicate points with weighted averages. This is the only method that makes fit_transform behave identically to fit().transform(). I used a naive implementation of fit_transform().