Adding unit test to cover ties/duplicate x values in Isotonic Regression... #4185
Conversation
Small note: Travis failures are intended, as this is meant to cover the issue highlighted in #4184.
I think we should maybe hard-code what the expected result is. In master, neither …
Hi @amueller , yes, great suggestion. I used the R isotone package's examples from the Leeuw et al. paper in JSS as a base, and have committed expanded unit tests based on this.
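As a rough sketch of the invariant these tests check (the data below is illustrative, not taken from the R isotone examples): on a release where #4184 is fixed, fit followed by transform should match fit_transform even when x contains ties.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative data with duplicate x values; y is non-monotone within the ties.
x = np.array([1.0, 1.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 1.0, 2.0, 2.5, 4.0, 5.0])

ir = IsotonicRegression()
y_fit_transform = ir.fit_transform(x, y)
y_fit_then_transform = ir.fit(x, y).transform(x)

# On a fixed version, both code paths yield the same fitted values.
print(np.allclose(y_fit_transform, y_fit_then_transform))
```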
def test_isotonic_regression_ties_primary_fit_transform():
    """
    Test isotonic regression fit_transform against the "primary" ties method
Two spaces after `transform`, but besides that LGTM if Travis is happy :)
thanks @mjbommar
oops, fixed now, thanks.
travis was happy other than these 3 failures:
======================================================================
FAIL: sklearn.tests.test_isotonic.test_isotonic_regression_ties_min
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/data/workspace/scikit-learn/sklearn/tests/test_isotonic.py", line 92, in test_isotonic_regression_ties_min
assert_array_equal(ir.fit(x, y).transform(x), ir.fit_transform(x, y))
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 739, in assert_array_equal
verbose=verbose, header='Arrays are not equal')
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 665, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not equal
(mismatch 28.5714285714%)
x: array([ 0., 0., 0., 3., 4., 5., 6.])
y: array([ 0., 1., 2., 3., 4., 5., 6.])
======================================================================
FAIL: sklearn.tests.test_isotonic.test_isotonic_regression_ties_max
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/data/workspace/scikit-learn/sklearn/tests/test_isotonic.py", line 103, in test_isotonic_regression_ties_max
assert_array_equal(ir.fit(x, y).transform(x), ir.fit_transform(x, y))
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 739, in assert_array_equal
verbose=verbose, header='Arrays are not equal')
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 665, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not equal
(mismatch 33.3333333333%)
x: array([ 1., 2., 3., 4., 0., 0.])
y: array([ 1., 2., 3., 4., 5., 6.])
======================================================================
FAIL: Test isotonic regression fit, transform against the "primary" ties method
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
self.test(*self.arg)
File "/data/workspace/scikit-learn/sklearn/tests/test_isotonic.py", line 134, in test_isotonic_regression_ties_primary_fit
assert_array_equal(ir.transform(x), y_true)
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 739, in assert_array_equal
verbose=verbose, header='Arrays are not equal')
File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 665, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Arrays are not equal
(mismatch 100.0%)
x: array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
y: array([ 21. , 22.375, 22.375, 22.375, 22.375, 22.375, 22.375,
22.375, 22.375, 23.5 , 25. ])
----------------------------------------------------------------------
Ran 24 tests in 0.041s
FAILED (failures=3)
travis still complains :(
@agramfort , the purpose of these unit tests is to highlight a current issue that exists in 0.15.2 and 0.16-dev. The three failures occurring in Travis are intended :)
assert_array_equal(ir.transform(x), y_true)
def test_isotonic_regression_ties_primary_fit_transform():
I would put the two in the same test, I think.
Done!
arrfff ... sorry for the noise. maybe @fabianp can have a look too.
Our policy is not to commit failing tests to master without the fix. The reason is to always be able to keep master green. If master is not green, it has a detrimental psychological effect on quality.
@GaelVaroquaux , understood; just trying to help whoever picks up issue #4184.
I think we should add a …
@amueller, +1 from my perspective. In my experience, the "secondary" and "tertiary" options described in the JSS paper are more useful than "primary", given that "primary" does not necessarily produce bijective mappings. Small chance I might be able to do this in the next week or two. Any opposition to reworking the cython source along these lines:
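For concreteness, the "secondary" strategy from the JSS paper can be sketched in plain numpy (function names are mine, and the sample data approximates the pituitary example from the paper; treat both as illustrative): collapse tied x values to their group mean, weight by group size, then run pool-adjacent-violators on the collapsed series.

```python
import numpy as np


def pava(y, w):
    """Pool Adjacent Violators on presorted data; returns the isotonic fit."""
    stack = []  # merged blocks, each as [mean, weight, count]
    for yi, wi in zip(y, w):
        stack.append([float(yi), float(wi), 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(stack) > 1 and stack[-2][0] > stack[-1][0]:
            m2, w2, c2 = stack.pop()
            m1, w1, c1 = stack.pop()
            wt = w1 + w2
            stack.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    out = []
    for m, _, c in stack:
        out.extend([m] * c)
    return np.array(out)


def secondary_ties_fit(x, y):
    """'Secondary' tie handling: average y over duplicate x, weight each
    group by its size, then apply PAVA to the collapsed sequence."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="mergesort")
    x, y = x[order], y[order]
    ux, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    sums = np.zeros(len(ux))
    np.add.at(sums, inverse, y)   # sum y within each tied group
    return ux, pava(sums / counts, counts)


# Data resembling the pituitary example from the JSS paper (illustrative).
ux, fitted = secondary_ties_fit(
    [8, 8, 8, 10, 10, 10, 12, 12, 12, 14, 14],
    [21, 23.5, 23, 24, 21, 25, 21.5, 22, 19, 23.5, 25],
)
print(ux, fitted)
```

Because the collapsed x values are unique, the fitted (x, y) pairs define a proper function, which is what makes a downstream interpolating transform/predict well defined.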
I have no detailed knowledge and I think @NelleV or @agramfort might need to weigh in.
no opposition
I think it is a good idea to have different ways to deal with ties. Go for it!
OK, here are some thoughts; want to make sure that my particular use cases are not leading me astray.
@NelleV or @agramfort, if you have any thoughts about the question above, I would have some time to put towards a fix this week.
no bandwidth to look into it :(
I just ran those 3 tests against 0.15.2 and they also fail there: the first two tests (…)

https://gist.github.com/ogrisel/676f7d582600036efa60

So it seems that the … The last test, namely …
Note that the …
Wait, when I last looked at it it was the other way around ... ?!
Oh no, you are right, transform was the problem.
I also want to highlight again issue #2507, which can cause infinite loops:
@ogrisel , the discussion here was unfortunately split between issue #4184 and this PR. You can see that we had confirmed the failures in 0.15.2 as well in #4184; in other words, this is a regression that goes back some way. While it's been a few weeks since I spent time thinking about it, I believe my comment here is the best synopsis of the ways forward: #4185 (comment)
Thanks.
This does not explain the all-zeros output I get when running …
@ogrisel , agreed on the zeros. My line of inquiry led me to question what we mean by "expected" results in general. Our implementation is only a very narrow way of looking at the problem and may be too naive. These unit tests were meant to dock us to the corresponding R package released by the publication authors. For example, why is "slinear" the default and not a piecewise-constant interpolant? Also, should …
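To make the interpolant question concrete, here is a small scipy comparison (`kind="previous"` stands in for a piecewise-constant choice; the knot values are the ones quoted later in this thread):

```python
import numpy as np
from scipy.interpolate import interp1d

# Fitted isotonic knots with unique x values.
x = np.array([8.0, 10.0, 12.0, 14.0])
y = np.array([21.0, 22.375, 22.375, 23.5])

f_slinear = interp1d(x, y, kind="slinear")  # first-order spline
f_step = interp1d(x, y, kind="previous")    # piecewise-constant alternative

# Between knots the two choices disagree: slinear blends the neighboring
# values, while the step interpolant holds the previous knot value.
print(float(f_slinear(9.0)), float(f_step(9.0)))
```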
Here is a notebook that illustrates everything that goes horribly wrong in interp1d:
I would be ok with having any valid tie-breaking strategy, but that doesn't seem to be reasonable.
With just
x = [ 8., 10., 12., 14.]
y = [ 21., 22.375, 22.375, 23.5]
as input, everything looks good FYI.
Proposed solution: report / ask the scipy people, or implement our own duplicate-removal strategy. @mjbommar I agree there is much room for different strategies. I think the sklearn developers don't have many use cases outside of calibration, where duplicate values are basically non-existent.
@mjbommar did you already put any thought into how to implement the tie-breaking? |
@amueller , my perspective was that we should try to dock our approach to the publication authors'. While they implement three approaches in their CRAN package, we could simply pick one; some have a means of breaking ties in the input sample. That said, my point above about the difference between …
I agree about using one of their approaches, I was more asking if you looked into implementing any of them. I'll have a closer look at the paper now... |
I think we need to implement the secondary approach. With the primary approach, we cannot get …
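A tiny illustration of the problem with "primary" (the data here is invented): tied x values may keep distinct fitted values, so the fitted pairs do not define a function of x that transform/predict could evaluate.

```python
import numpy as np

# Hypothetical "primary"-style fit: ties keep distinct fitted values.
x_fit = np.array([8.0, 8.0, 8.0, 10.0])
y_fit = np.array([21.0, 23.0, 23.5, 24.0])

# Three different fitted values at x = 8, so no interpolant through all of
# these points can be a function of x; any choice at x = 8 is arbitrary.
values_at_8 = y_fit[x_fit == 8.0]
print(len(np.unique(values_at_8)))
```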
Ugh, it looks like there is also an(other) issue with sample_weights. It looks like they are not reordered in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/isotonic.py#L253 ...
@mjbommar what do you think about secondary tie-breaking as the default?
Another question: What does it mean to predict on new data using the primary method? What would you predict for a point with ties?
@amueller , I think the choice of tie-breaking is tied to whether we support … Since the client uncovered this issue, my first approach had been to replace the … It sounds like we are seeing the same issues now :)
Ok. The secondary strategy actually takes the mean and supports predict. I think that is a reasonable choice. |
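A sketch of what the secondary strategy buys us, assuming the estimator averages y over tied x before fitting (as current scikit-learn releases do); the data here is made up:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Tied x values; averaging them ("secondary"-style) makes predict well
# defined: each query point maps to exactly one value.
x = np.array([8.0, 8.0, 8.0, 10.0, 10.0, 12.0, 14.0])
y = np.array([21.0, 23.5, 23.0, 24.0, 21.0, 22.0, 23.5])

ir = IsotonicRegression(out_of_bounds="clip")
ir.fit(x, y)

# One prediction per query point, including at the tied x = 8.
pred = ir.predict(np.array([8.0, 9.0, 14.0]))
print(pred)
```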
@amueller , yes, perfect. I would still be happy to expand the methods to support other interpolants and primary/tertiary from …
I agree, it would be nice to have, and you are welcome to work on it. |
@mjbommar please avoid merging master into the PR branch; instead, squash old uninformative commits to clean up the history and rebase on top of the current master.
OK, no problem.
Unit test to highlight regression in issue #4184