
Replace boston in ensemble test_forest #16927

Merged
merged 11 commits into scikit-learn:master from lucyleeow:test_forest on May 20, 2020

Conversation

@lucyleeow (Member) commented Apr 15, 2020

Reference Issues/PRs

Towards #16155

What does this implement/fix? Explain your changes.

Replaces the boston dataset with the diabetes dataset in sklearn/ensemble/tests/test_forest.py

Any other comments?

The R2 and OOB scores in test_oob_score_regressors are poor with the diabetes dataset (which was picked by @adrinjalali in a previous PR). Happy to change to California housing/another dataset if this is a problem.

@lucyleeow (Member Author)

I think the failures are due to the use of the california dataset; this is a message from one of the failures:

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\VssAdministrator\scikit_learn_data
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\VssAdministrator\scikit_learn_data
WARNING: Failed to generate report: No data to report.

Link to line: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=15400&view=logs&j=d32b16b6-cb9d-571b-e765-de83708fb8dd&t=b93f76c1-c2c9-579e-c2ec-c4f438af1261&l=34

But other tests use the california dataset as well, so I don't understand the cause of the failure...

@lucyleeow lucyleeow changed the title Replace boston in ensemble test_forest WIP Replace boston in ensemble test_forest Apr 15, 2020
@cmarmo (Member) commented Apr 15, 2020

Hi @lucyleeow, the failing tests seem to be related to pytest-dev/pytest#6925, as only the CI jobs with pytest 5.4.1 are failing.
@thomasjpfan, I think you are quite familiar with dependency issues... do you have any suggestions? Thanks a lot!

@lucyleeow (Member Author)

It might be because I didn't add @pytest.mark.network; @thomasjpfan mentioned this in a different PR. I will rework this to use the diabetes dataset, and if we aren't happy with the score in test_oob_score_regressors I can amend just that test to use California.
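For reference, a minimal sketch of what that marker looks like on a test that downloads data; the test name and body here are illustrative, not from the PR:

import pytest
from sklearn.datasets import fetch_california_housing

@pytest.mark.network  # lets CI skip or special-case tests that need network access
def test_forest_california():  # hypothetical test name
    X, y = fetch_california_housing(return_X_y=True)
    assert X.shape[0] == y.shape[0]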

@lucyleeow lucyleeow changed the title WIP Replace boston in ensemble test_forest Replace boston in ensemble test_forest Apr 16, 2020
@thomasjpfan (Member)

While reviewing your PRs, I noticed this dataset looks to have pretty poor results on test sets compared to the training set, which means the default parameters are overfitting:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)
# Compare train and test R2 across CV folds to gauge overfitting
results = cross_validate(RandomForestRegressor(random_state=0), X, y,
                         return_train_score=True)

print("train score", results['train_score'].mean())
print("test score", results['test_score'].mean())
# train score 0.9210138461843774
# test score 0.4230661480472566

@lucyleeow (Member Author) commented Apr 16, 2020

Oh wow, that's a big difference. What kinds of tests should I be careful about? I might not be good at assessing this; e.g., check_diabetes_criterion doesn't split the data. I think it checks whether the score is comparable for max_features=None and max_features=6 and the criteria "mse", "mae" and "friedman_mse". Will the overfitting make the dataset unsuitable for this?

Edit: should I tune parameters and use the tuned parameters for all the tests?

@ogrisel (Member) commented Apr 21, 2020

I would try to use min_samples_leaf=10 or more to limit overfitting.
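For illustration, a minimal sketch of that suggestion applied to the cross_validate snippet above; the scores it prints are not from the PR:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)
# Larger leaves mean shallower effective trees and less overfitting,
# so the train/test gap should shrink
results = cross_validate(RandomForestRegressor(min_samples_leaf=10, random_state=0),
                         X, y, return_train_score=True)

print("train score", results['train_score'].mean())
print("test score", results['test_score'].mean())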

@ogrisel (Member) commented Apr 21, 2020

Hum, maybe diabetes is too hard a dataset to expect good generalization accuracy from the RandomForestRegressor model. Maybe you should try an easier dataset for which it is possible to get higher test scores.

@ogrisel (Member) commented Apr 21, 2020

Or maybe this is good enough for such tests. It's still significantly better than random.

@ogrisel (Member) commented Apr 22, 2020

Or feel free to use make_regression to generate an easy regression problem for the tests as discussed in #16918.

@lucyleeow (Member Author) commented Apr 22, 2020

I tried using a make_regression dataset, but for test_oob_score_regressors the assertion
assert test_score > est.oob_score_ doesn't pass.

I tried make_regression(n_samples=500, n_features=10) with a range of values for n_informative and noise.

Is this a problem with the way the dataset is generated? test_score should always (?) be higher than est.oob_score_, since oob_score_ uses unseen data.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# make_regression caps n_informative at n_features, so this uses 10
X_reg, y_reg = make_regression(n_samples=500, n_features=10, n_informative=100,
                               random_state=1)

est = RandomForestRegressor(oob_score=True, random_state=0,
                            n_estimators=50, bootstrap=True)
# Fit on the first half, score on the held-out second half
n_samples = X_reg.shape[0]
est.fit(X_reg[:n_samples // 2, :], y_reg[:n_samples // 2])
test_score = est.score(X_reg[n_samples // 2:, :], y_reg[n_samples // 2:])

print(test_score)
print(est.oob_score_)

Gives:

0.7351186379920649
0.7425675493740419

Regardless, happy to keep diabetes as well.

@glemaitre (Member)

I think one expects the OOB score to be really close to the score that you will obtain on the test set. So for this test, I would take the diff between the OOB score and the test score and check that it is smaller than 0.05, for instance.

I am really surprised by the diabetes results indeed.

@@ -389,7 +389,7 @@ def check_oob_score(name, X, y, n_estimators=20):
assert abs(test_score - est.oob_score_) < 0.1

Indeed, we are already doing this for the classification. I think it makes sense to do the same for the regression. We only need a comment mentioning that in the first case this is a diff between accuracies, and in the second one a diff between R2 scores.

@lucyleeow (Member Author)

I think one expects the OOB score to be really close to the score that you will obtain on the test set.

I read the code wrong and thought we were fitting and testing on the same data, which is why I thought oob_score_ would always be worse. It makes sense now. Will fix, thanks @glemaitre
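A minimal sketch of the amended check, reusing the make_regression snippet above and the 0.1 threshold from the diff; variable names are illustrative:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_reg, y_reg = make_regression(n_samples=500, n_features=10, random_state=1)
n_samples = X_reg.shape[0]

est = RandomForestRegressor(oob_score=True, random_state=0,
                            n_estimators=50, bootstrap=True)
est.fit(X_reg[:n_samples // 2], y_reg[:n_samples // 2])
test_score = est.score(X_reg[n_samples // 2:], y_reg[n_samples // 2:])

# Both are R2 estimates on data unseen by each tree during fitting,
# so they should agree closely rather than being strictly ordered
assert abs(test_score - est.oob_score_) < 0.1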

@glemaitre glemaitre added this to REVIEWED AND WAITING FOR CHANGES in Guillaume's pet May 19, 2020
@lucyleeow (Member Author)

Thanks @glemaitre. I amended all tests to use the generated regression dataset.

@glemaitre glemaitre merged commit 5dcdd89 into scikit-learn:master May 20, 2020
@glemaitre (Member)

Thanks @lucyleeow

@lucyleeow lucyleeow deleted the test_forest branch May 20, 2020 11:05
@glemaitre glemaitre moved this from REVIEWED AND WAITING FOR CHANGES to MERGED in Guillaume's pet May 20, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020