fixing pandas 1.6.0 and 2.0.0 breaking changes #580

Merged: 4 commits into main from fix_nightly on Nov 2, 2022

MarcAntoineSchmidtQC

In the development version of pandas, get_dummies generates boolean data. When that data is used in arithmetic operations, it ends up as a numpy array with dtype object.

Glum was actually not the problem. When a pandas dataframe is used as input, tabmat takes care to cast everything to the right dtype. The problem was in the tests themselves, which called .values on pandas dataframes to access the underlying numpy array.
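The failure mode described above can be reproduced without depending on a particular pandas version. The sketch below is only an illustration: the frame and its column names are hypothetical, built by hand to mimic a get_dummies result that mixes a numeric column with boolean indicator columns. It is not taken from the glum test suite.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking get_dummies output with boolean dummies:
# one numeric column plus boolean indicator columns.
df = pd.DataFrame(
    {
        "x": [0.5, 1.5, 2.5],
        "cat_a": [True, False, True],
        "cat_b": [False, True, False],
    }
)

# pandas does not treat bool and float as sharing a numeric dtype,
# so the underlying array falls back to object.
raw = df.values
print(raw.dtype)

# Requesting float explicitly casts True/False to 1.0/0.0.
X = df.to_numpy(dtype=float)
print(X.dtype)
```

Passing the dtype at the point of conversion keeps the test independent of whatever dtype the dummy columns happen to have.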

Coincidentally, 2 days ago pandas released the 2.0.0 development version. Since we started seeing the problem before that, it seems to be happening with 1.6.0 and 2.0.0.

This PR fixes #575.

TODO: remove daily CI on push

Checklist

  • Added a CHANGELOG.rst entry

@jtilly

jtilly commented Oct 20, 2022

The get_dummies dtype change was introduced in 1.6.0 and classified as bugfix, see pandas-dev/pandas#48022 (review).

@@ -1827,7 +1827,7 @@ def test_dataframe_std_errors(regression_data, categorical, split, fit_intercept
         alpha=0, family="normal", fit_intercept=fit_intercept
     )
     mdl.fit(X=X, y=y)
-    X_sm = pd.get_dummies(X)
+    X_sm = pd.get_dummies(X).to_numpy(dtype=float)
     if fit_intercept:

get_dummies also accepts a dtype argument.
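As a rough sketch of that alternative, the dtype can be requested when the dummies are created rather than at conversion time. The frame below is a hypothetical stand-in for the test's X, not the actual fixture.

```python
import pandas as pd

# Hypothetical input: one numeric column and one categorical column.
X = pd.DataFrame(
    {
        "x": [0.5, 1.5, 2.5],
        "cat": pd.Categorical(["a", "b", "a"]),
    }
)

# dtype=float makes the generated indicator columns float64
# instead of relying on the default dummy dtype.
X_sm = pd.get_dummies(X, dtype=float)
print(X_sm.dtypes)

# With every column float64, the plain conversion is numeric as well.
print(X_sm.to_numpy().dtype)
```

Note that dtype here only governs the generated dummy columns. A non-float numeric column in the input keeps its own dtype, so to_numpy(dtype=float) remains the more defensive choice when the input columns are not known in advance.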

@bashtage

I agree that this change is problematic. It caused the canary build of statsmodels to start failing as well. Unfortunately, I think this change is going to cause a lot of problems, and rolling it back should be seriously considered. np.asarray(pd.get_dummies(df)) returning an object array is a very hard problem to debug. Even pd.get_dummies(df).to_numpy() produces a dtype that is not useful.

@bashtage

Oops. I should have made most of my case on the pandas tracker.

MarcAntoineSchmidtQC merged commit c35e291 into main on Nov 2, 2022
MarcAntoineSchmidtQC deleted the fix_nightly branch on November 2, 2022 17:35
Development

Successfully merging this pull request may close these issues.

Daily run failure: Unit tests
3 participants