
formula support for categorical endog variable in logistic regression #146

Open
stevenlis opened this issue May 16, 2019 · 1 comment


stevenlis commented May 16, 2019

patsy: '0.5.1'

statsmodels/statsmodels#5552

SM: 0.9.0
For a categorical endog variable in logistic regression, I still have to generate a dummy variable for it, like the following.

import pandas as pd
import seaborn as sns
import numpy as np
import statsmodels.formula.api as smf
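# load the data (assumption: seaborn's "tips" dataset, which has 244 rows
# and the sex, smoker and time columns used here)
df = sns.load_dataset('tips')
# generate dummy; astype(int) keeps it numeric instead of categorical
df['male'] = df.sex.map({'Male': 1, 'Female': 0}).astype(int)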
# regression
formula = 'male ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()

If I just do

formula = 'C(sex) ~ C(smoker) + C(time)'
model = smf.logit(formula, data=df).fit()
model.summary()

I will get

ValueError: operands could not be broadcast together with shapes (244,2) (244,) 

This is a little bit weird, since the formula supports categorical variables everywhere except the endog. I wonder if this could be a potential feature to improve.
By the way, is there any current workaround for this issue if I want to use a formula?
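
For readers who want to see where the `(244, 2)` shape comes from, here is a minimal sketch that uses patsy directly. The small data frame is made up for illustration and is not the data from the issue. As far as I understand patsy's behavior, a categorical term on the left-hand side is expanded into one indicator column per level, while `logit` expects a single response column.

```python
import pandas as pd
from patsy import dmatrices

# Made-up data standing in for the data frame in the issue.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# Build the response (y) and design (X) matrices the same way a
# formula-based model would.
y, X = dmatrices("C(sex) ~ C(smoker)", data=df)

# The response has one indicator column per level of sex, not one column.
print(y.shape)
print(y.design_info.column_names)
```

The two-column response is what fails to broadcast against the one-dimensional prediction inside the model.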

@bashtage:

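This is a patsy limitation. You could just define a function C1:

def C1(cat):
    # dtype=int keeps the dummy numeric; newer pandas versions return
    # booleans by default, which patsy would treat as categorical again
    return pd.get_dummies(cat, drop_first=True, dtype=int)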

and then use

formula = 'C1(sex) ~ C(smoker) + C(time)'
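
A quick way to check this workaround without fitting a model is to build the matrices with patsy and look at the response shape. This is only a sketch under assumptions: the data frame is made up, and `C1` passes `dtype=int` so that the indicator stays numeric on pandas versions where `get_dummies` returns booleans.

```python
import pandas as pd
from patsy import dmatrices

def C1(cat):
    # Drop the first level so a two-level variable yields one 0/1 column.
    return pd.get_dummies(cat, drop_first=True, dtype=int)

# Made-up data standing in for the data frame in the issue.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# patsy looks up C1 in the calling scope, so it can be used in the formula.
y, X = dmatrices("C1(sex) ~ C(smoker)", data=df)

# A single response column: 1 for "Male", 0 for "Female"
# ("Female" sorts first and is the level that gets dropped).
print(y.shape)
print(y[:, 0])
```

The same formula string can then be passed to `smf.logit` as in the snippet above.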


stevenlis commented May 16, 2019

I also wish patsy offered the ability to specify the coding. For example, if I have a variable with two categories, Yes and No, I may want to code Yes as 0 and No as 1, or Yes as 1 and No as 0. I don't see any way to control this in pandas unless I review the coding returned by pandas and choose which one to use. I think it would be much easier if we could specify it the same way we specify a baseline for a categorical variable in a patsy formula.
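
One possible way to choose which level is coded as 1 is to name that level explicitly in the formula. The sketch below uses a hypothetical helper, `is_level`, which is not part of patsy or pandas, together with a made-up data frame. It relies only on patsy evaluating ordinary Python expressions found in the calling scope.

```python
import pandas as pd
from patsy import dmatrices

def is_level(values, level):
    """Hypothetical helper: 1 where values equals level, 0 elsewhere."""
    return (pd.Series(values) == level).astype(int)

# Made-up data with a Yes/No variable to use as the response.
df = pd.DataFrame({
    "answer": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "smoker": ["Yes", "No", "No", "Yes", "No", "Yes"],
})

# Code "Yes" as 1 ...
y_yes, _ = dmatrices("is_level(answer, 'Yes') ~ C(smoker)", data=df)
# ... or code "No" as 1, by naming the other level.
y_no, _ = dmatrices("is_level(answer, 'No') ~ C(smoker)", data=df)

print(y_yes[:, 0])
print(y_no[:, 0])
```

This keeps the choice of coding visible in the formula itself, much like naming a reference level for a right-hand-side categorical.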
