Wrote a convenience function for getting variable names from formula #171

Open
compumetrika opened this issue May 15, 2021 · 1 comment

compumetrika commented May 15, 2021

I am using patsy as a key dependency in a stats project, and I found myself needing to identify which variables are categorical after constructing a dataframe using patsy formulas.

After an attempt using regexps ("...now you have two problems..."), I read "Model specification for experts and computers" a few times and spent a lot of time poking around in X.design_info (where y, X = dmatrices(formula, data, return_type='dataframe')). Thankfully, I ended up with something much shorter and more robust than my regexp attempt.

I have two questions:

  1. I'm still not sure whether I've used the internal details of X.design_info correctly. The function does what I want, but in several places multiple attributes provide the same information. I'd love to have someone "in the know" look at the function and tell me whether I should make a different choice. Is there a way to do this? (The function is ~60 lines counting comments, about 30 without.)

  2. Is there any interest in having something like this contributed back to the project? I've already commented and unit tested the function, and I'm happy to make sure the final comments/tests conform to your norms and standards. I skimmed the issues before posting, and it appears that, for example, issue #155 (patsy equivalent of R's all.vars) might benefit from my function (not exactly the same, but perhaps close enough).
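
To make the "multiple things provide the same info" question concrete, here is a minimal sketch of how the pieces of design_info appear to line up: terms own the columns, while factor_infos holds the type of each individual factor. It assumes a patsy version where DesignInfo exposes terms, term_slices, column_names and factor_infos; the helper name describe_terms and the toy data are made up for illustration and are not part of patsy.

```python
import pandas as pd
from patsy import dmatrices

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "a": ["a1", "a2", "a3", "a1", "a1"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "z": [0.5, 1.5, 0.5, 2.5, 0.5],
    "y": [5.3, 5.2, 5.25, 5.7, 5.9],
})
y, X = dmatrices("y ~ x*z + C(a)", data=df, return_type="dataframe")
info = X.design_info


def describe_terms(design_info):
    """Hypothetical helper: map each term name to its columns and factor types."""
    described = {}
    for term in design_info.terms:
        # term_slices maps each Term to the slice of columns it produced.
        columns = design_info.column_names[design_info.term_slices[term]]
        # A term is a product of zero or more factors; each factor has a type.
        factor_types = [design_info.factor_infos[f].type for f in term.factors]
        described[term.name()] = {"columns": columns, "factor_types": factor_types}
    return described


summary = describe_terms(info)
for name, details in summary.items():
    print(name, details)
```

Read this way, an interaction such as x:z is a term built from two factors rather than a factor of its own, and the intercept is a term with no factors at all.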

@compumetrika
Author

Here's the function, in case anyone finds it useful:

import warnings

# Classify the design-matrix columns as numerical or categorical:
def get_variable_types_names(X):
    # Given a RHS dataset X produced by dmatrices:
    #    y, X=dmatrices(formula, data, return_type='dataframe')
    # extract the "formula names" for all variables,
    # e.g. "C(variable)",
    # and for each formula name, extract whether 
    # it is numerical or categorical. If categorical, 
    # extract all actual variable names and save them.
    # 
    # Return a dict that has the structure:
    #     
    # variable_types_names = {'numerical': [<list of numerical variable names>],
    #                         'categorical': {<dict of categorical names, where
    #                                         the keys are the formula name of
    #                                         each categorical factor and the
    #                                         values are the corresponding
    #                                         column names from list(X)>}}

    remaining_X_names = list(X)  # eventually check which of these are not selected
    if 'Intercept' in remaining_X_names:
        remaining_X_names.remove('Intercept')
    variable_types_names = {}

    # First get all formula names:
    all_formula_evals = list(X.design_info.factor_infos)
    all_formula_names = [key.name() for key in all_formula_evals]  # human-readable strings

        # Note that this appears to exclude the interaction variables, but not function-ified 
        # variables. Not sure why, will need to examine

    # Now loop over all formula_names and extract:
    #     - type (numerical, categorical, or something else)
    #     - if categorical, the sub-categories need to be extracted as a list
    #     - otherwise if numerical, the sub-category is the same as the name
    for var_eval_key, var_name in zip(all_formula_evals, all_formula_names):

        # Extract the type; if haven't set this type up in
        # variable_types_names yet, set it up:
        vartype = X.design_info.factor_infos[var_eval_key].type
        if vartype not in variable_types_names.keys():
            if vartype == 'categorical':
                variable_types_names[vartype] = {}  # For each key need to save sub-types
            else:
                variable_types_names[vartype] = []

        # If type is 'categorical', grab all sub-types:
        if vartype == 'categorical':
            slicer = X.design_info.slice(var_name)
            temp_varnames = X.design_info.column_names[slicer]
            variable_types_names[vartype][var_name] = temp_varnames
        else:
            variable_types_names[vartype].append(var_name)
            temp_varnames = [var_name]

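        # Finally, remove the columns that have been saved. A numerical factor
        # that appears only inside an interaction has no column of its own, so
        # check membership before removing to avoid a ValueError.
        for var in temp_varnames:
            if var in remaining_X_names:
                remaining_X_names.remove(var)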

    # At the end of the loop, assign all remaining to ...numerical, if exists:
    for var in remaining_X_names:
        slicer = X.design_info.slice(var)
        temp_varnames = X.design_info.column_names[slicer]
        if len(temp_varnames) > 1:
            warnings.warn("There is more than 1 sub-category for variable "
                          + str(var) + ": " + str(temp_varnames))
        # now add to numerical, I suppose...
        # TODO: figure out better check and etc...
        variable_types_names.setdefault('numerical', []).extend(temp_varnames)

    return variable_types_names




import numpy as np
import pandas as pd
from patsy import dmatrices


def test_get_variable_types_names():
    '''
    Build a test dataset and confirm that get_variable_types_names
    does what we want.
    '''

    # Set up some data to have categorical, numerical, and interaction
    data = {'a':['a1','a2','a3','a1','a1'],
            'x':[1,2,3,4,5],
            'z':[0.5,0.5,0.5,0.5,0.5],
            'y':[5.3,5.2,5.25,5.7,5.9]}
    df = pd.DataFrame(data)
    
    # Construct the y, X values:
    formula = 'y ~ x*z + C(a) + np.sqrt(x)'
    y, X = dmatrices(formula, data=df, return_type='dataframe')

    # Get the variable names:
    variable_types_names = get_variable_types_names(X)
    
    # Compare to what we *should* have:
    expected_var_types_names = {'numerical':['x','z','x:z', 'np.sqrt(x)'],
                                'categorical':{'C(a)':['C(a)[T.a2]',
                                                       'C(a)[T.a3]']}} 
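    # Now test that they are the same. The numerical list is compared
    # order-insensitively; the categorical dict is compared in full
    # (keys and column-name lists), not just by its keys.
    all_equal = set(expected_var_types_names) == set(variable_types_names)
    all_equal &= (set(expected_var_types_names['numerical'])
                  == set(variable_types_names.get('numerical', [])))
    all_equal &= (expected_var_types_names['categorical']
                  == variable_types_names.get('categorical', {}))
    assert all_equal, (variable_types_names, expected_var_types_names)

    return all_equal, variable_types_names, expected_var_types_names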
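
For anyone who wants a per-column lookup rather than the nested dict, the result can be flattened afterwards. This is only a sketch: columns_by_type is a hypothetical helper name, and the example dict below is simply the expected value from the test above, written out by hand so the snippet runs without patsy.

```python
def columns_by_type(variable_types_names):
    """Hypothetical helper: flatten the nested result into {column name: type}."""
    column_types = {}
    for name in variable_types_names.get("numerical", []):
        column_types[name] = "numerical"
    # Each categorical formula name maps to the list of columns it expands into.
    for columns in variable_types_names.get("categorical", {}).values():
        for name in columns:
            column_types[name] = "categorical"
    return column_types


# Same structure as expected_var_types_names in the test above.
example = {
    "numerical": ["x", "z", "x:z", "np.sqrt(x)"],
    "categorical": {"C(a)": ["C(a)[T.a2]", "C(a)[T.a3]"]},
}
column_types = columns_by_type(example)
categorical_columns = [c for c, t in column_types.items() if t == "categorical"]
print(categorical_columns)
```

The keys of the flattened dict match the column names in list(X) (minus the intercept), so it can be used directly to select the dummy columns from the dataframe.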
