Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to get dmatrix to drop all-zero columns? #161

Open
DWesl opened this issue Aug 30, 2020 · 1 comment
Open

Is there a way to get dmatrix to drop all-zero columns? #161

DWesl opened this issue Aug 30, 2020 · 1 comment

Comments

@DWesl
Copy link

DWesl commented Aug 30, 2020

I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for statsmodels. I included a simplified version below.

import numpy as np
import numpy.linalg as la
import pandas as pd
import patsy

index_vals = tuple("abc")
level_names = list("ABD")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print(
        "Approximate condition number: {0:.2g}".format(
            np.divide(*la.svd(design_matrix)[1][[0, -1]])
        )
    )


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [index_vals] * len(level_names) + [range(n_samples)],
        names=level_names + ["sample"],
    ),
    columns=["y"],
    data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
).reset_index()

print("All sampled")
simple_X = patsy.dmatrices("y ~ (A + B + D) ** 3", ds_simple)[1]
describe_design_matrix(simple_X)

print("Only some sampled")
simple_X = patsy.dmatrices(
    "y ~ (A + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)

print("Reduced X")
simple_X = patsy.dmatrices(
    "y ~ (A + B + D) ** 3",
    ds_simple.query("A != 'a' or B == 'a'"),
    return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
    :, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]
]
describe_design_matrix(reduced_X)

print("Only some sampled: alternate method")
simple_X = patsy.dmatrices(
    "y ~ (C(A, Treatment('b')) + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Number of nonzero elements:", (simple_X != 0).sum(axis=0))
print("Number of all-zero columns:", np.count_nonzero((simple_X != 0).sum(axis=0) == 0))

print("Reduced X: alternate method")
simple_X = patsy.dmatrices(
    "y ~ (C(A, Treatment('b')) + B + D) ** 3",
    ds_simple.query("A != 'a' or B == 'a'"),
    return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
    :,
    [
        col
        for col in simple_X.columns
        if not col.startswith("C(A, Treatment('b'))[T.a]:B")
    ],
]
describe_design_matrix(reduced_X)

produces as output

All sampled
Shape: (54, 27)
Rank:  27
Approximate condition number: 52
Only some sampled
Shape: (42, 27)
Rank:  21
Approximate condition number: 3.8e+16
Reduced X
Shape: (42, 21)
Rank:  21
Approximate condition number: 37
Only some sampled: alternate method
Shape: (42, 27)
Rank:  21
Approximate condition number: 3.4e+16
Number of nonzero elements: [42  6 18 12 12 14 14  0  6  0  6  2  6  2  6  4  4  4  4  0  2  0  2  0
  2  0  2]
Number of all-zero columns: 6
Reduced X: alternate method
Shape: (42, 21)
Rank:  21
Approximate condition number: 39

I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linear dependent on others". Since some statsmodels functions require the formula information from patsy.DesignInfo objects, I wanted to see what could be done within patsy.

matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.

@DWesl
Copy link
Author

DWesl commented Aug 30, 2020

Is there a way to get a stateful transform to operate on a categorical interaction term (e.g. drop_unused_interactions(A:B), drop_unused_interactions((A + B) ** 2), or drop_all_zero_columns((C(A, Treatment('b')) + B) ** 2))? That might be the easiest way to accomplish this if such a thing is possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant