Builtins for numpy recarrays #142

Open
icam0 opened this issue Feb 4, 2019 · 4 comments

Comments

@icam0
icam0 commented Feb 4, 2019

When using the patsy builtin identity function I() to add two features, the numpy recarray raises an error, while the pandas equivalent runs without a problem. Code to reproduce the error:

from patsy import dmatrix
import numpy as np
import pandas as pd
recarray = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
df = pd.DataFrame.from_records(recarray)
result_df = dmatrix("I(x+y)-1",df)
result_rec = dmatrix("I(x+y)-1",recarray)

python 3.6.5
patsy 0.5.0
pandas 0.23.4
numpy 1.14.1

traceback:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
ValueError: no field of name I
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/daviddejong/PycharmProjects/quick_test/patsy_test.py", line 7, in <module>
    result_rec = dmatrix("I(x+y)-1",recarray)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 291, in dmatrix
    NA_action, return_type)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
    I(x+y)-1
    ^^^^^^
@njsmith
Member

njsmith commented Feb 4, 2019

Looks like the issue is that recarray objects don't actually follow the standard Python "mapping" interface, which is what patsy expects from the data object – in particular, trying to access an undefined field name raises ValueError, instead of KeyError, so patsy can't tell whether I is supposed to be a field name or what.
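A minimal sketch of the failure mode described above, using only the standard library. `ChainedLookup` is a simplified stand-in for a namespace lookup that tries the data object first and falls back to the builtins; it is not patsy's actual code. `FieldsLikeStructuredArray` is a hypothetical container that copies the reported behaviour of raising ValueError for an unknown field name.

```python
class ChainedLookup:
    """Simplified stand-in: try each namespace in order, skipping on KeyError."""

    def __init__(self, namespaces):
        self._namespaces = namespaces

    def __getitem__(self, key):
        for namespace in self._namespaces:
            try:
                return namespace[key]
            except KeyError:
                # Only KeyError means "not here, try the next namespace".
                continue
        raise KeyError(key)


class FieldsLikeStructuredArray:
    """Hypothetical container: unknown names raise ValueError, not KeyError."""

    def __init__(self, fields):
        self._fields = fields

    def __getitem__(self, key):
        if key not in self._fields:
            raise ValueError("no field of name %s" % key)
        return self._fields[key]


def identity(value):
    return value


# Stand-in for the namespace that provides builtins such as I().
builtins_namespace = {"I": identity}


def describe_lookup(data, name):
    """Report how looking up `name` ends when `data` is searched first."""
    try:
        ChainedLookup([data, builtins_namespace])[name]
        return "found"
    except KeyError:
        return "KeyError"
    except ValueError:
        return "ValueError"


# A dict raises KeyError for "I", so the lookup falls through to the builtins.
dict_result = describe_lookup({"x": [1.0, 3.0]}, "I")

# The ValueError escapes the loop before the builtins are ever consulted.
array_like_result = describe_lookup(FieldsLikeStructuredArray({"x": [1.0, 3.0]}), "I")

print(dict_result, array_like_result)
```

With a plain dict the name resolves to the builtin; with the ValueError-raising container the lookup aborts at the data object, which matches the "no field of name I" error in the traceback.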

Is this a problem? Pandas dataframes are better than recarrays in pretty much every way...?

@icam0
Author

icam0 commented Feb 4, 2019

Apparently, it's raising a ValueError to maintain backwards compatibility as explained here. Maybe it's worthwhile to check the dtype before catching the exception. Also, the example is a structured array, not a record array. I think it's worthwhile to add this behaviour to make sure it works on numpy structured/record arrays because a lot of functionality fails otherwise. I raised this issue because the documentation states:

You may prefer to store your data in a pandas DataFrame, or a numpy record array… whatever makes you happy.

Personally, I was using this in an environment where a (design) choice was made to use record arrays instead of pandas dataframes.
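For anyone stuck with structured or record arrays, here is a possible user-side workaround, sketched along the lines of the "check the dtype first" idea. `StructuredArrayMapping` is a hypothetical helper, not part of patsy or numpy. It checks `dtype.names` before indexing so that an unknown name raises KeyError, as a mapping would. Whether patsy accepts such a wrapper in every code path is an assumption that has not been verified here; converting to a plain dict of columns is the more conservative option.

```python
import numpy as np


class StructuredArrayMapping:
    """Hypothetical adapter: expose a structured array with mapping-style errors."""

    def __init__(self, array):
        if array.dtype.names is None:
            raise TypeError("expected a structured or record array")
        self._array = array

    def __getitem__(self, key):
        # Check the field names first, so unknown names raise KeyError
        # instead of the ValueError that the array itself raises.
        if key not in self._array.dtype.names:
            raise KeyError(key)
        return self._array[key]


recarray = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", float), ("y", int)])

# Option 1: wrap the array in the adapter.
data = StructuredArrayMapping(recarray)
total = data["x"] + data["y"]

# Option 2: convert to a plain dict of column arrays.
columns = {name: recarray[name] for name in recarray.dtype.names}

# Assumption (untested here): either object could then be passed as the data
# argument, e.g. dmatrix("I(x+y)-1", columns).
print(total, sorted(columns))
```

The dict conversion does not copy the data, since indexing a structured array by field name returns a view.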

@ForceBru

I'm facing the same issue with this code:

>>> import statsmodels.formula.api as smf
>>> import numpy as np
>>> x = np.linspace(0.001, 5, 200)
>>> y = (0.3 * x**3 + 1.2 * x**2 + 70/x**4) * 1.1 * np.exp(0.1)
>>> data = np.array([y, x], dtype=[('y', np.float64), ('x', np.float64)])
>>> model = smf.ols(formula='y ~ I(x**3) + I(x**2) + I(x**4)', data=data)
Traceback (most recent call last):
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
ValueError: no field of name I

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/base/model.py", line 170, in from_formula
    missing=missing)
  File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/formula/formulatools.py", line 67, in handle_formula_data
    NA_action=na_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
    y ~ I(x**3) + I(x**2) + I(x**4)
                            ^^^^^^^
>>>

The documentation for statsmodels.formula.api.ols explicitly says (emphasis mine):

data must define __getitem__ with the keys in the formula terms args and kwargs are passed on to the model instantiation. E.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame.

Yet in fact, structured arrays don't work, or not all features of the formula interface can be used with them, which is highly confusing.

@bashtage
Contributor

bashtage commented Feb 5, 2021

Just as an FYI, statsmodels no longer officially supports recarrays. Any remaining references are vestigial and should be removed.
