Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deferring Evaluation of Terms #149

Open
bashtage opened this issue Jun 12, 2019 · 2 comments
Open

Deferring Evaluation of Terms #149

bashtage opened this issue Jun 12, 2019 · 2 comments

Comments

@bashtage
Copy link
Contributor

I have written a function called AbsorbingLS that can absorb a large number (millions) of categorical variables or categorical interactions. It is implemented using a Frisch-Waugh-Lovell step where the categoricals are handled using scipy sparse matrices. I would like to add a formula interface. Suppose I have a function A() that indicates that a variable should be absorbed, is there any place to intervene in the formula parsing for a formula that looks like y ~ 1 + x + A(cat) + A(cat*x)?

I can use another syntax. In an instrumental variable regression I use the syntax y ~ 1 + x1 + x2 + [x3 ~ z1 + z2] which is used to determine the configuration of the 2 required regressions. This works fine since it is easy to parse the [] and then it is a couple of standard calls. This approach doesn't obviously work here since I must avoid creating any arrays. I could use a similar structure here, so something like y ~ 1 + x + {cat +cat*x} ({} for simplicity in parsing) which I would then need to find a good way to turn cat +cat*x into usable terms (w/o populating dense arrays).

Any suggestions on how I could write a formula where I could intercept it using something like the pseudocode

1. Patsy  parses to terms
2. I remove and terms that are absorbed, which are too large to express as dense arrays
3. Patsy parses the non-absorbed terms, which have reasonable sizes as dense matrices

Any suggestions are appreciated.

@njsmith
Copy link
Member

njsmith commented Jun 12, 2019

patsy has an extensible parser – see infix_parser.py and parse_formula.py. It generates an abstract syntax tree, and then there's an interpreter that executes the AST to generate term lists – see desc.py.

Both parts are extensible in principle – being able to support use cases like yours was included in the original design. parse_formula is driven by an operator table, and it even takes an argument to allow extra operators to be added. Evaluator is driven by a table too, and has some pieces included with the eye towards allowing syntax extensibility. And then you could take out the term lists + your special representations, and do whatever you want with them (e.g. run the term lists through patsy's matrix builder, while you handle the other parts in whatever way you want).

The parser currently uses the shunting-yard algorithm which is specialized for easily parsing infix arithmetic with precedence and parentheses, so it's easiest to extend by adding new operators, but it could potentially be extended to handle other things like your special [ grouping, or magic function calls like A(...).

But... none of this has ever actually been used, so it probably doesn't quite work. And unfortunately patsy has been essentially unmaintained for years now, so I'm not sure how you'd get there from here. If you have cash to burn I guess you could hire me...

@bashtage
Copy link
Contributor Author

Thanks. I'll see if I can figure out anything. Assuming I can't (which is reasonable), I can always use a half-way mixed interface with formula for the dense part and a list for the sparse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants