Introducing formulaic, a high-performance patsy "competitor" #157

Open
matthewwardrop opened this issue Jul 6, 2020 · 2 comments

@matthewwardrop
Collaborator

Greetings all,

Late last year I had the need to generate sparse model matrices from large pandas DataFrames (dense model matrices would not fit in memory for the dataset I was using). I originally set about trying to patch patsy, but the code was not set up to allow overriding individual methods, and since I felt it would be a didactic experience in any case, I decided to rewrite something like patsy from scratch. The result is Formulaic.

I wasn't expecting much more than the addition of sparse matrix support, but it seems I've also managed to improve the performance of model matrix generation, in many cases by orders of magnitude, and often even beating R. I'm in the process of writing up documentation, and there is some low-hanging fruit in terms of improvements, but I'd love to get some eyes on the project, and would welcome feedback.

@njsmith
Member

njsmith commented Jul 6, 2020

Hey, that's pretty cool! patsy has turned out to be pretty solid, so my lack of time to maintain it hasn't been a huge problem, but there definitely is a lot of potential for something better.

Random questions:

  • How compatible is the formula language with patsy's? How compatible is the API?
  • Do you have nice error messages on parse errors? (I just really like this patsy feature ;-))
  • Did you copy patsy's novel model matrix building algorithm that fixes some bugs in R?
  • Given this vs this, it seems incorporating your changes into patsy somehow might be a good way to get them out to users ... how viable do you think that would be?

@matthewwardrop
Collaborator Author

matthewwardrop commented Jul 6, 2020

Wow... I wasn't expecting so fast a reply! :).

> How compatible is the formula language with patsy's? How compatible is the API?
The formula languages should be compatible where they overlap. Formulaic added some extra syntax for quoting names and nested Python expressions; for example:

y ~ `field|with@funky+symbols` + {np.dot(`b`, `c|funky`)}

It also allows arbitrarily many ~, which abuses the notation a bit, but simplifies some use-cases.
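To make the quoting rules concrete, here is a minimal tokenizer sketch. It is not Formulaic's actual parser; the `TOKEN` pattern and the `tokenize` helper are hypothetical names, and the grammar is an assumption reduced to backtick-quoted names, brace-delimited Python expressions (without nested braces), plain identifiers, and a handful of operators. The point is only that operator characters inside backticks or braces never reach the operator rule.

```python
import re

# Hypothetical token grammar (an assumption, not Formulaic's real one).
TOKEN = re.compile(
    r"""
    \s*(?:
        (?P<quoted>`[^`]*`)                   # `name with|odd+symbols`
      | (?P<python>\{[^{}]*\})                # {python expression}, no nested braces
      | (?P<name>[A-Za-z_][A-Za-z0-9_.]*)     # plain identifier
      | (?P<op>[~+\-*:/()])                   # formula operators
    )
    """,
    re.VERBOSE,
)


def tokenize(formula):
    """Split a formula string into (kind, text) pairs."""
    formula = formula.strip()
    tokens = []
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {formula[pos]!r} at position {pos}"
            )
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


tokens = tokenize("y ~ `field|with@funky+symbols` + {np.dot(`b`, `c|funky`)}")
for kind, text in tokens:
    print(f"{kind:7s} {text}")
```

Under these assumptions the `+` inside the backticks stays part of a single quoted token, and a formula such as `a ~ b ~ c` simply yields two `~` operator tokens for a later parsing stage to interpret.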

Some features are not yet present, such as the specific implementations of various categorical and spline encoding strategies. At the moment that is less a framework limitation than those functions simply not being implemented: the framework supports everything those encoding algorithms need; I just haven't needed them in my use-case.

The API is fully object-oriented with hooks to allow users to override behaviour with their own subclasses of various pieces, and reuse the rest of the tooling as is. As such, the API is very different. With that said, I've toyed with the notion of creating some patsy shim wrappers that expose the same API as patsy for common operations.
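As a rough illustration of the "override one piece, reuse the rest" idea, here is a small sketch. The class names `DenseEncoder` and `SparseEncoder` and their methods are hypothetical and do not correspond to Formulaic's real classes; the example only assumes a one-hot encoding of a single categorical column.

```python
class DenseEncoder:
    """Hypothetical base class: one-hot encodes a categorical column densely."""

    def levels(self, values):
        # Discover the distinct category levels in a stable order.
        return sorted(set(values))

    def encode(self, values):
        # Shared pipeline: find levels, then delegate storage to build().
        index = {level: j for j, level in enumerate(self.levels(values))}
        return self.build(values, index)

    def build(self, values, index):
        # Dense storage: one full row of zeros and a single one per value.
        rows = []
        for value in values:
            row = [0] * len(index)
            row[index[value]] = 1
            rows.append(row)
        return rows


class SparseEncoder(DenseEncoder):
    """Overrides only the storage step; level discovery is reused as is."""

    def build(self, values, index):
        # Sparse storage: keep only the (row, column) positions of the ones.
        return {(i, index[value]): 1 for i, value in enumerate(values)}


data = ["a", "b", "a"]
dense = DenseEncoder().encode(data)
sparse = SparseEncoder().encode(data)
print(dense)
print(sparse)
```

The sparse variant stores one entry per row regardless of how many levels there are, which is the property that matters when a dense matrix would not fit in memory.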

> Do you have nice error messages on parse errors? (I just really like this patsy feature ;-))
No. I also like this feature of patsy, so I programmed the parser with all the context it needs to provide such messages; I just haven't gotten around to adding them yet.

> Did you copy patsy's novel model matrix building algorithm that fixes some bugs in R?
By this do you mean explicitly worrying about the vector-space spanned by the various terms? If so, yes. I loved the documentation you provided in patsy around this, and how it actually results in a sensible model matrix ❤️. You can see my implementation of these things here. With the abstractions provided elsewhere, it boils down to 20 lines of code. Love it!
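For readers unfamiliar with the idea, the bookkeeping can be sketched as follows. This is a simplified illustration, not the code from patsy or Formulaic: it assumes every factor is categorical, treats the intercept as the empty term, and uses hypothetical helper names (`subsets`, `plan_columns`). Each term is expanded into the reduced-rank pieces it could contribute (one per subset of its factors), and only the pieces not already spanned by earlier terms add columns.

```python
from itertools import combinations
from math import prod


def subsets(factors):
    """Yield every subset of the given factors as a frozenset."""
    items = sorted(factors)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def plan_columns(terms, levels):
    """Return (term, new_column_count) pairs for a list of terms.

    terms  -- tuples of factor names; () is the intercept
    levels -- number of levels of each (categorical) factor
    """
    spanned = set()  # reduced-rank pieces already covered by earlier terms
    plan = []
    for term in terms:
        needed = [s for s in subsets(term) if s not in spanned]
        spanned.update(needed)
        # A reduced-rank piece contributes prod(levels - 1) columns;
        # the empty piece (the intercept) contributes exactly one.
        n_cols = sum(prod(levels[f] - 1 for f in s) for s in needed)
        plan.append((tuple(sorted(term)), n_cols))
    return plan


levels = {"a": 3, "b": 4}
with_intercept = plan_columns([(), ("a",), ("a", "b")], levels)
without_intercept = plan_columns([("a",), ("b",)], levels)
print(with_intercept)
print(without_intercept)
```

With an intercept, `1 + a + a:b` adds 1, 2 and 9 columns, 12 in total, matching the 3 × 4 cells. Without one, `a` absorbs the intercept and gets all 3 columns while `b` gets 3, so the matrix stays full rank with 6 columns instead of 7.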

> Given this vs this, it seems incorporating your changes into patsy somehow might be a good way to get them out to users ... how viable do you think that would be?
That formulaic has even 20 downloads a day is likely an artifact of the mirrors pulling the packages! Before today I hadn't even mentioned it to anyone outside of work. It might be possible to add these features to patsy, but I don't think I could do it without hugely invasive patches or lots of ugly if/else statements scattered about the place. That's why I started from scratch, with a focus on making everything object-oriented and extensible. You'd have to implement your own encoding transforms, but formulaic doesn't even force you to use pandas.
