Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pairwise versions for rolling_cov, ewmcov and expanding_cov #4950

Merged
merged 5 commits into from
Mar 28, 2014

Conversation

snth
Copy link
Contributor

@snth snth commented Sep 23, 2013

I added the functions rolling_cov_pairwise(), expanding_cov_pairwise() and ewmcov_pairwise(). I also modified the behaviour of rolling_corr_pairwise(), expanding_corr_pairwise() and ewmcorr_pairwise() slightly so that now these can be computed between different DataFrames and the resulting matrices at each time slice will be rectangular rather than square.

I think this is what was asked for in the original discussion that gave rise to issue #1853.

Fixes #1853

@jtratner
Copy link
Contributor

It looks like these have nearly the same signature as the existing
functions. Would it make more sense to add a keyword argument
pairwise=False to the public api and then dispatch to the appropriate
functions based on that?

@snth
Copy link
Contributor Author

snth commented Sep 25, 2013

I agree with you. I provided these implementations under the *_pairwise names to fit in with the current API as described on the Computational Tools page.

While the behaviour of rolling_cov and rolling_corr for the Series/Series and DataFrame/Series seems perfectly fine, the DataFrame/DataFrame behaviour doesn't make much sense to me. I can't think of any scenarios where I would want to match on column names and then describe the result as a rolling_cov rather than a rolling_var. For me a covariance is calculation between two different Series so the name matching doesn't make much sense. The tabular layout of the *_pairwise functions makes it clear which two inputs were used to calculate the covariance or correlation.

However I clearly don't know where people are using this so I rather provided these pairwise implementations under different function names in line with the current rollin_corr_pairwise API and let you core devs decide how to integrate it into the API.

@snth
Copy link
Contributor Author

snth commented Sep 25, 2013

Ok, I've been thinking more about this and think that rather than taking guidance from my own biased preferences, a better approach is to try and minimise cognitive impedance in the API along the lines of the "Don't make me think" principle.

Now DataFrame already has cov and corr methods which compute pairwise covariances and correlations respectively. Presumably the raison d'etre of the rolling_* and expanding_* functions is a convenience to the user (to avoid having to loop over the index manually) as well as to take advantage of any performance improvements that might exist in an on-line algorithm over the naive loop implementation. As such it would seem that the rolling_cov and rolling_corr functions should preserve the API and behaviour of cov and corr respectively while adding an additional dimension to the result introduced by the rolling operation. Therefore {rolling|expanding}_cov and {rolling|expanding}_corr should produce Panels so that the invariant expanding_cov(df, ...).ix[-1] == df.cov() is maintained.

To deal with the case where the non-pairwise version is desired, I just learned that for DataFrame there exists a corrwith method which covers this case. Therefore how about renaming the existing rolling_corr function to rolling_corrwith and introducing covwith as well as {rolling_|expanding_|ewm}covwith. Then the *_pairwise methods I provided could just become the standard API without the _pairwise suffix.

@jtratner
Copy link
Contributor

@snth It could be hard sell to change the existing functions in the way you describe (since they'd produce different behavior with current defaults). if you couldn't change the existing API (i.e., pd.rolling_corr(...) with the current defaults needs to return the same thing as it did in previous versions), are there still ways that you think are simple to make these changes?

It looks like there already is a rolling_corr_pairwise function in the pandas namespace.

That said, why couldn't we change the signature such that if you only passed one object, it computes the rolling pairwise correlation and otherwise computes the correlation between the two objects. Shouldn't be that hard to implement. The only quirk would be that you'd have to either change the signature to: def rolling_corr(arg1, arg2=None, window=None, ...). And then if arg2 is None, do the pairwise correlation. It would mean that you'd always have to use keyword arguments for rolling pairwise functions, but I don't think that's such a big deal.

@snth
Copy link
Contributor Author

snth commented Sep 25, 2013

@jtratner I like your approach of overloading the signature. I can probably do this inside _flex_binary_moment() and then it should become available to all these functions we're discussing here. However I noticed that ewmcov() is currently not using _flex_binary_moment so I'll see if I can change that first.

@snth
Copy link
Contributor Author

snth commented Sep 25, 2013

@jtratner I've tried to incorporate the API changes as you suggested. I still need to update the documentation but I ran out of time. I'll look at that tomorrow.

@snth
Copy link
Contributor Author

snth commented Sep 26, 2013

@jtratner I've updated the documentation for the API changes. Along the way I unified the 3 doc template. Hopefully everything unaffected function docstrings are still intact. I did a couple of spot checks and everything looked ok to me. Would be good to have some more eyeballs on this.

I noticed that the center keyword argument isn't documented but appears in the function signatures. Should we add this?

The only other thing I'd like to add is to make these functions part of the DataFrame API as I'd quite like to be able to say things like df.rolling_cov(window=5). How do you feel about that?

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

@snth can you rebase this....looks pretty good

@jorisvandenbossche
Copy link
Member

@snth I just merged a PR (#6362) that also touches the docstrings in this file, which will make it more difficult to rebase (there will be some merge conflicts). If you need any help to resolve them, just let it know!

@snth
Copy link
Contributor Author

snth commented Feb 20, 2014

I tried a rebase of this onto on of the 0.13 release candidates some time ago and there were quite a few conflicts already then. I'm a little busy at the moment so I'll only be able to look at this next week at the earliest.

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@snth when you have a chance

@snth
Copy link
Contributor Author

snth commented Mar 14, 2014

@jreback I finally got around to doing that rebase. I rebased off of master as of this morning for which the whole test suite passed on my machine. Post-rebase tests look all ok as well but let's see what Travis says.

@jorisvandenbossche I tried to incorporate your docstring changes but as I had changed the template slightly I could have missed copying some changes across. If you wouldn't mind having a look to check that everything looks ok.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2014

this look great!

  • can you add a 1-liner release note referncing the issue inImprovements and put in 0.14.0.txt as well.
  • if you want to put a nice example in the docs / v0.14.0 would be great
  • can you squash down

@jorisvandenbossche
Copy link
Member

One remark: due to your (usefull!) refactoring of the docstrings, now the two notes I added (on "by default, result is set to right edge of window ..." and on "the freq keyword is used ...") are added to all functions, also to the expanding_ functions. However, there the center keyword argument is not sensible (although it is in the function signature, it is also not in the docstring). So the note on that should also only be shown in the rolling_ docstrings.

@snth
Copy link
Contributor Author

snth commented Mar 17, 2014

Thanks @jorisvandenbossche . Please check if the last commit fixed this.

@jreback If @jorisvandenbossche is happy with the docstrings then I'll squash down and add a release note.

@snth
Copy link
Contributor Author

snth commented Mar 17, 2014

A quick note about the additional behaviour I introduced.

Basically I took the pairwise behaviour from rolling_corr_pairwise() and moved it into _flex_binary_moment() in order to make it available to rolling_cov(), ewmcov(), expanding_cov(), ...

I am no statistician and cannot promise that this is statistically sound. In particular if there are missing values in the data then a different number of datapoints will be used in the calculation of different entries of the covariance matrices and these covariance matrices are not guaranteed to be positive semi-definite.

I just want to point that out as often the availability of something is taken as an implicit guarantee that it is correct, especially by novice users who are not trained in the field. In this case the user is responsible for ensuring that the results are suitable for his or her use-case.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2014

  • can you add a blurb to the docstring about this (and in the doc in the moments section)?
  • can you add a release note / v0.14.0.txt
  • squash down pls

thanks

otherwise looks good

@snth
Copy link
Contributor Author

snth commented Mar 27, 2014

I'm busy putting in release note. Should be done tomorrow.

@snth
Copy link
Contributor Author

snth commented Mar 28, 2014

I added a release note and some example usage for pairwise=True to the docs.

I also marked rolling_corr_pairwise() and expanding_corr_pairwise() as deprecated since these simply call rolling_corr() and expanding_corr() respectively. If you don't want to deprecate these then I will simply omit that commit when I squash everything down.

@jorisvandenbossche
Copy link
Member

I think you need to rebase (it can't be automatically merged)

correls[df.index[-50]]

.. note::

This was previously available through ``rolling_corr_pairwise`` which is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a prior to version 0.14

@jreback
Copy link
Contributor

jreback commented Mar 28, 2014

minor comments, pls rebase and I think good to go

@jreback
Copy link
Contributor

jreback commented Mar 28, 2014

The deprecation is fine (I marked a reminder to remove in #6581 at some point in the future)

@jreback
Copy link
Contributor

jreback commented Mar 28, 2014

looks good...ping when green

@snth
Copy link
Contributor Author

snth commented Mar 28, 2014

@jtratner Travis build is green.

jreback added a commit that referenced this pull request Mar 28, 2014
Pairwise versions for rolling_cov, ewmcov and expanding_cov
@jreback jreback merged commit aa166bf into pandas-dev:master Mar 28, 2014
@jreback
Copy link
Contributor

jreback commented Mar 28, 2014

@snth awesome! thanks!

@jreback
Copy link
Contributor

jreback commented Mar 29, 2014

http://pandas-docs.github.io/pandas-docs-travis/computation.html#moving-rolling-statistics-moments

the mention of the pairwise should be removed from the list for rolling and expanding yes?
eg rolling_corr_pairwise

@jreback
Copy link
Contributor

jreback commented Mar 30, 2014

I deleted those 2 references, pls review and submit a fix if the docs don't look correct

http://pandas-docs.github.io/pandas-docs-travis/computation.html#moving-rolling-statistics-moments

@snth
Copy link
Contributor Author

snth commented Mar 31, 2014

You're right, the mention of the 2 *_corr_pairwise() functions should have been deleted. Thanks for sorting it out. Looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ability to estimate full covariance matrices a la ewmcov/rolling_cov
4 participants