ENH: stats.gaussian_kde: replace use of inv_cov in pdf #16692
Conversation
I think all the failures are the same -
Rather than all the permutations required to replace
@steppi if you also like linear algebra, this may be interesting to you.
Putting this next in my queue.
Trying to fix a series of errors like:

```
___________________ test_kde_output_dtype[float128-float128] ___________________
[gw1] darwin -- Python 3.10.5 /Users/runner/miniconda3/envs/scipy-dev/bin/python
scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw      = 3.0
        bw_type = <class 'numpy.float128'>
        dataset = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype   = <class 'numpy.float128'>
        k       = <scipy.stats._kde.gaussian_kde object at 0x140000160>
        points  = array([0., 1., 2., 3., 4.], dtype=float128)
        weights = array([0., 1., 2., 3., 4.], dtype=float128)
scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d            = 1
        itemsize     = 16
        m            = 5
        output_dtype = <class 'numpy.float128'>
        points       = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self         = <scipy.stats._kde.gaussian_kde object at 0x140000160>
        spec         = 'long double'
_stats.pyx:748: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
```
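An illustrative sketch of the failure mode (my own example, not the scipy internals): a dtype-specialized Cython kernel expects every buffer to have the dtype chosen for the specialization, so a `float64` factor passed alongside `longdouble` data triggers the buffer mismatch. Casting the inputs to a common dtype beforehand avoids it.

```python
# Hypothetical illustration of the dtype mismatch, not the scipy code.
import numpy as np

points = np.array([0., 1., 2., 3., 4.], dtype=np.longdouble)
covariance = np.atleast_2d(np.float64(3.0))   # computed in plain float64
cho_cov = np.linalg.cholesky(covariance)      # still float64

# A 'long double' kernel specialization would reject the float64 buffer,
# so cast every input to the common output dtype before the call.
output_dtype = np.result_type(points.dtype, cho_cov.dtype)
cho_cov = cho_cov.astype(output_dtype, copy=False)

assert cho_cov.dtype == output_dtype
assert output_dtype == np.longdouble
```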
I haven't had time to check everything, but things look OK mathematically so far. See the suggestion about adding some comments. I found the permutations a little inscrutable at first, so I think some comments to explain what's going on and link to more details would be helpful.
I should have time to complete my review next weekend.
```python
self._data_cho_cov = linalg.cholesky(
    self._data_covariance[::-1, ::-1]).T[::-1, ::-1]
```
I think there's enough going on here that there should be some comments to explain things. It took me a bit to figure out what's happening.

Just to make sure I'm on the same page, here are the details as I understand them:

If we know a Cholesky factorization $\Gamma = L L^T$ of the reversed matrix $\Gamma = J \Sigma J$, where $J$ is the exchange matrix (the identity with its rows reversed, so $J = J^T = J^{-1}$), then

$$\Sigma = J \Gamma J = J L L^T J = (J L J)(J L J)^T,$$

and thus $\Sigma = U U^T$ with $U = J L J$ upper triangular, where we've used that $J^T = J$ and $J J = I$.

This means that if we know the Cholesky factor $L$ of the reversed covariance, we obtain a factorization of $\Sigma$ into an upper triangular matrix times its transpose, which `cholesky` alone does not provide.

I think we should have a comment explaining the types of equations we want to be able to solve and why this is the factorization we need for them.
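The row/column-reversal trick under discussion can be sketched numerically (my own illustration, not the scipy implementation): factoring the reversed matrix and reversing the factor back yields $\Sigma = M M^T$ with $M$ upper triangular.

```python
# Sketch of the reversal trick: standard Cholesky gives Sigma = L @ L.T
# (L lower) or Sigma = U.T @ U (U upper); reversing rows and columns
# before factoring yields Sigma = M @ M.T with M *upper* triangular.
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
sigma = A @ A.T + 4 * np.eye(4)   # a symmetric positive definite matrix

# Factor the reversed matrix, then reverse the factor back.
M = linalg.cholesky(sigma[::-1, ::-1]).T[::-1, ::-1]

assert np.allclose(M, np.triu(M))   # M is upper triangular
assert np.allclose(M @ M.T, sigma)  # Sigma = M @ M.T
```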
It definitely does deserve an explanation, and I meant to include one before you got to this. Sorry you had to work it out on your own. Yes, I think that is the original post I followed. I'll write a bit about the motivation and link to it.
No worries. The explanation is very clear now. I think everything is in good shape, but I still want to double-check carefully.
I didn't follow the whole argument, but if it applies here, use the `lower=False` keyword to start with an upper triangular factor when calling `cholesky`. Might save a column swap or two.
> I didn't follow the whole argument, but if it applies here, use the `lower=False` keyword to start with an upper triangular factor when calling `cholesky`. Might save a column swap or two.
It's a good thought, but it isn't quite what we want. According to the documentation for the underlying LAPACK function, `cholesky(Gamma, lower=True)` will return a lower triangular $L$ with $\Gamma = L L^T$, and `cholesky(Gamma, lower=False)` will return an upper triangular $U$ with $\Gamma = U^T U$. What we need is $\Gamma = U U^T$ with $U$ upper triangular; the `cholesky` function can't do this, so we're required to do the trick with reversing the rows and columns.
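A minimal check of the two conventions just described (my own illustration): `scipy.linalg.cholesky` returns factors satisfying $\Gamma = L L^T$ or $\Gamma = U^T U$, but never $\Gamma = U U^T$ directly.

```python
# Verify the two scipy.linalg.cholesky conventions on a random SPD matrix.
import numpy as np
from scipy import linalg

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
gamma = B @ B.T + 3 * np.eye(3)   # symmetric positive definite

L = linalg.cholesky(gamma, lower=True)    # lower triangular factor
U = linalg.cholesky(gamma, lower=False)   # upper triangular factor

assert np.allclose(L @ L.T, gamma)        # Gamma = L @ L.T
assert np.allclose(U.T @ U, gamma)        # Gamma = U.T @ U
assert not np.allclose(U @ U.T, gamma)    # Gamma = U @ U.T needs the reversal trick
```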
Oh, but I guess it will save us one transpose when computing the Cholesky decomposition of the reversed matrix.
I think this looks good. Nice work. I suggested one more place that I think could use a comment but it’s fine if you think it isn’t needed.
```python
@property
def inv_cov(self):
    self.factor = self.covariance_factor()
    self._data_covariance = atleast_2d(cov(self.dataset, rowvar=1,
```
Why do we recompute `self._data_covariance` here? Does it help to maintain backwards compatibility for existing subclasses? If this is needed, it could probably use a comment to explain why, but it's fine if you think it isn't necessary.
I think I assumed that since `_compute_covariance` used to re-calculate the covariance every time, there must have been a reason. Otherwise, why not do it just once in the `__init__` method? As it was, it was re-calculated every time `set_bandwidth` was called, so maybe people use `set_bandwidth` to recalculate everything after modifying the public attribute `dataset`? I don't really know, but figured it would be safer this way.

And maybe subconsciously I want use of `inv_cov` to be as slow as possible :) See the discussion in the original incarnation of this issue - #5087 (comment).
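A hypothetical minimal sketch of the trade-off under discussion (the class, names, and sizes are my own, not scipy's): computing the covariance inside the property means mutations of the public `dataset` attribute are picked up on the next access, at the cost of repeating the work every time.

```python
# Toy illustration only: recomputing inside the property tracks changes
# to the public `dataset` attribute, which caching in __init__ would miss.
import numpy as np

class TinyKDE:
    def __init__(self, dataset):
        self.dataset = np.atleast_2d(dataset)

    @property
    def inv_cov(self):
        # Recomputed on every access, so it always reflects self.dataset.
        data_covariance = np.atleast_2d(np.cov(self.dataset, rowvar=True))
        return np.linalg.inv(data_covariance)

kde = TinyKDE(np.random.default_rng(2).standard_normal((2, 50)))
first = kde.inv_cov
kde.dataset = np.random.default_rng(3).standard_normal((2, 50))
assert not np.allclose(first, kde.inv_cov)  # the property saw the new data
```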
Cool. That makes sense.
Reference issue
Supersedes gh-5087
What does this implement/fix?
gh-5087 proposed replacing use of the inverse covariance matrix with the Cholesky decomposition of the covariance matrix throughout `gaussian_kde` to improve speed and avoid the numerical instabilities associated with matrix inversion. There didn't seem to be disagreement from a technical standpoint; it looks like development just stopped. This PR implements the suggestion for `gaussian_kde.pdf`. Since `logpdf` is being Cythonized in gh-15493, I'll leave that alone to avoid merge conflicts.

Additional information
Here is the timing of the KDE benchmarks in main (results from CI of gh-16684):

- `stats.GaussianKDE.time_gaussian_kde_evaluate_few_points` - 1.31±0ms
- `stats.GaussianKDE.time_gaussian_kde_evaluate_many_points` - 1.61±0s

In this PR:

- `stats.GaussianKDE.time_gaussian_kde_evaluate_few_points` - 645±0μs
- `stats.GaussianKDE.time_gaussian_kde_evaluate_many_points` - 905±0ms

We might be able to do better with a closer look at new Cython/Python interactions.
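A rough way to reproduce the comparison locally (the data sizes and repeat count here are my own assumptions; the actual asv benchmark parameters may differ):

```python
# Rough local timing sketch for gaussian_kde evaluation at few vs. many
# points; sizes are illustrative, not the asv benchmark's parameters.
import timeit
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
kde = stats.gaussian_kde(rng.standard_normal(1000))

few = rng.standard_normal(10)
many = rng.standard_normal(10_000)

t_few = timeit.timeit(lambda: kde(few), number=5)
t_many = timeit.timeit(lambda: kde(many), number=5)
print(f"few: {t_few:.4f}s  many: {t_many:.4f}s")

assert kde(few).shape == (10,)
assert kde(many).shape == (10_000,)
```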