ENH: stats.gaussian_kde: replace use of inv_cov in pdf #16692

Merged: 13 commits, Aug 26, 2022

mdhaber (Contributor) commented Jul 24, 2022

Reference issue

Supersedes gh-5087

What does this implement/fix?

gh-5087 proposed replacing use of the inverse covariance matrix with the Cholesky decomposition of the covariance matrix throughout gaussian_kde, both to improve speed and to avoid the numerical instabilities associated with matrix inversion. There didn't seem to be any disagreement from a technical standpoint; it looks like development just stopped.

This PR implements the suggestion for gaussian_kde.pdf. Since logpdf is being Cythonized in gh-15493, I'll leave that alone to avoid merge conflicts.
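
For readers unfamiliar with the idea, here is a minimal sketch of why the Cholesky factor can stand in for the inverse covariance when evaluating the Gaussian quadratic form. It is not the code from this PR: the array names, sizes, and random data are illustrative assumptions, and only public NumPy/SciPy calls are used.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
d, n = 3, 50
dataset = rng.standard_normal((d, n))  # gaussian_kde stores data as (d, n)
x = rng.standard_normal(d)             # one evaluation point (illustrative)

cov = np.cov(dataset)                  # (d, d); positive definite for this data
diff = dataset - x[:, None]            # (d, n) differences to every data point

# Approach 1: explicit inverse covariance.
inv_cov = linalg.inv(cov)
energy_inv = np.einsum("in,ij,jn->n", diff, inv_cov, diff)

# Approach 2: factor cov = L @ L.T and solve L z = diff.
# Then diff.T @ inv(cov) @ diff == ||z||^2, with no explicit inverse formed.
L = linalg.cholesky(cov, lower=True)
z = linalg.solve_triangular(L, diff, lower=True)
energy_chol = np.sum(z**2, axis=0)

# The log-determinant also falls out of the factor: det(cov) = prod(diag(L))**2.
log_det = 2 * np.sum(np.log(np.diag(L)))
```

Both approaches give the same quadratic form up to rounding; the second replaces a general inverse with a triangular solve.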

Additional information

Here is the timing of the KDE benchmarks in main (results from CI of gh-16684):
stats.GaussianKDE.time_gaussian_kde_evaluate_few_points - 1.31±0ms
stats.GaussianKDE.time_gaussian_kde_evaluate_many_points - 1.61±0s

In this PR:
stats.GaussianKDE.time_gaussian_kde_evaluate_few_points - 645±0μs
stats.GaussianKDE.time_gaussian_kde_evaluate_many_points - 905±0ms

We might be able to do better with a closer look at new Cython/Python interactions.
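
As a quick reading aid, the ratios implied by the timings quoted above can be computed directly. The dictionary keys are shorthand introduced here, not benchmark identifiers.

```python
# Benchmark timings quoted above, converted to seconds.
timings = {
    "few_points": {"main": 1.31e-3, "pr": 645e-6},
    "many_points": {"main": 1.61, "pr": 905e-3},
}

# Speedup = time on main / time in this PR.
speedups = {name: t["main"] / t["pr"] for name, t in timings.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster")
# few_points: 2.03x faster
# many_points: 1.78x faster
```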

mdhaber added the scipy.stats and enhancement labels Jul 24, 2022
mdhaber requested a review from rkern Jul 24, 2022 04:53

mdhaber (Contributor Author) commented Jul 24, 2022

I think all the failures are the same - ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'. Hopefully an easy fix? But I'm developing on Windows, so I'm not seeing it locally.

=================================== FAILURES ===================================
___________________ test_kde_output_dtype[float128-float32] ____________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3.0
        bw_type    = <class 'numpy.float32'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f48089a7960>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f48089a7960>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
___________________ test_kde_output_dtype[float128-float64] ____________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3.0
        bw_type    = <class 'numpy.float64'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f48080ee960>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f48080ee960>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
___________________ test_kde_output_dtype[float128-float128] ___________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3.0
        bw_type    = <class 'numpy.float128'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f481a01b410>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f481a01b410>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
____________________ test_kde_output_dtype[float128-int32] _____________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3
        bw_type    = <class 'numpy.int32'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f4801001320>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f4801001320>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
____________________ test_kde_output_dtype[float128-int64] _____________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3
        bw_type    = <class 'numpy.int64'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f480100afa0>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f480100afa0>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
____________________ test_kde_output_dtype[float128-scott] _____________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 'scott'
        bw_type    = 'scott'
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f48080cdc80>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f48080cdc80>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
__________________ test_kde_output_dtype[float128-silverman] ___________________
[gw0] linux -- Python 3.8.10 /usr/bin/python3.8-dbg
../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 'silverman'
        bw_type    = 'silverman'
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x7f4800f6c280>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
../testenv/lib/python3.8/site-packages/scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x7f4800f6c280>
        spec       = 'long double'
_stats.pyx:741: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
        __builtins__ = <builtins>
        __doc__    = None
        __file__   = '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so'
        __loader__ = <_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>
        __name__   = 'scipy.stats._stats'
        __package__ = 'scipy.stats'
        __pyx_capi__ = {'_genhyperbolic_logpdf': <capsule object "double (double, void *)" at 0x7f4843fe4c20>, '_genhyperbolic_pdf': <capsule...at 0x7f4843fe49f0>, '_studentized_range_cdf': <capsule object "double (int, double *, void *)" at 0x7f4843fe4a40>, ...}
        __pyx_unpickle_Enum = <built-in function __pyx_unpickle_Enum>
        __spec__   = ModuleSpec(name='scipy.stats._stats', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f4843ad5b40>.../runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/stats/_stats.cpython-38d-x86_64-linux-gnu.so')
        __test__   = {}
        _center_distance_matrix = <built-in function _center_distance_matrix>
        _kendall_dis = <built-in function _kendall_dis>
        _local_correlations = <built-in function _local_correlations>
        _local_covariance = <built-in function _local_covariance>
        _rank_distance_matrix = <built-in function _rank_distance_matrix>
        _studentized_range_cdf_logconst = <built-in function _studentized_range_cdf_logconst>
        _studentized_range_pdf_logconst = <built-in function _studentized_range_pdf_logconst>
        _toint64   = <built-in function _toint64>
        _transform_distance_matrix = <built-in function _transform_distance_matrix>
        _weightedrankedtau = <cyfunction _weightedrankedtau at 0x7f4843e53050>
        gaussian_kernel_estimate = <cyfunction gaussian_kernel_estimate at 0x7f4843e53450>
        genhyperbolic_logpdf = <built-in function genhyperbolic_logpdf>
        genhyperbolic_pdf = <built-in function genhyperbolic_pdf>
        geninvgauss_logpdf = <built-in function geninvgauss_logpdf>
        linalg     = <module 'scipy.linalg' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/linalg/__init__.py'>
        np         = <module 'numpy' from '/home/runner/.local/lib/python3.8/site-packages/numpy/__init__.py'>
        scipy      = <module 'scipy' from '/home/runner/work/scipy/scipy/build/testenv/lib/python3.8/site-packages/scipy/__init__.py'>
        von_mises_cdf = <built-in function von_mises_cdf>
        warnings   = <module 'warnings' from '/usr/lib/python3.8/warnings.py'>
============================= slowest 10 durations =============================
35.68s call     build/testenv/lib/python3.8/site-packages/scipy/stats/tests/test_continuous_basic.py::test_kappa4_array_gh13582
25.11s call     build/testenv/lib/python3.8/site-packages/scipy/stats/tests/test_continuous_basic.py::test_cont_basic[500-200-skewnorm-arg91]
21.14s call     build/testenv/lib/python3.8/site-packages/scipy/_lib/tests/test_import_cycles.py::test_modules_importable
17.30s call     build/testenv/lib/python3.8/site-packages/scipy/optimize/tests/test_direct.py::TestDIRECT::test_segmentation_fault[False]
9.06s call     build/testenv/lib/python3.8/site-packages/scipy/stats/tests/test_continuous_basic.py::test_cont_basic[500-200-truncweibull_min-arg100]
8.39s call     build/testenv/lib/python3.8/site-packages/scipy/optimize/tests/test_optimize.py::TestOptimizeSimple::test_minimize_callback_copies_array[fmin]
7.70s call     build/testenv/lib/python3.8/site-packages/scipy/optimize/_trustregion_constr/tests/test_report.py::test_gh12922
6.91s call     build/testenv/lib/python3.8/site-packages/scipy/special/tests/test_cython_special.py::test_cython_api[elliprj]
6.03s call     build/testenv/lib/python3.8/site-packages/scipy/optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L4
5.78s call     build/testenv/lib/python3.8/site-packages/scipy/optimize/tests/test__differential_evolution.py::TestDifferentialEvolutionSolver::test_L1
=========================== short test summary info ============================
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-float32]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-float64]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-float128]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-int32]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-int64]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-scott]
FAILED ../testenv/lib/python3.8/site-packages/scipy/stats/tests/test_kdeoth.py::test_kde_output_dtype[float128-silverman]
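
One plausible way to hit a mismatch like the one in the log above is for an intermediate array to be computed in float64 and then passed to a typed buffer that expects the extended-precision output dtype. The sketch below only illustrates that pattern and the explicit cast that avoids it; `as_buffer_dtype` is a hypothetical helper, not something in SciPy, and this is not necessarily the fix adopted in the PR. On platforms where `np.longdouble` is the same as `double` (for example typical Windows builds), the cast is a no-op, which may be why the error does not reproduce there.

```python
import numpy as np


def as_buffer_dtype(arr, output_dtype):
    """Hypothetical helper: return `arr` with exactly `output_dtype`."""
    # copy=False avoids a copy when the dtype already matches.
    return np.asarray(arr).astype(output_dtype, copy=False)


cov = np.array([[2.5]])                 # float64 1x1 covariance (illustrative)
factor = np.linalg.cholesky(cov)        # computed in float64

output_dtype = np.dtype(np.longdouble)  # 'long double' on the C side
factor_ld = as_buffer_dtype(factor, output_dtype)

print(factor.dtype, "->", factor_ld.dtype, "itemsize", factor_ld.dtype.itemsize)
```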

mdhaber (Contributor Author) commented Jul 24, 2022

Rather than all the permutations required to replace whitening, I could undo gh-8558 and replace the original use of the precision matrix. It would make the code a lot easier to understand. But that would put a triangular solve in the for loop. So I think I'll just add comments in the code that explain how the code relates to the original idea.
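
The trade-off described here can be sketched as follows, under the assumption that "whitening" means transforming the data and the evaluation points once by the Cholesky factor so that the loop only needs Euclidean distances. This is an illustration in plain NumPy/SciPy, not the Cython code in `_stats.pyx`; the variable names and the unweighted, unit-bandwidth setup are assumptions.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(1)
d, n, m = 2, 40, 5
dataset = rng.standard_normal((d, n))   # (d, n) data, as in gaussian_kde
points = rng.standard_normal((d, m))    # (d, m) evaluation points

cov = np.cov(dataset)                   # kernel covariance (bandwidth factor 1)
L = linalg.cholesky(cov, lower=True)
norm = (2 * np.pi) ** (-d / 2) / np.prod(np.diag(L))  # 1 / sqrt(det(2*pi*cov))

# Whitening: two triangular solves, done once, outside the loop.
dataset_w = linalg.solve_triangular(L, dataset, lower=True)
points_w = linalg.solve_triangular(L, points, lower=True)
est_whitened = np.empty(m)
for j in range(m):
    r = dataset_w - points_w[:, j, None]
    est_whitened[j] = norm * np.mean(np.exp(-0.5 * np.sum(r**2, axis=0)))

# Alternative: keep the original coordinates and solve inside the loop.
est_loop = np.empty(m)
for j in range(m):
    diff = dataset - points[:, j, None]
    z = linalg.solve_triangular(L, diff, lower=True)
    est_loop[j] = norm * np.mean(np.exp(-0.5 * np.sum(z**2, axis=0)))
```

The two estimates agree up to rounding; the whitened version simply moves the triangular solves out of the per-point loop.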

mdhaber (Contributor Author) commented Aug 6, 2022

@steppi if you also like linear algebra, this may be interesting to you.

mdhaber closed this Aug 9, 2022
mdhaber reopened this Aug 9, 2022

steppi (Contributor) commented Aug 15, 2022

> @steppi if you also like linear algebra, this may be interesting to you.

Putting this next in my queue.

Copy link
Contributor Author

@mdhaber mdhaber left a comment

Trying to fix a series of errors like:

___________________ test_kde_output_dtype[float128-float128] ___________________
[gw1] darwin -- Python 3.10.5 /Users/runner/miniconda3/envs/scipy-dev/bin/python
scipy/stats/tests/test_kdeoth.py:328: in test_kde_output_dtype
    result = k(points)
        bw         = 3.0
        bw_type    = <class 'numpy.float128'>
        dataset    = array([0., 1., 2., 3., 4.], dtype=float128)
        dtype      = <class 'numpy.float128'>
        k          = <scipy.stats._kde.gaussian_kde object at 0x140000160>
        points     = array([0., 1., 2., 3., 4.], dtype=float128)
        weights    = array([0., 1., 2., 3., 4.], dtype=float128)
scipy/stats/_kde.py:270: in evaluate
    result = gaussian_kernel_estimate[spec](
        d          = 1
        itemsize   = 16
        m          = 5
        output_dtype = <class 'numpy.float128'>
        points     = array([[0., 1., 2., 3., 4.]], dtype=float128)
        self       = <scipy.stats._kde.gaussian_kde object at 0x140000160>
        spec       = 'long double'
_stats.pyx:748: in scipy.stats._stats.gaussian_kernel_estimate
    ???
E   ValueError: Buffer dtype mismatch, expected 'long double' but got 'double'
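
One plausible reading of this traceback (an assumption, not something confirmed in this thread) is that one of the arrays handed to the 'long double' specialization was still in double precision, for example a factor computed by a double-precision linear algebra routine. A minimal sketch of the kind of cast that keeps all buffers consistent, using a hypothetical helper `cast_for_spec`:

```python
import numpy as np

def cast_for_spec(output_dtype, *arrays):
    """Hypothetical helper: cast every array to the dtype the kernel expects."""
    return tuple(np.asarray(a, dtype=output_dtype) for a in arrays)

output_dtype = np.longdouble
points = np.arange(5, dtype=output_dtype)[None, :]
# float64 array standing in for a factor computed in double precision
cho_cov = np.array([[1.5]])

points_c, cho_cov_c = cast_for_spec(output_dtype, points, cho_cov)
```

On platforms where long double is the same type as double, no mismatch can occur, which would be consistent with the failure not showing up on a Windows development machine.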

Contributor

@steppi steppi left a comment

I haven't had time to check everything, but things look OK mathematically so far. See the suggestion about adding some comments. I found the permutations a little inscrutable at first, so I think some comments to explain what's going on and link to more details would be helpful.

I should have time to complete my review next weekend.

Comment on lines +585 to +586
self._data_cho_cov = linalg.cholesky(
    self._data_covariance[::-1, ::-1]).T[::-1, ::-1]
Contributor

@steppi steppi Aug 21, 2022

I think there's enough going on here that there should be some comments to explain things. It took me a bit to figure out what's happening.

Just to make sure I'm on the same page, here are the details as I understand them:

Let $\Gamma$ be the covariance matrix for the Gaussian kernel and $LL^{\top} = \Gamma^{-1}$ be the Cholesky decomposition of the precision matrix $\Gamma^{-1}$. Later on, we want to be able to transform $n \times d$ data matrices $X$ by multiplying on the right by the Cholesky factor $L$ for $\Gamma^{-1}$. Equivalently, we want to be able to find $Z$ such that $Z = XL$.

If we know $L^{-1}$, we can instead use a triangular solver to solve $ZL^{-1} = X$ for $Z$. It turns out we can calculate $L^{-1}$ with a Cholesky decomposition, not of $\Gamma$ itself, but of a permuted version of $\Gamma$.

If $RR^{\top} = \Gamma$ then $(R^{-1})^{\top}R^{-1} = \Gamma^{-1}$. This isn't quite a Cholesky decomposition though, since $(R^{-1})^{\top}$ is upper triangular instead of lower triangular. Let $J$ be the antidiagonal matrix with all ones on the antidiagonal and zeros elsewhere. Multiplying on the left by $J$ reverses the order of the rows, and multiplying on the right by $J$ reverses the order of the columns. If we instead calculate the Cholesky decomposition $RR^{\top} = J\Gamma J$, then
$$(R^{-1})^{\top}R^{-1} = J^{-1}\Gamma^{-1}J^{-1} = J\Gamma^{-1}J$$
and thus
$$\Gamma^{-1} = J(R^{-1})^\top R^{-1}J = (JR^{-1}J)^\top(JR^{-1}J)$$

where we've used that $J = J^{-1}$ and $J = J^{\top}$.

This means that if we know the Cholesky factor $R$ of $J\Gamma J$, the Cholesky factor of $\Gamma^{-1}$ is $J(R^{-1})^{\top}J$. (If $A$ is upper triangular, then $JAJ$ is lower triangular.) This means $L^{-1} = JR^{\top}J$, and we can write our equation as $ZJR^{\top}J = X$ and solve for $Z$. This is exactly what you've done, but it's still not intuitive for me yet, just algebraic.

I think we should have a comment explaining the type of equation we want to be able to solve: $Z = XL$, where $L = \operatorname{Chol}\left(\Gamma^{-1}\right)$, or equivalently $ZL^{-1} = X$. Also a brief sentence explaining that it's possible to calculate $L^{-1}$ directly from the Cholesky decomposition of the permuted matrix obtained by reversing both the rows and columns of $\Gamma$. Also some kind of citation to a place to find more details. The best explanation I've found is in a MathOverflow answer by the eminent mathematician Robert Israel. Perhaps just a link to this answer would be good enough.
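
The identity above can be checked numerically. The following is a minimal sketch under stated assumptions: `Gamma` is an arbitrary symmetric positive definite test matrix, `X` is random data, and the explicit inverse is computed only as a reference; none of these names come from the PR itself.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
d, n = 3, 5
A = rng.standard_normal((d, d))
Gamma = A @ A.T + d * np.eye(d)      # symmetric positive definite test matrix
X = rng.standard_normal((n, d))      # n x d data matrix

# Reference: Cholesky factor L of the precision matrix via explicit inversion.
L = linalg.cholesky(linalg.inv(Gamma), lower=True)
Z_ref = X @ L

# Inversion-free route: factor J Gamma J (rows and columns reversed).
R = linalg.cholesky(Gamma[::-1, ::-1], lower=True)   # R R^T = J Gamma J
L_inv = R.T[::-1, ::-1]                              # J R^T J, lower triangular

# Solve Z L^{-1} = X, i.e. (L^{-1})^T Z^T = X^T, with a triangular solver.
Z = linalg.solve_triangular(L_inv, X.T, lower=True, trans='T').T
```

Here `L_inv @ L` is the identity to rounding error and `Z` agrees with `Z_ref`, so the transformed data is obtained without ever forming $\Gamma^{-1}$.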

Contributor Author

@mdhaber mdhaber Aug 22, 2022

It definitely does deserve an explanation, and I meant to include one before you got to this. Sorry to make you find it on your own. Yes, I think that is the original post I followed. I'll write a bit about the motivation and link to that.

Contributor

No worries. The explanation is very clear now. I think everything is in good shape now but still want to double check carefully.

Member

I didn't follow the whole argument, but if it applies here, use the lower=False keyword for starting with an upper triangular in calling cholesky. Might save a column swap or two.

Contributor

@steppi steppi Aug 23, 2022

> I didn't follow the whole argument, but if it applies here, use the lower=False keyword for starting with an upper triangular in calling cholesky. Might save a column swap or two.

It's a good thought but isn't quite what we want. According to the documentation for the underlying LAPACK function, if $LL^{\top}$ is the Cholesky factorization of $\Gamma$, then cholesky(Gamma, lower=True) will return $L$ and cholesky(Gamma, lower=False) will return $L^{\top}$. I've tried it to be sure. What we would actually need is to find an upper triangular matrix $U$ such that $\Gamma = UU^{\top}$. The cholesky function can't do this, so we're required to do the trick with reversing the rows and columns.
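
A small sketch of this point, assuming `scipy.linalg.cholesky` and an arbitrary 2 x 2 example matrix (the names `U_lapack` and `U` are illustrative):

```python
import numpy as np
from scipy import linalg

Gamma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

L = linalg.cholesky(Gamma, lower=True)          # Gamma = L L^T
U_lapack = linalg.cholesky(Gamma, lower=False)  # Gamma = U_lapack^T U_lapack

# lower=False returns the transpose of the same factor, not a new factorization.
same_factor = np.allclose(U_lapack, L.T)

# An upper triangular U with Gamma = U U^T needs the reversal trick:
# factor J Gamma J, then reverse the rows and columns of the lower factor.
U = linalg.cholesky(Gamma[::-1, ::-1], lower=True)[::-1, ::-1]
```

For this matrix, `U_lapack @ U_lapack.T` does not reproduce `Gamma`, while `U @ U.T` does.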

Contributor

@steppi steppi Aug 23, 2022

Oh, but I guess it will save us one transpose when computing the Cholesky decomposition of $J\Gamma J$ since we need the upper triangular factor there. It may be worthwhile for small matrices but will most likely have a negligible impact.

Contributor

@steppi steppi left a comment

I think this looks good. Nice work. I suggested one more place that I think could use a comment but it’s fine if you think it isn’t needed.

@property
def inv_cov(self):
    self.factor = self.covariance_factor()
    self._data_covariance = atleast_2d(cov(self.dataset, rowvar=1,
Contributor

Why do we recompute self._data_covariance here? Does it help to maintain backwards compatibility for existing subclasses? If this is needed, could probably use a comment to explain why but it’s fine if you think it isn’t necessary.

Contributor Author

I think I assumed that since _compute_covariance used to re-calculate the covariance every time, there must have been a reason. Otherwise, why not do it just once in the __init__ method? As it was, it was re-calculated every time set_bandwidth was called, so maybe people use set_bandwidth to recalculate everything after modifying the public attribute dataset? I don't really know, but figured it would be safer this way.

And maybe subconsciously I want use of inv_cov to be as slow as possible : ) See the discussion in the original incarnation of this issue - #5087 (comment).

Contributor

Cool. That makes sense.

@steppi steppi merged commit 00315a5 into scipy:main Aug 26, 2022
@mdhaber mdhaber added this to the 1.10.0 milestone Nov 29, 2022
Labels
enhancement A new feature or improvement scipy.stats
4 participants