ENH: Convert gaussian_kde logpdf to Cython #15493
Conversation
Thanks @europeanplaice for the PR. To measure the performance, you can have a look at our benchmark system https://scipy.github.io/devdocs/dev/contributor/benchmarking.html#benchmarking-with-asv
@tupui Thank you for your advice, but currently I can't build scipy in my environment, so I ran a benchmark test here.

Thank you @tupui. Please correct me if I am wrong. Lines 590 to 601 in a2b5304 and Lines 795 to 796 in a2b5304 are the same tests, so the test in Cython can be removed. The lines below could also be moved to Lines 797 to 798 in a2b5304. Below, I can't validate that these lines raise an error, because of Lines 603 to 613 in a2b5304.
Co-authored-by: Pamphile Roy <roy.pamphile@gmail.com>
Thanks @europeanplaice. LGTM, but it would be good if someone else could have a look. I added a milestone to hopefully get more attention.
Note: the benchmark should be refactored with a param instead of having 2 different sets of tests for low/high numbers of points. A bit out of scope, but it can be a follow-up if you are interested.
I can't review the algorithmic details or Cython, but the Python side looks OK. I'm not very familiar with the gaussian_kde code, so is it already well tested for logpdf accuracy? If not, I'd add accuracy tests.

Yes, it is tested against pdf. I will also check whether kernel.logpdf(values) for the example in the docstring changes much. Update: the maximum difference is 1.776e-15, so no problem there. Speedup is a factor of ~2x (35.3 ms vs 84.8 ms).
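For readers who want to reproduce this kind of check, here is a minimal sketch (not the PR's actual test; dataset and grid are arbitrary) comparing logpdf against the log of pdf:

```python
import numpy as np
from scipy import stats

# logpdf should agree with log(pdf) to near machine precision
rng = np.random.default_rng(0)
kde = stats.gaussian_kde(rng.standard_normal(200))
x = np.linspace(-3, 3, 50)
max_diff = np.max(np.abs(kde.logpdf(x) - np.log(kde.pdf(x))))
assert max_diff < 1e-12
```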
@tupui if these are tested to your satisfaction, feel free to merge unless you'd like to wait for a more thorough review from someone qualified to go through the algorithm line by line.
scipy/stats/_kde.py
Outdated
result[i] = logsumexp(0.5 * log_to_sum)

return result
raise ValueError(
Can this be tested? Or, if it is already tested, can we add a match to make sure the intended message is emitted?
It can be partially tested. I copied it from the evaluate method so that logpdf stays consistent with evaluate, and I don't have any other reason for this code.

np.common_type returns one of [_nx.half, _nx.single, _nx.double, _nx.longdouble, _nx.csingle, _nx.cdouble, _nx.clongdouble], and each of these has a corresponding itemsize:
import numpy as np
import numpy.core.numeric as _nx
print(np.dtype(_nx.half), np.dtype(_nx.half).itemsize)
print(np.dtype(_nx.single), np.dtype(_nx.single).itemsize)
print(np.dtype(_nx.double), np.dtype(_nx.double).itemsize)
print(np.dtype(_nx.longdouble), np.dtype(_nx.longdouble).itemsize)
print(np.dtype(_nx.csingle), np.dtype(_nx.csingle).itemsize)
print(np.dtype(_nx.cdouble), np.dtype(_nx.cdouble).itemsize)
print(np.dtype(_nx.clongdouble), np.dtype(_nx.clongdouble).itemsize)
float16 2
float32 4
float64 8
float128 16
complex64 8
complex128 16
complex256 32
This means that if the common dtype is complex256 or float16, an error is raised.
For complex256, the test would look like:
def test_pdf_logpdf_dtype_validation():
    xn = (np.random.random(10) + np.random.random(10) * 1j).astype("clongdouble")
    xs = (np.random.random(10) + np.random.random(10) * 1j).astype("clongdouble")
    gkde = stats.gaussian_kde(xn)
    msg = "<class 'numpy.complex256'> has unexpected item size 32"
    with pytest.raises(ValueError, match=msg):
        gkde.logpdf(xs)
But for float16 it can't be tested: gkde.covariance is float64, so np.common_type returns float64 even if the input array's dtype is float16.
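A small sketch (not the PR's code) of why the float16 branch is unreachable: np.common_type applied to a float16 array together with a float64 array (like gkde.covariance) promotes to float64.

```python
import numpy as np

# np.common_type promotes its inputs to a shared inexact scalar type;
# float16 mixed with float64 yields float64, so the float16 error
# branch can never trigger when the covariance is float64.
a = np.zeros(3, dtype=np.float16)
cov = np.zeros((1, 1), dtype=np.float64)
assert np.common_type(a, cov) is np.float64
```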
Reviews seem mostly positive here; the duplicated asv benchmark is maybe a bit awkward?
benchmarks/benchmarks/stats.py
Outdated
def time_gaussian_kde_logpdf_many_points(self):
    # test gaussian_kde logpdf on many points
    self.kernel.logpdf(self.positions)
Maybe a single function named time_gaussian_kde_logpdf with time_gaussian_kde_logpdf.params = [self.positions, self.positions[:, :10]]?
Like this: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#parameterized-benchmarks
Thank you, and I agree with you. I added params to parameterize the benchmarks.
It's now ~100 ms, while it was close to a second (which is too much).
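For reference, a parameterized asv benchmark might look like the sketch below. The class, data shapes, and parameter values here are illustrative, not scipy's actual benchmark suite:

```python
import numpy as np
from scipy import stats

class GaussianKDE:
    # asv runs setup() and each time_* method once per entry in params,
    # replacing the two near-identical low/high point-count benchmarks.
    params = [10, 1000]
    param_names = ['n_points']

    def setup(self, n_points):
        rng = np.random.default_rng(12345)
        self.kernel = stats.gaussian_kde(rng.standard_normal((2, 100)))
        self.positions = rng.standard_normal((2, n_points))

    def time_gaussian_kde_logpdf(self, n_points):
        self.kernel.logpdf(self.positions)
```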
I pushed some changes that were needed. The bigger issue is in the Cython code, however: the new code is mostly (80%-90%) the same as the code from gaussian_kernel_estimate, with a few discrepancies that are not easy to understand. @europeanplaice it seems much better to me to either make this a single routine that deviates only on a log yes/no flag, or, if that's not easily possible, refactor the common parts out or template the routine (.pyx.in) so the two copies come from the same source.
scipy/stats/_stats.pyx
Outdated
estimate = np.zeros((m, p), dtype)
zeros_max = np.zeros((p,), dtype) - np.inf
empty_estimate_ = np.empty((p, n), dtype)
for j in range(m):
The loop variables i and j are exactly reversed compared to gaussian_kernel_estimate; can you please make them the same?
If I swapped the order of i and j, it would not work well, because
# Calculate logsumexp
for k in range(p):
    for i in range(n):
        estimate[j, k] += math.exp(estimate_[k, i] - max_[k])
    estimate[j, k] = math.log(estimate[j, k])
    estimate[j, k] += max_[k]
this block depends on j. Currently the loop logic is complicated, so it's hard to make it align with gaussian_kernel_estimate. The code would be simpler if we could separate logsumexp from this loop.
The code would be simple if we could separate logsumexp from this loop.

I was going to suggest the same. Please do make a logsumexp function separate from this loop, and add a simple unit test for it that compares it against special.logsumexp. I see you've already discussed np.log vs looping over math.log, so I won't ask for that to be changed; we are already going to get a substantial speed boost from converting this to Cython, so I think it's fair to balance speed and readability.
I made them the same.
scipy/stats/_stats.pyx
Outdated
for k in range(p):
    estimate_[k, i] = log_values_[i, k] + log_arg

# Collect max value for logsumexp
I don't understand what is going on here to be honest - can you explain?
It also has to do with estimate, estimate_ and empty_estimate_: the copying and multiple arrays used there seem unnecessary. Why can this not be the same as for gaussian_kernel_estimate?
I don't understand what is going on here to be honest - can you explain?

When it computes log(sum_i w_i * exp(a_i)) directly, I see an underflow error, because each exp(a_i) can be smaller than the smallest positive float and the whole sum rounds to zero. To avoid this problem, the formula should be changed to

log(sum_i exp(x_i)) = max(x) + log(sum_i exp(x_i - max(x)))

By this conversion, the arguments of exp are shifted so the largest term is exp(0) = 1, and the sum no longer underflows. This function exists as logsumexp in scipy. However, scipy's logsumexp is implemented in Python, so to keep the calculation fast I wrote it myself. In the code block you quoted, it computes the log of each term of the weighted sum: in log space, multiplication becomes addition.
gaussian_kernel_estimate's

for k in range(p):
    estimate[j, k] += values_[i, k] * arg

corresponds to gaussian_kernel_estimate_log's

for k in range(p):
    estimate_[k, i] = log_values_[i, k] + log_arg
It also has to do with estimate, estimate_ and empty_estimate_ - the copying and multiple arrays used there seem unnecessary. Why can this not be the same as for gaussian_kernel_estimate?

For each m, I have to reset estimate_. If we didn't reset it, it would be possible that the previous iteration's max value would be selected. I'm not sure my explanation has enough information; does it make sense?
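The shift-by-max trick being discussed can be sketched in plain Python (this is not the PR's Cython code, just an illustration checked against scipy.special.logsumexp):

```python
import math
import numpy as np
from scipy.special import logsumexp as sp_logsumexp

def logsumexp_stable(x):
    # log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)), with m = max(x).
    # After shifting, the largest term is exp(0) = 1, so the sum can
    # never round down to zero and log() stays finite.
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

x = [-1000.0, -1001.0, -1002.0]
# The direct formula underflows here: exp(-1000.0) == 0.0 in float64,
# so log(sum(exp(x_i))) would be log(0).
assert math.exp(-1000.0) == 0.0
assert np.isclose(logsumexp_stable(x), sp_logsumexp(np.asarray(x)))
```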
Thanks @europeanplaice, yes, that makes sense. It looks like this would have been much easier if we had logsumexp implemented in C/Cython and exposed in special.cython_special, right?
Yes. It would be great not only for this PR, but also for future contributions!
Hi, I am just passing by to look for any PRs involving Cython in scipy. We do have a Cython implementation of logsumexp in scikit-learn; I guess you can reuse it.
On a side note, I am available for reviewing Cython PRs for scipy, so if needed, feel free to @ me. :)
Thank you @jjerphan that's good to know 😃 💪
There should be a place where people could look for reviewers. 😃
As for scikit-learn, we are looking for reviewers, especially for Cython. Inter-project support and communication are both nice and necessary for projects' health and dynamics IMO, so here is my attempt at encouraging it.
Do you mean a place where a person would offer reviewing expertise? I would widen this even further and say: someone offers a certain kind of expertise. Sometimes we have features on our wish list that we don't know much about, or deep technical questions that don't need software engineering expertise at all.
That would be awesome. We could do something similar to Publons and think about how to motivate people to engage and do this. This is definitely something for Scientific Python to think about.
I've created this thread on the Scientific Python Discourse to broaden the discussion about sharing skills across projects in the ecosystem.
I'm going to bump the milestone here if that is alright. It sounds like Ralf has carved out a path forward for this PR at least.
Thanks @europeanplaice. I think this is close. Just a few thoughts from me.
Please do make a logsumexp function separate from this loop and add a simple unit test for it that compares it against special.logsumexp.
Though I haven't written a unit test, I made a Cython version.
Note that this PR overlaps in some portions with #16758.
# Create the result array and evaluate the weighted sum
estimate = np.full((m, p), fill_value=-np.inf, dtype=dtype)
for i in range(n):
    for j in range(m):
        arg = 0
        for k in range(d):
            residual = (points_[i, k] - xi_[j, k])
            arg += residual * residual

        arg = -arg / 2 + log_norm
        for k in range(p):
            estimate[j, k] = logsumexp(estimate[j, k],
                                       arg + log_values_[i, k])
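The pairwise update above relies on a two-argument logsumexp accumulated from -inf (the log of an empty sum). A plain-Python sketch of such a helper (hypothetical; the PR's actual helper lives in scipy/stats/_stats.pyx) behaves like the batch version:

```python
import math
import numpy as np
from scipy.special import logsumexp as sp_logsumexp

def logsumexp2(a, b):
    # Two-argument log-sum-exp; -inf acts as the identity element,
    # so an accumulator can start from the log of an empty sum.
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

terms = [-1000.0, -999.5, -1003.0]
acc = -math.inf
for t in terms:
    acc = logsumexp2(acc, t)
# Running accumulation matches the batch reduction
assert np.isclose(acc, sp_logsumexp(np.asarray(terms)))
```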
I made this line-by-line equivalent with gaussian_kernel_estimate. If this results in a performance loss, please open a new PR to change it. I'd suggest getting a Cython-accessible logsumexp function added to scipy.special first, so it doesn't need to be part of this PR, and please demonstrate the difference in performance.
@europeanplaice @jjerphan I edited things so that I'd be comfortable merging this. Please check that I did not make any huge mistakes, since I am not very fluent in Cython. If CI comes back green, I'll plan to merge on Monday.
I didn't find any mistakes in the commit and it looks good to me. The code became simpler than before. Thank you.
I won't be able to do a review in the coming days, hence just move on without me if you need to.
Thanks @europeanplaice and reviewers! I'll add an entry to the release notes.
Reference issue
What does this implement/fix?
In this PR, I converted gaussian_kde.logpdf to Cython. The code is based on gaussian_kernel_estimate, so it may not be fully optimized. To take advantage of Cython, I avoided built-in functions; for example, I added logsumexp manually and introduced nested loops. Feedback is welcome.

Additional information

I didn't see a significant performance improvement on medium-sized datasets. However, as far as I know, this Cython code runs faster, especially on large-scale datasets. I am going to run more detailed benchmark tests.
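A quick micro-benchmark along these lines can be sketched as follows (dataset and point-set sizes are arbitrary here, not the PR's benchmark configuration):

```python
import timeit
import numpy as np
from scipy import stats

# Time logpdf on a larger evaluation set; large point counts are where
# the Cython version is expected to help most.
rng = np.random.default_rng(1)
kde = stats.gaussian_kde(rng.standard_normal((2, 500)))
points = rng.standard_normal((2, 5000))
per_call = timeit.timeit(lambda: kde.logpdf(points), number=3) / 3
print(f"logpdf on 5000 points: {per_call:.4f} s per call")
```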