ENH: Convert gaussian_kde logpdf to Cython #15493

Merged: 23 commits, merged Aug 29, 2022

Conversation

@europeanplaice (Contributor) commented:

Reference issue

What does this implement/fix?

In this PR, I converted gaussian_kde.logpdf to Cython. The code is based on gaussian_kernel_estimate, so it may not be fully optimized. To take advantage of Cython I avoided built-in functions; for example, I implemented logsumexp manually and introduced nested loops. Feedback is welcome.

Additional information

I couldn't see a significant performance improvement on medium-sized datasets. However, as far as I know, this Cython code runs faster on large-scale datasets in particular.
I am going to run more detailed benchmark tests.
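For background on what the conversion computes: gaussian_kde.logpdf is the log of a weighted sum of Gaussian kernels, evaluated in log space so that points far out in the tails don't underflow to log(0). A pure-NumPy sketch of the same quantity (an illustration only, not the PR's Cython code; it uses the public dataset, inv_cov, covariance, and weights attributes of gaussian_kde):

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def logpdf_reference(kde, points):
    # Evaluate the log of the weighted kernel sum entirely in log space:
    # logsumexp over (log-weight + Gaussian exponent), minus the
    # normalization term 0.5 * log(det(2*pi*Sigma)).
    points = np.atleast_2d(points)                           # shape (d, m)
    diff = kde.dataset[:, :, None] - points[:, None, :]      # shape (d, n, m)
    energy = -0.5 * np.einsum('inm,ij,jnm->nm', diff, kde.inv_cov, diff)
    log_norm = 0.5 * np.log(np.linalg.det(2 * np.pi * kde.covariance))
    return logsumexp(energy + np.log(kde.weights)[:, None], axis=0) - log_norm

rng = np.random.default_rng(0)
kde = stats.gaussian_kde(rng.standard_normal(200))
x = np.array([-30.0, -1.0, 0.0, 2.0, 30.0])  # tail points underflow a naive pdf
```

Evaluating `logpdf_reference(kde, x)` should agree with `kde.logpdf(x)` and stay finite even at the tail points, where exponentiating first would flush to zero.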

@tupui added the labels Cython (issues with the internal Cython code base), enhancement (a new feature or improvement), and scipy.stats on Jan 31, 2022
@tupui (Member) left a comment:


Thanks @europeanplaice for the PR. To measure the performance, you can have a look at our benchmark system https://scipy.github.io/devdocs/dev/contributor/benchmarking.html#benchmarking-with-asv

@europeanplaice (Contributor, Author) commented:

@tupui Thank you for your advice, but currently I can't build scipy in my environment, so I ran a benchmark test here:
https://colab.research.google.com/drive/1ZJwqlwW7n5DueNNaeOI1nWDf_4qt4_9K?usp=sharing

@europeanplaice (Contributor, Author) commented on Feb 4, 2022:

I can now build scipy in my local environment.
This is the benchmark I ran with asv locally. At least on my machine, we can see an improvement.

(asv compare output attached)

@europeanplaice (Contributor, Author) commented:

Thank you @tupui. Please correct me if I am wrong.

scipy/stats/_kde.py, lines 590 to 601 at a2b5304:

points = atleast_2d(x)
d, m = points.shape
if d != self.d:
    if d == 1 and m == self.d:
        # points was passed in as a row vector
        points = reshape(points, (self.d, 1))
        m = 1
    else:
        msg = "points have dimension %s, dataset has dimension %s" % (d,
                                                                      self.d)
        raise ValueError(msg)

and

if xi.shape[1] != d:
    raise ValueError("points and xi must have same trailing dim")

are the same test, so the check in the Cython code can be removed. In my commit, I removed these lines.


The check below could also be moved to logpdf, but it is not useful: it is clear that the covariance's shape matches the dataset's shape, and I can't come up with any other case. So I think it can also be removed. In my commit, I removed these lines.

if precision.shape[0] != d or precision.shape[1] != d:
    raise ValueError("precision matrix must match data dims")


Below, I can't trigger the error these lines raise, because np.common_type returns
<class 'numpy.float16'> (itemsize 2),
<class 'numpy.float32'> (itemsize 4),
<class 'numpy.float64'> (itemsize 8), or
<class 'numpy.float128'> (itemsize 16),
and self.covariance is always float64, so I can't construct invalid data that raises the error. In my commit, I kept this code but didn't add tests.

scipy/stats/_kde.py, lines 603 to 613 at a2b5304:

output_dtype = np.common_type(self.covariance, points)
itemsize = np.dtype(output_dtype).itemsize
if itemsize == 4:
    spec = 'float'
elif itemsize == 8:
    spec = 'double'
elif itemsize in (12, 16):
    spec = 'long double'
else:
    raise TypeError('%s has unexpected item size %d' %
                    (output_dtype, itemsize))

@tupui added this to the 1.9.0 milestone on Apr 29, 2022
@tupui (Member) left a comment:


Thanks @europeanplaice. LGTM, but it would be good if someone else could have a look. I added a milestone to hopefully get more attention.

Note: the benchmark should be refactored with a param instead of having two different sets of tests for low/high numbers of points. A bit out of scope, but it can be a follow-up if you are interested.

@mdhaber (Contributor) left a comment:


I can't review the algorithmic details or the Cython, but the Python side looks OK. I'm not very familiar with the gaussian_kde code, so is logpdf already well tested for accuracy? If not, I'd add accuracy tests. (Yes, it is tested against pdf.) I will also check whether kernel.logpdf(values) for the example in the docstring changes much. Update: the maximum difference is 1.776e-15, so no problem there. The speedup is a factor of ~2x (35.3 ms vs 84.8 ms).

@tupui if these are tested to your satisfaction, feel free to merge unless you'd like to wait for a more thorough review from someone qualified to go through the algorithm line by line.

result[i] = logsumexp(0.5 * log_to_sum)

return result
raise ValueError(
@mdhaber (Contributor) commented on Apr 29, 2022:


Can this be tested? Or, if it is already tested, can we add a match to make sure the intended message is emitted?

@europeanplaice (Contributor, Author) replied:

It can be tested partially.
I copied it from the evaluate method so that logpdf stays consistent with evaluate; I have no other reason for this code.

np.common_type returns one of [_nx.half, _nx.single, _nx.double, _nx.longdouble, _nx.csingle, _nx.cdouble, _nx.clongdouble], and the itemsizes are as follows:

import numpy.core.numeric as _nx
print(np.dtype(_nx.half), np.dtype(_nx.half).itemsize)
print(np.dtype(_nx.single), np.dtype(_nx.single).itemsize)
print(np.dtype(_nx.double), np.dtype(_nx.double).itemsize)
print(np.dtype(_nx.longdouble), np.dtype(_nx.longdouble).itemsize)
print(np.dtype(_nx.csingle), np.dtype(_nx.csingle).itemsize)
print(np.dtype(_nx.cdouble), np.dtype(_nx.cdouble).itemsize)
print(np.dtype(_nx.clongdouble), np.dtype(_nx.clongdouble).itemsize)
float16 2
float32 4
float64 8
float128 16
complex64 8
complex128 16
complex256 32

This means that if the common type is complex256 or float16, an error is raised.

For complex256 , the test would be like,

def test_pdf_logpdf_dtype_validation():
    xn = (np.random.random(10) + np.random.random(10) * 1j).astype("clongdouble")
    xs = (np.random.random(10) + np.random.random(10) * 1j).astype("clongdouble")
    gkde = stats.gaussian_kde(xn)
    msg = "<class 'numpy.complex256'> has unexpected item size 32"
    with pytest.raises(ValueError, match=msg):
        gkde.logpdf(xs)

But the float16 case can't be tested: gkde.covariance is float64, so np.common_type returns float64 even if the array's dtype is float16.
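The promotion claim above is easy to verify directly (a small standalone check, not part of the PR):

```python
import numpy as np

# gaussian_kde.covariance is always float64, so np.common_type promotes
# any float16 input up to float64 and the float16 error branch is unreachable.
cov = np.eye(2)                        # float64, standing in for gkde.covariance
pts16 = np.arange(4, dtype=np.float16)
print(np.common_type(cov, pts16))      # <class 'numpy.float64'>
```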

@tylerjereddy (Contributor) left a comment:


Reviews seem mostly positive here; the duplicated asv benchmark is maybe a bit awkward?


def time_gaussian_kde_logpdf_many_points(self):
    # test gaussian_kde logpdf on many points
    self.kernel.logpdf(self.positions)
A Contributor commented:

maybe a single function named time_gaussian_kde_logpdf with time_gaussian_kde_logpdf.params = [self.positions, self.positions[:, :10]]?

Like this: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#parameterized-benchmarks
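A parameterized version along those lines could look like this (a hypothetical sketch following asv's params/param_names convention; the class name, seed, and data shapes are made up, not the PR's actual benchmark):

```python
import numpy as np
from scipy import stats

class GaussianKDELogpdf:
    # One timing function covers both the few-points and many-points
    # cases through asv's `params` mechanism instead of two near-duplicate
    # time_* methods.
    params = [10, 1000]
    param_names = ['n_points']

    def setup(self, n_points):
        rng = np.random.default_rng(12345)
        data = rng.standard_normal((2, 500))
        self.kernel = stats.gaussian_kde(data)
        self.positions = rng.standard_normal((2, n_points))

    def time_gaussian_kde_logpdf(self, n_points):
        self.kernel.logpdf(self.positions)
```

asv then reports one result per entry in `params`, so the low/high point-count comparison comes for free.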

@europeanplaice (Contributor, Author) replied:

Thank you, I agree. I added params to parameterize the benchmarks.

@rgommers (Member) left a comment:


I pushed some changes that were needed. The bigger issue is in the Cython code, however: the new code is mostly (80-90%) the same as the code from gaussian_kernel_estimate, with a few discrepancies that are not easy to understand. @europeanplaice, it seems much better to me to either make this a single routine that deviates only where the log is applied, or, if that's not easily possible, refactor the common parts out or template the routine (.pyx.in) so the two copies of this routine come from the same source.

estimate = np.zeros((m, p), dtype)
zeros_max = np.zeros((p,), dtype) - np.inf
empty_estimate_ = np.empty((p, n), dtype)
for j in range(m):
A Member commented:

The loop variables i and j are exactly reversed compared to gaussian_kernel_estimate; can you please make them the same?

@europeanplaice (Contributor, Author) replied:

If I swapped the order of i and j, it would not work, because

# Calculate logsumexp
for k in range(p):
    for i in range(n):
        estimate[j, k] += math.exp(estimate_[k, i] - max_[k])
    estimate[j, k] = math.log(estimate[j, k])
    estimate[j, k] += max_[k]

this block depends on j. Currently the loop logic is complicated, so it's hard to align it with gaussian_kernel_estimate. The code would be simpler if we could separate logsumexp from this loop.

A Contributor commented:

The code would be simple if we could separate logsumexp from this loop.

I was going to suggest the same. Please do make a logsumexp function separate from this loop and add a simple unit test for it that compares it against special.logsumexp. I see you've already discussed np.log vs looping over math.log, so I won't ask for that to be changed, but we are already going to get a substantial speed boost from converting this to Cython, so I think it's fair to balance speed and readability.

A Contributor replied:

I made them the same.

for k in range(p):
    estimate_[k, i] = log_values_[i, k] + log_arg

# Collect max value for logsumexp
A Member commented:

I don't understand what is going on here to be honest - can you explain?

A Member added:

It also has to do with estimate, estimate_ and empty_estimate_ - the copying and multiple arrays used there seem unnecessary. Why can this not be the same as for gaussian_kernel_estimate?

@europeanplaice (Contributor, Author) replied:

I don't understand what is going on here to be honest - can you explain?

When it returns

$$ \log(\sum^N_{i=1} \exp(x_i)) $$

I see an underflow: the values of $x_i$ are so small (very negative) that $\exp(x_i)$ flushes to zero.
To avoid this problem, the formula should be rewritten as

$$ \begin{aligned} \log\left(\sum^N_{i=1} \exp(x_i)\right) &= \log\left(\exp(x_{max})\sum^N_{i=1} \exp(x_i - x_{max})\right) \\ &= \log\left(\sum^N_{i=1} \exp(x_i - x_{max})\right) + x_{max} \end{aligned} $$

With this conversion, the arguments of $\exp$ lie in $[x_{min}-x_{max},\ 0]$, so the possibility of underflow is reduced.

This function exists as logsumexp in scipy, but scipy's logsumexp is implemented in Python; to keep the calculation fast, I wrote it myself. The code block you quoted finds $x_{max}$.
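The derivation above translates into a short reference implementation (a plain-Python sketch for illustration, not the Cython code in this PR):

```python
import math

def logsumexp(xs):
    # Shift every term by the max so all exponents are <= 0: the largest
    # term becomes exp(0) = 1, so the sum cannot collectively underflow.
    # The shift is added back at the end.
    x_max = max(xs)
    if math.isinf(x_max):  # all -inf (log of an empty/zero sum), or +inf
        return x_max
    return math.log(sum(math.exp(x - x_max) for x in xs)) + x_max

# A naive log(sum(exp(x))) underflows here: exp(-1000) is 0.0 in float64,
# so the naive result would be log(0) = -inf instead of about -999.687.
print(logsumexp([-1000.0, -1001.0]))
```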

In log space, multiplication becomes addition, so gaussian_kernel_estimate's

 for k in range(p):
    estimate[j, k] += values_[i, k] * arg

corresponds to gaussian_kernel_estimate_log's

for k in range(p):
    estimate_[k, i] = log_values_[i, k] + log_arg

It also has to do with estimate, estimate_ and empty_estimate_ - the copying and multiple arrays used there seem unnecessary. Why can this not be the same as for gaussian_kernel_estimate?

For each m, I have to reset estimate_. If we didn't reset it, the previous iteration's max value could be selected.

I'm not sure my explanation has enough information. Does it make sense?

A Member replied:

Thanks @europeanplaice, yes that makes sense. It looks like this would have been much easier if we had logsumexp implemented in C/Cython and exposed in special.cython_special, right?

@europeanplaice (Contributor, Author) replied:

Yes. It would be great not only for this PR, but also for future contributions!

@jjerphan (Contributor) commented:

Hi, I am just passing by, looking for PRs involving Cython in scipy.

We do have a Cython implementation of logsumexp in scikit-learn; I guess you can reuse it.

On a side note, I am available to review Cython PRs for scipy, so if needed, feel free to @ me. :)

A Member replied:

Thank you @jjerphan that's good to know 😃 💪

@jjerphan (Contributor) commented on Jun 29, 2022:

There should be a place where people could look for reviewers. 😃

As for scikit-learn, we are looking for reviewers, especially for Cython. Inter-project support and communication are both nice and necessary for projects' health and dynamics, IMO, so here is my attempt at encouraging them.

A Member replied:

Do you mean a place where a person would offer reviewing expertise? I would widen this even further and say: someone offers a certain expertise. Sometimes we have features on our wish list that we don't know much about, or deep technical questions that need no software engineering expertise at all.

That would be awesome. We could do something similar to Publons and think about how to motivate people to engage in this. This is definitely something for Scientific Python to think about.

cc @jarrodmillman

A Contributor replied:

I've created this thread on the Scientific Python Discourse to broaden the discussions regarding sharing skills across projects among the ecosystem.

@tylerjereddy (Contributor) commented:

I'm going to bump the milestone here if that is alright. It sounds like Ralf has carved out a path forward for this PR at least.

@tylerjereddy modified the milestones (1.9.0 → 1.10.0) on May 29, 2022
@mdhaber (Contributor) left a comment:


Thanks @europeanplaice. I think this is close. Just a few thoughts from me.

@europeanplaice (Contributor, Author) commented on Aug 3, 2022:

Though I haven't written a unit test, I made a Cython version of logsumexp. It is similar to https://github.com/scikit-learn/scikit-learn/blob/ef21ca1950f9570f2f6b1ce05d24111f78f9f612/sklearn/linear_model/_sag_fast.pyx.tp#L76-L95 (#15493 (comment)).
The function I made only accepts a one-dimensional array and doesn't have capabilities comparable to special.logsumexp. However, it runs as fast as the previous commit, and the loop inside gaussian_kernel_estimate_log is simpler than before.

@jjerphan (Contributor) commented on Aug 3, 2022:

Note that this PR overlaps on some portions with #16758.

Comment on lines +840 to +852
# Create the result array and evaluate the weighted sum
estimate = np.full((m, p), fill_value=-np.inf, dtype=dtype)
for i in range(n):
    for j in range(m):
        arg = 0
        for k in range(d):
            residual = (points_[i, k] - xi_[j, k])
            arg += residual * residual

        arg = -arg / 2 + log_norm
        for k in range(p):
            estimate[j, k] = logsumexp(estimate[j, k],
                                       arg + log_values_[i, k])
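The two-argument logsumexp used here folds one value at a time into a running total, which is why estimate starts at -np.inf (the log of zero, the identity for this operation). A plain-Python sketch of that accumulator (for illustration; the PR's version is in Cython):

```python
import math

def logsumexp2(a, b):
    # log(exp(a) + exp(b)) without overflow/underflow; -inf acts as the
    # identity element, so a fold can start from -inf just like the
    # np.full((m, p), -np.inf) initialization above.
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))

# Folding a sequence through logsumexp2 reproduces the full logsumexp,
# even for values whose exp() would underflow to 0.0 individually.
acc = -math.inf
for x in [-1000.0, -1001.0, -1002.0]:
    acc = logsumexp2(acc, x)
```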
A Contributor commented:

I made this line-by-line equivalent with gaussian_kernel_estimate. If this resulted in a performance loss, please open a new PR to change it. I'd suggest getting a Cython-accessible logsumexp function added to scipy.special first, so it doesn't need to be part of this PR, and please demonstrate the difference in performance.

@mdhaber (Contributor) commented on Aug 26, 2022:

@europeanplaice @jjerphan I edited things so that I'd be comfortable merging this. Please check that I did not make any huge mistakes, since I am not very fluent in Cython. If CI comes back green, I'll plan to merge on Monday.

@europeanplaice (Contributor, Author) commented:

I didn't find any mistakes in the commit, and it looks good to me. The code is simpler than before. Thank you.

@jjerphan (Contributor) commented:

I won't be able to do a review in the coming days, so just move on without me if you need to.

@mdhaber (Contributor) commented on Aug 26, 2022:

In that case, I'll invite former reviewers @rgommers and @tupui to take a final look if they're interested. (If not, that's OK; I'm comfortable enough to merge if there are no more comments.)

@mdhaber merged commit 8bcfdf7 into scipy:main on Aug 29, 2022
@mdhaber (Contributor) commented on Aug 29, 2022:

Thanks @europeanplaice and reviewers! I'll add an entry to the release notes.
