
ENH: ndimage: log(n) implementation for 1D rank filter #20543

Open
wants to merge 2 commits into main

Conversation

ggkogan

@ggkogan ggkogan commented Apr 20, 2024

@ggkogan ggkogan requested a review from rgommers as a code owner April 20, 2024 22:40
@github-actions github-actions bot added scipy.ndimage C/C++ Items related to the internal C/C++ code base Cython Issues with the internal Cython code base Meson Items related to the introduction of Meson as the new build system for SciPy enhancement A new feature or improvement labels Apr 20, 2024
@ggkogan ggkogan marked this pull request as draft April 21, 2024 06:50
@ggkogan ggkogan marked this pull request as ready for review April 21, 2024 07:01
@ggkogan
Author

ggkogan commented Apr 21, 2024

I see many failed tests due to issues that seem unrelated to this pull request. Is there anything I can or should do about it?

@rgommers
Member

If test failures are really unrelated, you can simply ignore them. However, I had a quick look and all the ones I saw do look related (e.g. TestNdimageInterpolation)

@ggkogan
Author

ggkogan commented Apr 21, 2024

You are right. I will correct it and resubmit.

@ggkogan ggkogan marked this pull request as draft April 21, 2024 08:10
@rgommers
Member

I will correct it and resubmit.

Thanks. Just to make sure: you can push new commits to the same branch that you opened this PR from, and they will show up here. No need to open a second PR.

scipy/ndimage/meson.build (outdated review thread; resolved)
@rgommers
Member

I had a quick look at the code, and the heavy use of templating and fused types jumped out at me. It results in a too-large size for the new extension module; it is immediately almost the largest one in the ndimage module:

% ls -lh build/scipy/ndimage/*.so | awk '{print $5, $9}'
51K build/scipy/ndimage/_ctest.cpython-311-darwin.so
105K build/scipy/ndimage/_cytest.cpython-311-darwin.so
161K build/scipy/ndimage/_nd_image.cpython-311-darwin.so
372K build/scipy/ndimage/_ni_label.cpython-311-darwin.so
345K build/scipy/ndimage/_rank_filter_1d.cpython-311-darwin.so

Could this be limited to a subset of types? What does _nd_image.rank_filter support, and why would it be necessary to extend that already-supported set?

Started working on https://ideone.com/8VVEa, I optimized by [...]

The license of this code looks a little problematic. It has a broken link to something that (from the mit-license in the link address) indicates the author licensed it as MIT, but that is a bit inconclusive. Is there something more definite for this code? If not, it'd be good to contact the original author to try to clarify that this is really MIT-licensed code.

@lucascolley lucascolley changed the title ENH: rank filter for 1D cases log(n) complexity implementation ENH: ndimage.rank_filter: log(n) implementation for 1D cases Apr 21, 2024
@lucascolley lucascolley changed the title ENH: ndimage.rank_filter: log(n) implementation for 1D cases ENH: ndimage: log(n) implementation for 1D rank filter Apr 21, 2024
@rgommers
Member

The performance improvement looks promising! If the scaling changed from N to log(N), the usual way to demonstrate that is to plot runtime as a function of kernel size (or input array size), e.g. like the graphs in https://github.com/rgommers/explore-array-computing/blob/master/explore_xnd.ipynb. I think the vertical axis of your plot is more or less like that, but a bit hard to read. Is it log(N) in kernel size, and is there any change for array input size or filter order?
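A minimal sketch of the kind of scaling measurement suggested here. `rank_filter_1d_naive` is a hypothetical brute-force stand-in (not the PR's implementation); timing it, or the PR's kernel, across kernel sizes produces the data for a runtime-vs-kernel-size plot:

```python
# Sketch: time a 1-D rank filter across kernel sizes to study scaling.
# rank_filter_1d_naive is a brute-force O(n * size) stand-in for illustration.
import time
import numpy as np

def rank_filter_1d_naive(x, rank, size):
    # Partition every window; 'reflect' is one of the boundary modes
    # that ndimage supports.
    half = size // 2
    padded = np.pad(x, half, mode='reflect')
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = np.partition(padded[i:i + size], rank)[rank]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
timings = {}
for size in (11, 101, 501):
    t0 = time.perf_counter()
    rank_filter_1d_naive(x, size // 2, size)  # rank = size // 2 is the median
    timings[size] = time.perf_counter() - t0
```

Plotting `timings` on a log axis would show roughly linear growth in kernel size for the brute-force version, against near-flat growth for an O(log size) sliding implementation.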

@ggkogan
Author

ggkogan commented Apr 21, 2024

E.g., like the graphs in

The vertical axis is the kernel size, but the improvement for the median filter is larger than for very high or low ranks; I wanted to demonstrate both aspects. In general, I understand that you are going to perform independent testing, so this was for initial illustration only. Thanks for the comment anyway; I will consider it in the future.

@ggkogan
Author

ggkogan commented Apr 21, 2024

I will correct it and resubmit.

Thanks. Just to make sure: you can push new commits to the same branch that you opened this PR from, and they will show up here. No need to open a second PR.

I am sorry, it was 3am and my brain was totally crashed :)

@ggkogan
Author

ggkogan commented Apr 21, 2024

I had a quick look at the code, and the heavy use of templating and fused types jumped out to me. It results in a too-large size of the new extension module, it's immediately almost the largest one in the ndimage module:

% # on macOS arm64
% ls -lh build/scipy/ndimage/*.so | awk '{print $5, $9}'
51K build/scipy/ndimage/_ctest.cpython-311-darwin.so
105K build/scipy/ndimage/_cytest.cpython-311-darwin.so
161K build/scipy/ndimage/_nd_image.cpython-311-darwin.so
372K build/scipy/ndimage/_ni_label.cpython-311-darwin.so
345K build/scipy/ndimage/_rank_filter_1d.cpython-311-darwin.so

Could this be limited to a subset of types? What does _nd_image.rank_filter support, and why would it be necessary to extend that already-supported set?

Started working on https://ideone.com/8VVEa, I optimized by [...]

The license of this code looks a little problematic. It has a broken link to something that (from the mit-license in the link adddress) indicates the author licensed it as MIT, but that is a bit inconclusive. Is there something more definite for this code? If not, it'd be good to contact the original author to try to clarify that this is really MIT-licensed code.

Concerning the license, the code in the referenced link is also available on Stack Overflow. Does that solve the problem?
https://stackoverflow.com/a/5971248/8443371
https://stackoverflow.com/a/5970314/8443371

@ggkogan ggkogan marked this pull request as ready for review April 21, 2024 14:12
@rgommers
Member

Concerning the license, the code in the referenced link is also available on Stack Overflow. Does that solve the problem?

Unfortunately not. From http://scipy.github.io/devdocs/dev/core-dev/index.html#licensing: "For instance, code published on StackOverflow is covered by a CC-BY-SA license, which is not compatible due to the share-alike clause. These contributions cannot be accepted for inclusion in SciPy unless the original code author is willing to (re)license his/her code under the modified BSD (or compatible) license"

@ggkogan
Author

ggkogan commented Apr 21, 2024

contact the original author

I think that I found the original and it seems ok. Please approve @rgommers

@rgommers
Member

I think that I found the original and it seems ok. Please approve @rgommers

Nice digging. Agreed, that looks like the original and states clearly enough that it is MIT-licensed. So all good from that perspective.

@ggkogan
Author

ggkogan commented Apr 23, 2024

I had a quick look at the code, and the heavy use of templating and fused types jumped out to me. It results in a too-large size of the new extension module, it's immediately almost the largest one in the ndimage module:

% # on macOS arm64:
% ls -lh build/scipy/ndimage/*.so | awk '{print $5, $9}'
51K build/scipy/ndimage/_ctest.cpython-311-darwin.so
105K build/scipy/ndimage/_cytest.cpython-311-darwin.so
161K build/scipy/ndimage/_nd_image.cpython-311-darwin.so
372K build/scipy/ndimage/_ni_label.cpython-311-darwin.so
345K build/scipy/ndimage/_rank_filter_1d.cpython-311-darwin.so

Could this be limited to a subset of types? What does _nd_image.rank_filter support, and why would it be necessary to extend that already-supported set?

I have looked into this comment. If I keep the same Cython file but eliminate all possible overhead, typing every variable in every function and defining the template in the .cpp file as a function, the size drops to 190K. Currently, I have reduced the fused types to the main types only (int64/float32/double); the remaining types are cast to those (leading to some variability in performance but keeping a solid improvement over the current implementation).
On the other hand, if I define all the variations of the function input types within the .cpp file and use ctypes to import them, the .so file size is 40K.
Therefore, I suspect that the Cython glue itself accounts for most of the size, regardless of the typing. I looked for other .pyx files within the submodule that are used as glue for C/C++ files, intending to reuse their approach to reduce the overhead. Unfortunately, I did not find any.

Two questions:

  1. Do you think I am missing something?
  2. Will using a direct API such as Numpy's solve it?

Also, I know that you are pretty busy and these are basic questions. Is there anyone else who can support me here, or should I go solo / ask on Stack Overflow? Maybe I should use another platform or try to involve other people in this PR?

@ev-br
Member

ev-br commented Apr 24, 2024

those are basic questions.

They aren't. Related things are under rather active discussion (see #20334 (comment) for a sample), and the jury is still out on what the recommended way to wrap C/C++ kernels going forward will be. Up until very recently the standing recommendation was Cython, just like you did. Maybe you are the perfect person to move this forward?

So I think you've at least two options, and none of them pretty :-).

@rgommers
Member

Currently, I reduced the usage of the fused types to the main types only (int64/float32/double), the rest of the types are casted to those types (leading to some instability in performance but keeping solid improvement over the current implementation).

This is the thing we do pretty much everywhere, and it is pretty standard and necessary. Cython adds overhead indeed, but so does templating over a large set of types in C++.

I tested again, on Linux x86-64 this time:

$ # on Linux x86-64
$ python dev.py build -C-Dbuildtype=release
$ ls -lh build/scipy/ndimage/*.so | awk '{print $5, $9}'
17K build/scipy/ndimage/_ctest.cpython-311-x86_64-linux-gnu.so
84K build/scipy/ndimage/_cytest.cpython-311-x86_64-linux-gnu.so
135K build/scipy/ndimage/_nd_image.cpython-311-x86_64-linux-gnu.so
394K build/scipy/ndimage/_ni_label.cpython-311-x86_64-linux-gnu.so
263K build/scipy/ndimage/_rank_filter_1d.cpython-311-x86_64-linux-gnu.so

Turning off release mode (our default python dev.py build):

1,9M build/scipy/ndimage/_ni_label.cpython-311-x86_64-linux-gnu.so
1,2M build/scipy/ndimage/_rank_filter_1d.cpython-311-x86_64-linux-gnu.so

263 kb is still surprisingly large after removal of most of the fused types usage. I'm actually a little puzzled by what is happening there. I did a quick test doubling the number of types under ctypedef fused numeric_t:, and it only increases the size by 31 kb (~12%). Reducing it to only double reduced the size to 241 kb, so only a ~10% reduction.

Changing def statements to cdef already helps more: it's down to 209 kb with only cdef rank_filter_1d_cpp_api, and down to 161 kb with cdef rank_filter_1d as well. So it's probably useful to do all the input validation and selection of the correct type to call in a pure Python file, and to use only a 1-D memoryview in the Cython code rather than calling numpy functions through the Python API. There really isn't much of a speed gain in doing that kind of thing in Cython.

Making rank_filter_1d_cpp_api work on a memoryview also avoids having it return a Python object (out_arr), and it can then be annotated noexcept nogil. cython -a may help show where Python is used in the .pyx file (see docs).

pybind11 or using the C API directly could be a nice comparison, but it looks to me like optimizing the Cython code first should be the next step. The C++ code here is clean/short enough that this should be doable in Cython I'd hope.

@ggkogan
Author

ggkogan commented Apr 30, 2024

Hi,
I have implemented the linkage via the NumPy C API. From my tests, no other approach provides the same speed and file size:

  1. I have tested all the suggested directions for Cython, and it is a dead end with respect to file size.
  2. An experienced user of pybind11 indicated it creates smaller files but it is slower.

A direct compilation (GCC with flags) produces a smaller .so than the compilation via meson.build. I assume it is related to some interference by Cython (I have seen that Cython appears in the .so file name, therefore I assume we are using it as the compilation engine).
Currently, the file size for the three main datatypes (float32, double, int64) is 23K with the -O1 compilation flag and 27K with the package's default compilation. With direct compilation there is no file-size variation. I assume the addition in file size is another Cython artifact; as far as I can see, there is no corresponding boost in performance.
I prefer to use the default package compilation, avoiding any surprises or additional maintenance in the future.

PS: any comments about the code (or something else) are more than welcome 🙏 @ev-br @rgommers or anyone else who reads it

@ggkogan ggkogan force-pushed the enh-rank-filt-1d branch 2 times, most recently from 7729ee0 to 8651e0b Compare April 30, 2024 15:20
Member

@rgommers rgommers left a comment


Thanks for the detailed experiments and explanation @ggkogan! Overall your conclusions sound about right to me.

An experienced user of pybind11 indicated it creates smaller files but it is slower.

I think they meant function call overhead. That is true indeed, but only relevant for small arrays; for large arrays (which is what we should mostly care about) it doesn't really matter.


Looks like this is down to 21 kb for a release build on Linux (and 87 kb with the python dev.py build defaults):

$ ls -lh build/scipy/ndimage/_rank_filter_1d.cpython-311-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 21K  3 mei 14:44 build/scipy/ndimage/_rank_filter_1d.cpython-311-x86_64-linux-gnu.so

That's ~12x smaller than the Cython version, which is great.

(I have seen that Cython appears in .so file name therefore I assume we are using it as the compilation engine)

Not Cython, but CPython: cpython-311-x86_64-linux-gnu.so. The file extension contains info relevant to the ABI used.

I assume that the addition in the file size is another Cython artifact - as far as I see, no boost in performance is seen.

No, just depends on the optimization level etc. Release builds are -O3, dev.py builds are -O2 -g.

py3.extension_module('_rank_filter_1d',
'src/_rank_filter_1d.cpp',
install: true,
dependencies: np_dep,
Member


Missing a line here that is present for all other extension modules:

    link_args: version_link_args,

could you please add that?

origins)
if input.ndim == 1:
rank = int(rank)
origin = int(origin)
Member


These should already be integers. Why was this needed?



My assumption was similar to yours, but I added it after failing some unit tests. This was back when I used Cython as glue, and it reported that my input was np.int64 rather than int. I did not want to look for the origin of the problem, as I did not want to modify anything unrelated to this pull request; I hoped it would be sufficiently visible for someone to correct in the future.
Anyway, I do not see it now (I assume the NumPy C API is not sensitive to it), and therefore I removed the casting.

if mode == 6:
mode = 4
if mode == 5:
mode = 1
Member


Correct indeed, as explained in the user guide: http://scipy.github.io/devdocs/tutorial/ndimage.html. It looks a bit out of place to do this here though. Not sure why it could not be done inside _extend_mode_to_code if the modes are actually identical.
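A sketch of folding the aliased boundary modes in one place, as the review suggests could happen inside `_extend_mode_to_code`. The numeric codes mirror the diff above (6 maps to 4, 5 maps to 1); which string modes these codes denote is not shown in the excerpt, so treat the mapping as illustrative:

```python
# Sketch: collapse equivalent extension-mode codes once, before dispatching
# to the compiled kernel. Codes taken from the diff above; illustrative only.
_MODE_ALIASES = {6: 4, 5: 1}

def collapse_mode(mode):
    # identical boundary modes share one code
    return _MODE_ALIASES.get(mode, mode)
```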

mode = 4
if mode == 5:
mode = 1
casting_cond = input.dtype.name not in ['int64', 'float64', 'float32']
Member


Prefer using the actual dtypes: input.dtype not in (np.int64, np.float64, np.float32)

casting_cond = input.dtype.name not in ['int64', 'float64', 'float32']
if casting_cond:
x = input.astype('int64')
x_out = np.empty_like(x)
Member


It looks to me like this should be using np.result_type to determine what the correct output type is. E.g., for other floating-point dtypes, float32 or float64 will be expected, not int64.
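A sketch of the `np.result_type`-based selection suggested here: instead of sending every unsupported dtype to int64, promote to the nearest supported computation type, so e.g. float16 computes in float32 rather than int64. The function name and the exact promotion rule are illustrative assumptions:

```python
# Sketch: pick a supported computation dtype via promotion rather than
# casting everything to int64. Illustrative; not the PR's actual code.
import numpy as np

def computation_dtype(dt):
    dt = np.dtype(dt)
    if dt in (np.int64, np.float32, np.float64):
        return dt
    if np.issubdtype(dt, np.floating):
        # float16 -> float32; longdouble would need a separate decision
        return np.result_type(dt, np.float32)
    return np.dtype(np.int64)
```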

I optimized by restriction of cases and proper initialization,
also adapted for rank filter rather than the original median filter.
Allowed different boundary conditions.
Moved to C++ for polymorphism and added C-Numpy API.
Member


Could you remove the comments on evolution of the code, and copy exactly the copyright comment that was in the original file?


Is it OK with you? I would not insist on the second line, but some major changes have been made indeed.

//Copyright (c) 2011 ashelly.myopenid.com under http://www.opensource.org/licenses/mit-license
//Modified in 2024 by Gideon Genadi Kogan

Member


Yes, that is fine with me.

Allowed different boundary conditions.
Moved to C++ for polymorphism and added C-Numpy API.
*/

Member


Style comment: could you please run clang-format over this file?

//creates new Mediator: to calculate `nItems` running rank.
Mediator* MediatorNew(int nItems, int rank)
{
Mediator* m = (Mediator*)malloc(sizeof(Mediator));
Member


It would be good to use new instead of malloc, since this is C++ code now.
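For context on what the Mediator computes: it maintains a running rank with heap structures. A hedged Python sketch of the same two-heap idea, shown for a growing stream and the median only (the actual Mediator additionally evicts the oldest element via a fixed-size circular buffer and supports an arbitrary rank, giving O(log size) per element):

```python
# Sketch of the two-heap running-median idea behind the Mediator: a max-heap
# holds the lower half of the values, a min-heap the upper half, so the
# median is always at a heap root. Growing-stream version for illustration.
import heapq

def stream_medians(xs):
    lo, hi = [], []  # lo: max-heap via negation; hi: min-heap
    out = []
    for x in xs:
        heapq.heappush(lo, -x)                  # insert into lower half
        heapq.heappush(hi, -heapq.heappop(lo))  # move largest low item up
        if len(hi) > len(lo):                   # rebalance: lo keeps the extra
            heapq.heappush(lo, -heapq.heappop(hi))
        out.append(-lo[0] if len(lo) > len(hi) else (-lo[0] + hi[0]) / 2)
    return out
```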

@gideonKogan

No, just depends on the optimization level etc. Release builds are -O3, dev.py builds are -O2 -g.

For manual compilation, I did not see a file size change...

@gideonKogan

gideonKogan commented May 5, 2024

@rgommers I have made all the required modifications, hopefully fitting the expectations :)

Development

Successfully merging this pull request may close these issues.

ENH: ndimage: 1D rank filter speed up
4 participants