Implemented fast median filter for CUDA using Wavelet Matrix, a constant-time, HDR-compatible method #3627

Merged
merged 6 commits into from May 22, 2024

@TumoiYorozu TumoiYorozu commented Jan 22, 2024

I replaced the existing CUDA implementation of the histogram-based median filter with an implementation of a new wavelet matrix-based median filter algorithm, which I presented at SIGGRAPH Asia 2022.
This paper won the Best Paper Award in the journal track of technical papers (ACM Transactions on Graphics).

This new algorithm, like the histogram method, has the property that the window radius does not affect the computation time, and is several times faster than the histogram method. Furthermore, while the histogram method does not support HDR and only supports 8U images, this new algorithm supports HDR and also supports 16U and 32F images.
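To illustrate why a histogram approach is naturally tied to 8U data, here is a minimal CPU sketch of a histogram-based median over one window. It is not the OpenCV CUDA kernel; the function name `histogramMedian8U` is hypothetical. The point is that the method needs one bin per representable value: 256 bins for 8-bit data, 65,536 for 16-bit data, and an impractical number for 32F.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median of a non-empty window of 8-bit samples via a 256-bin histogram.
// Returns the lower median for even-sized windows.
inline std::uint8_t histogramMedian8U(const std::vector<std::uint8_t>& window)
{
    assert(!window.empty());
    std::array<std::size_t, 256> bins{};  // one bin per possible 8-bit value
    for (std::uint8_t v : window) ++bins[v];

    const std::size_t rank = (window.size() - 1) / 2;  // 0-based rank of the median
    std::size_t seen = 0;
    for (std::size_t value = 0; value < bins.size(); ++value) {
        seen += bins[value];  // number of samples <= value
        if (seen > rank) return static_cast<std::uint8_t>(value);
    }
    return 255;  // not reached for a non-empty window
}
```

The scan over the bins costs the same regardless of how many samples the window holds, which is the sense in which such methods are independent of the window radius.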

I (the author) have published the implementation on my personal GitHub and adapted it so that it can be used from OpenCV. I used the CUB library, which has been part of the standard toolkit since CUDA 11.0. Therefore, the code selects the algorithm based on CUDA_VERSION: the new algorithm for versions 11.0 and above, and the existing histogram method for versions 10 and below.
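The version gate described above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the names `selectMedianBackend`, `MedianBackend` and `kFirstCudaVersionWithCub` are hypothetical, and the version is passed as a function argument so the sketch builds without the CUDA toolkit. In real CUDA code the same comparison would typically be written as a preprocessor check on `CUDA_VERSION`.

```cpp
// CUDA_VERSION (from cuda.h) encodes the toolkit version as
// major * 1000 + minor * 10, so CUDA 11.0 is 11000.
constexpr int kFirstCudaVersionWithCub = 11000;

enum class MedianBackend { Histogram, WaveletMatrix };

// Pick the wavelet matrix path when CUB ships with the toolkit,
// otherwise fall back to the older histogram path.
constexpr MedianBackend selectMedianBackend(int cudaVersion)
{
    return cudaVersion >= kFirstCudaVersionWithCub ? MedianBackend::WaveletMatrix
                                                   : MedianBackend::Histogram;
}
```

For example, `selectMedianBackend(12030)` (CUDA 12.3) selects the wavelet matrix backend, while `selectMedianBackend(10020)` (CUDA 10.2) selects the histogram backend.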

Regarding the old histogram-based code: the CPU version of the median filter supports 16U and 32F for window sizes up to 5, but the histogram-based CUDA version does not appear to. The number of supported channels also differs: the CPU version supports 1, 3, and 4 channels, while the CUDA version supports only 1 channel. In addition, the CUDA histogram method set pixels near the image edges, i.e. where the window does not fit entirely inside the image, to zero. For example, with a window size of 7, a 3-pixel-wide band along the top, bottom, left, and right edges was not calculated correctly. The existing tests worked around this by cropping the edges with a rect before comparing against the CPU version, but the cropped area was too wide: 8 pixels were cropped from each side when the window size was 7.
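The border arithmetic in this paragraph can be checked with a small sketch. The `Roi` struct is a hypothetical stand-in for `cv::Rect` so the example builds without OpenCV, and `validMedianRegion` is an illustrative name, not a function from the test suite. It computes the region in which a full ksize x ksize window fits inside the image, i.e. the tightest crop that excludes the incorrectly computed border.

```cpp
#include <cassert>

struct Roi { int x, y, width, height; };  // hypothetical stand-in for cv::Rect

// Region whose full ksize x ksize window lies inside a cols x rows image.
inline Roi validMedianRegion(int cols, int rows, int ksize)
{
    assert(ksize > 0 && ksize % 2 == 1);  // median windows have odd size
    const int radius = ksize / 2;         // e.g. ksize 7 gives radius 3
    assert(cols > 2 * radius && rows > 2 * radius);
    return Roi{radius, radius, cols - 2 * radius, rows - 2 * radius};
}
```

With a window size of 7 the radius is 3, so only 3 pixels per side need to be excluded, not 8. On a 113x113 image that leaves a 107x107 region for comparison.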

In this PR, I first corrected the rect range for the tests so that both the old histogram method and the new wavelet matrix method can pass. Also, the CUDA version now supports 16U, 32F, and multi-channel formats such as 3 and 4 channels. In addition, while the CPU version only supports window sizes up to 5 for HDR, the new CUDA Wavelet Matrix method supports sizes of 7 and above. Additionally, I have added new tests for 16U, 32F, and multi-channel formats, specifically 3 and 4 channels.

Paper’s project page: Constant Time Median Filter using 2D Wavelet Matrix | Interactive Graphics & Engineering Lab
My implementation (as author): GitHub - TumoiYorozu/WMatrixMedian

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
    - [ ] There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

@TumoiYorozu TumoiYorozu changed the title from "Implemented fast median filter for CUDA using a constant-time, HDR-compatible Wavelet Matrix method" to "Implemented fast median filter for CUDA using Wavelet Matrix, a constant-time, HDR-compatible method" Jan 23, 2024
@TumoiYorozu
Contributor Author

Here is a comparison of pre- and post-PR performance tests.
(I lowered the test parameters to suit lower-end GPUs, so the speedup appears smaller on more recent, more powerful GPUs.)

RTX3090, Core i9-9900X, Ubuntu 20.04, CUDA 12.3

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=24   mean=7.33   median=7.37   min=6.33   stddev=0.22 (3.0%))
[       OK ] Sz_KernelSz_Median.Median/20 (179 ms)
[----------] 21 tests from Sz_KernelSz_Median (2339 ms total)
[  PASSED  ] 21 tests.

(old) x1.98

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=10   mean=30.22   median=30.06   min=30.05   stddev=0.51 (1.7%))
[       OK ] Sz_KernelSz_Median.Median/20 (307 ms)
[----------] 21 tests from Sz_KernelSz_Median (4634 ms total)
[  PASSED  ] 21 tests.

Tesla P100, Xeon E5-2650L v4, Ubuntu 18.04, CUDA 11.6

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=10   mean=19.17   median=19.32   min=17.63   stddev=0.55 (2.9%))
[       OK ] Sz_KernelSz_Median.Median/20 (198 ms)
[----------] 21 tests from Sz_KernelSz_Median (3619 ms total)
[  PASSED  ] 21 tests.

(old) x3.19

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=10   mean=76.36   median=76.25   min=76.23   stddev=0.35 (0.5%))
[       OK ] Sz_KernelSz_Median.Median/20 (777 ms)
[----------] 21 tests from Sz_KernelSz_Median (11548 ms total)
[  PASSED  ] 21 tests.

Tesla V100, Xeon E5-2695 v4, Ubuntu 18.04, CUDA 11.6

[ RUN      ] CUDA_Filters/Median.Accuracy/27, where GetParam() = (Tesla V100-SXM2-16GB, 113x113, 8UC1, KernelSize(15), sub matrix)
[       OK ] CUDA_Filters/Median.Accuracy/27 (2 ms)
[----------] 28 tests from CUDA_Filters/Median (805 ms total)
[  PASSED  ] 28 tests.

(old) x3.20

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=10   mean=7.94   median=7.93   min=7.77   stddev=0.10 (1.2%))
[       OK ] Sz_KernelSz_Median.Median/20 (82 ms)
[----------] 21 tests from Sz_KernelSz_Median (2583 ms total)
[  PASSED  ] 21 tests.

Tesla A100, Xeon Gold 6326, Ubuntu 20.04, CUDA 11.6

[ PERFSTAT ]    (samples=10   mean=40.84   median=40.53   min=40.52   stddev=0.98 (2.4%))
[       OK ] Sz_KernelSz_Median.Median/20 (411 ms)
[----------] 21 tests from Sz_KernelSz_Median (6701 ms total)
[  PASSED  ] 21 tests.

(old) x1.33

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=34   mean=8.85   median=8.90   min=7.41   stddev=0.26 (3.0%))
[       OK ] Sz_KernelSz_Median.Median/20 (304 ms)
[----------] 21 tests from Sz_KernelSz_Median (8881 ms total)
[  PASSED  ] 21 tests.

@TumoiYorozu
Contributor Author

I also checked the build on Windows.

RTX A6000, i9-13900K, Windows 11, CUDA 11.6

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=13   mean=32.08   median=31.85   min=31.50   stddev=0.69 (2.2%))
[       OK ] Sz_KernelSz_Median.Median/20 (428 ms)
[  PASSED  ] 21 tests.

(old) x2.1

[ RUN      ] Sz_KernelSz_Median.Median/20, where GetParam() = (1920x1080, 15)
[ PERFSTAT ]    (samples=15   mean=13.54   median=13.58   min=12.83   stddev=0.40 (3.0%))
[       OK ] Sz_KernelSz_Median.Median/19 (206 ms)
[  PASSED  ] 21 tests.

@TumoiYorozu
Contributor Author

TumoiYorozu commented Jan 24, 2024

A CI test failed with the following error, so I fixed the code.

error: calling a constexpr __host__ function("min") from a __global__ function("WaveletMatrixMultiCu4G_UpSweep_gpu") is not allowed. The experimental flag '--expt-relaxed-constexpr' can be used to allow this.
https://github.com/opencv/opencv_contrib/actions/runs/7618879917/job/20811504191?pr=3627
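The error arises because `std::min` is a constexpr host function, which nvcc refuses to call from device code unless `--expt-relaxed-constexpr` is passed. One common way to avoid depending on that flag is a small helper that is explicitly marked for both host and device. The sketch below is an assumption about how such a fix can look, not necessarily the change made in this PR; `WM_HOST_DEVICE` and `wmMin` are hypothetical names. The macro expands to nothing under a plain C++ compiler, so the snippet also builds without nvcc.

```cpp
// Expand to CUDA execution-space qualifiers only when compiled by nvcc.
#ifdef __CUDACC__
#define WM_HOST_DEVICE __host__ __device__
#else
#define WM_HOST_DEVICE
#endif

// Minimum of two values, callable from host code and from CUDA kernels.
// Returns a when the two values compare equal.
template <typename T>
WM_HOST_DEVICE constexpr T wmMin(T a, T b)
{
    return b < a ? b : a;
}
```

Inside a kernel, a call such as `wmMin(x, limit)` then compiles without the experimental flag. CUDA's own built-in `min` overloads for device code are another option.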

@TumoiYorozu
Contributor Author

TumoiYorozu commented Jan 25, 2024

I found that older compiler versions do not support if constexpr, which led to compilation failures. For GCC, Clang, VC++, and the Intel Compiler, I researched which versions started supporting if constexpr, and verified on platforms such as gcc.godbolt.org that the code now compiles successfully with older versions. During these checks, I used the -std=c++11 option to confirm that compilation succeeds.
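A common C++11-compatible replacement for `if constexpr` is tag dispatch on `std::true_type` / `std::false_type`. The sketch below shows the pattern on a hypothetical helper, `toSortableKey`, that maps a pixel value to an unsigned key whose ordering matches the ordering of the values. Both the helper and the key mapping are illustrative assumptions and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

namespace detail {
// Overload chosen for floating-point input (float only in this sketch).
inline std::uint32_t toKey(float v, std::true_type)
{
    assert(v == v);  // NaN has no meaningful rank
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    // Flip all bits of negatives and set the sign bit of non-negatives,
    // so unsigned comparison of keys matches float comparison.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Overload chosen for unsigned integers: the value is already its own key.
template <typename T>
inline std::uint32_t toKey(T v, std::false_type)
{
    return static_cast<std::uint32_t>(v);
}
}  // namespace detail

// In C++17 the body could be a single `if constexpr` on the same trait.
template <typename T>
inline std::uint32_t toSortableKey(T v)
{
    static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");
    static_assert(std::is_same<T, float>::value || std::is_unsigned<T>::value,
                  "sketch handles float and unsigned integer types only");
    return detail::toKey(v, std::is_floating_point<T>());
}
```

The trait object passed as the second argument selects the overload at compile time, so only the matching branch is instantiated for each type, which is the property `if constexpr` provides in C++17.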

@TumoiYorozu
Contributor Author

@opencv-alalek I have completed the code fixes and verified that the code builds and passes the tests in various environments; it is ready to be merged. Could you please test it?

@opencv-alalek

/cc @cudawarped

@cudawarped
Contributor

/cc @cudawarped

Thanks @opencv-alalek. I really would like to take a look at this, but I'm not going to have time for several weeks or longer, so it's probably best to count me out on this one. Thank you.

@@ -703,6 +704,18 @@ INSTANTIATE_TEST_CASE_P(CUDA_Filters, Median, testing::Combine(
WHOLE_SUBMAT)
);

}} // namespace
#ifdef __OPENCV_USE_WAVELET_MATRIX_FOR_MEDIAN_FILTER_CUDA__

Let's run these tests regardless of optimization support.

Contributor Author

Thank you very much for your review.
In environments where the optimization is not available (e.g. CUDA 10 or lower), this test will always fail. Is that still OK?

In general API support should not depend on available optimizations. For such cases we usually have fallback generic code.

You mean that it fails on the CV_Assert(srcType == CV_8UC1); assertion, right?

It makes sense to emit an Error::StsNotImplemented error instead of a generic assertion.
I believe it is OK in CUDA cases to skip the test if an Error::StsNotImplemented error is raised.
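The pattern suggested here can be sketched as follows. To keep the example buildable without OpenCV, it uses hypothetical stand-ins: `FilterError` plays the role of `cv::Exception`, `kNotImplemented` the role of `cv::Error::StsNotImplemented`, and `createMedianFilter` / `runMedianTest` are illustrative names rather than functions from the PR. In OpenCV itself the error would typically be raised with `CV_Error(cv::Error::StsNotImplemented, ...)`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins so the sketch builds without OpenCV.
const int kNotImplemented = 1;
struct FilterError : std::runtime_error {
    int code;
    FilterError(int c, const std::string& msg) : std::runtime_error(msg), code(c) {}
};
enum class Depth { U8, U16, F32 };
enum class TestOutcome { Passed, Skipped };

// Report unsupported inputs with a specific error code rather than a
// generic assertion, so callers can tell "unsupported" from "broken".
inline void createMedianFilter(Depth depth, bool hasWaveletBackend)
{
    if (!hasWaveletBackend && depth != Depth::U8)
        throw FilterError(kNotImplemented, "median filter: fallback backend supports 8U only");
}

// The test runs everywhere and is skipped only on the "not implemented" code.
inline TestOutcome runMedianTest(Depth depth, bool hasWaveletBackend)
{
    try {
        createMedianFilter(depth, hasWaveletBackend);
    } catch (const FilterError& e) {
        if (e.code == kNotImplemented) return TestOutcome::Skipped;
        throw;  // any other error is a real failure
    }
    return TestOutcome::Passed;
}
```

This keeps the test list identical across builds: a configuration without the optimized backend reports skipped tests instead of failing or silently omitting them.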

@@ -0,0 +1,1011 @@
#ifndef __OPENCV_WAVELET_MATRIX_2D_CUH__

Please add the OpenCV license header to the added files: https://github.com/opencv/opencv/wiki/Coding_Style_Guide#file-structure

// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.

If this code is borrowed from somewhere, then please add the original license as well.

@TumoiYorozu
Contributor Author

@opencv-alalek I have made the requested changes (fixed the tests and added the license). Please check again at your convenience.

@asmorkalov asmorkalov merged commit e46ba34 into opencv:4.x May 22, 2024
10 checks passed
@TumoiYorozu
Contributor Author

@asmorkalov Thank you for reviewing and merging the pull request. The code was very complex, but I appreciate your effort.
