supports empty kernels in cuda::SeparableLinearFilters #3731

Open
wants to merge 4 commits into
base: 4.x

@chacha21 chacha21 commented Apr 30, 2024

#25408

When only a 1D convolution is needed (row-only or column-only filtering), cuda::LinearFilter might be slower than cuda::SeparableLinearFilter.
Until now, using cuda::SeparableLinearFilter for a 1D convolution meant passing a (1) kernel for the ignored dimension.
By supporting empty kernels in cuda::SeparableLinearFilter, that (1) kernel is no longer needed.
Additionally, the internal _buf used to store the intermediate convolution result can often be avoided when only a single convolution is needed.
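
To make the equivalence concrete, here is a minimal CPU reference sketch in plain C++ (not the CUDA code of this PR). It assumes a single-channel float image, a centered anchor and a replicate border; the `Image` type and the `filter1D` / `sepFilter` helpers are hypothetical names used only for illustration. It shows that a `(1)` kernel pass is an identity, so skipping the pass for an empty kernel gives the same result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major single-channel float image (hypothetical helper type).
struct Image {
    int rows = 0, cols = 0;
    std::vector<float> data;
    Image(int r, int c) : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.f) {}
    float& at(int y, int x) { return data[static_cast<std::size_t>(y) * cols + x]; }
    float at(int y, int x) const { return data[static_cast<std::size_t>(y) * cols + x]; }
};

// Clamp an index into [0, hi] (replicate border; an assumption made for brevity).
inline int clampIndex(int v, int hi) { return std::min(std::max(v, 0), hi); }

// 1D correlation along rows (horizontal == true) or along columns, centered anchor.
Image filter1D(const Image& src, const std::vector<float>& k, bool horizontal) {
    assert(!k.empty());
    const int n = static_cast<int>(k.size());
    const int anchor = n / 2;
    Image dst(src.rows, src.cols);
    for (int y = 0; y < src.rows; ++y) {
        for (int x = 0; x < src.cols; ++x) {
            float acc = 0.f;
            for (int i = 0; i < n; ++i) {
                const int o = i - anchor;
                const int yy = horizontal ? y : clampIndex(y + o, src.rows - 1);
                const int xx = horizontal ? clampIndex(x + o, src.cols - 1) : x;
                acc += k[i] * src.at(yy, xx);
            }
            dst.at(y, x) = acc;
        }
    }
    return dst;
}

// Separable filter in which an empty kernel means "skip this pass".
Image sepFilter(const Image& src,
                const std::vector<float>& rowK,
                const std::vector<float>& colK) {
    Image out = src;
    if (!rowK.empty()) out = filter1D(out, rowK, true);   // row pass
    if (!colK.empty()) out = filter1D(out, colK, false);  // column pass
    return out;
}
```

With this sketch, `sepFilter(img, rowK, {1.f})` and `sepFilter(img, rowK, {})` produce identical images; the second form simply does one pass less.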

In "legacy" usage (row+col kernels), there is no regression in cuda::SeparableLinearFilter performance.
As soon as an empty kernel is used, performance improves substantially.

The devil is in the details: "in-place" processing is supported and might need the intermediate buffer, but there is still no regression.

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

When only a 1D convolution is needed (row or column filter only), `cuda::LinearFilter` is slower than `cuda::SeparableLinearFilter`
Using `cuda::SeparableLinearFilter` for a 1D convolution is possible with the trick of passing a `(1)` kernel for the ignored dimension
By supporting empty kernels in `cuda::SeparableLinearFilter`, there is no need for that `(1)` kernel any more
Additionally, the inner `_buf` used to store the intermediate convolution result can be avoided when a single convolution is needed

In "legacy" usage (row+col kernels), there is no regression in `cuda::SeparableLinearFilter` performance
As soon as an empty kernel is used, performance improves substantially
The previous commit was incomplete and did not handle the non-CV_32F cases correctly
To limit the number of CUDA code instantiations, the rowFilter is always srcType->CV_32F and the colFilter is always CV_32F->dstType
Thus the intermediate buffer is always CV_32F
This is a little tricky when only a single kernel is used (either row or column), because a conversion of src or dst might be needed.
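
The consequence of that type flow can be sketched as a small decision function. The rule below is inferred from the `needsBuf` values printed in the benchmark output of this PR; it is not taken from the patch itself, and the `Depth` enum, `FilterConfig` struct and `needsIntermediateBuffer` name are hypothetical.

```cpp
#include <cassert>

// Depth tags standing in for CV_8U / CV_16U / CV_32F (hypothetical enum).
enum class Depth { U8, U16, F32 };

struct FilterConfig {
    bool inPlace;       // src and dst refer to the same image
    bool useRowKernel;  // row kernel is non-empty
    bool useColKernel;  // column kernel is non-empty
    Depth srcType;
    Depth dstType;
};

// Inferred rule (an assumption, reverse-engineered from the benchmark log):
// the row pass always writes CV_32F and the column pass always reads CV_32F.
bool needsIntermediateBuffer(const FilterConfig& c) {
    if (!c.useRowKernel && !c.useColKernel) return false;  // nothing to convolve
    if (c.useRowKernel && c.useColKernel) return true;     // row result feeds the column pass
    if (c.inPlace) return true;                            // a single pass cannot overwrite its own input
    if (c.useRowKernel) return c.dstType != Depth::F32;    // row output is CV_32F, dst may differ
    return c.srcType != Depth::F32;                        // column input must be CV_32F, src may differ
}
```

For example, a column-only filter on a CV_32F source needs no buffer when it is not in-place, while the same filter on a CV_8U source does, because the source must first be brought to CV_32F.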

chacha21 commented Apr 30, 2024

//size (2048x1024), 100x iterations
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_8U   (needsBuf : 0)
sepOrg:529.13 ms
sepNew:64.8515 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_16U  (needsBuf : 0)
sepOrg:529.524 ms
sepNew:95.3996 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_32F  (needsBuf : 0)
sepOrg:543.191 ms
sepNew:156.277 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_8U   (needsBuf : 0)
sepOrg:541.52 ms
sepNew:101.711 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_16U  (needsBuf : 0)
sepOrg:542.559 ms
sepNew:125.624 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_32F  (needsBuf : 0)
sepOrg:555.81 ms
sepNew:185.418 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_8U   (needsBuf : 0)
sepOrg:556.713 ms
sepNew:147.636 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_16U  (needsBuf : 0)
sepOrg:557.543 ms
sepNew:179.893 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_32F  (needsBuf : 0)
sepOrg:571.227 ms
sepNew:246.701 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:540.822 ms
sepNew:457.014 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:544.813 ms
sepNew:463.19 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_32F  (needsBuf : 1)
sepOrg:554.78 ms
sepNew:476.576 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:552.229 ms
sepNew:484.566 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:559.432 ms
sepNew:492.91 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_32F  (needsBuf : 1)
sepOrg:567.785 ms
sepNew:506.038 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_8U   (needsBuf : 0)
sepOrg:568.706 ms
sepNew:307.84 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_16U  (needsBuf : 0)
sepOrg:573.123 ms
sepNew:312.824 ms
============================================
isInPlace:0     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_32F  (needsBuf : 0)
sepOrg:582.609 ms
sepNew:323.791 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:551.797 ms
sepNew:403.631 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:552.006 ms
sepNew:435.532 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_32F  (needsBuf : 0)
sepOrg:566.179 ms
sepNew:254.639 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:563.438 ms
sepNew:414.402 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:565.194 ms
sepNew:446.6 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_32F  (needsBuf : 0)
sepOrg:578.71 ms
sepNew:265.824 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_8U   (needsBuf : 1)
sepOrg:584.837 ms
sepNew:433.561 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_16U  (needsBuf : 1)
sepOrg:585.415 ms
sepNew:465.748 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_32F  (needsBuf : 0)
sepOrg:598.477 ms
sepNew:284.285 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:563.67 ms
sepNew:563.476 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:568.284 ms!!!
sepNew:568.588 ms!!!
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_32F  (needsBuf : 1)
sepOrg:577.79 ms!!!
sepNew:577.947 ms!!!
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:575.168 ms
sepNew:574.481 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:581.33 ms
sepNew:579.709 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_32F  (needsBuf : 1)
sepOrg:591.171 ms
sepNew:589.734 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_8U   (needsBuf : 1)
sepOrg:597.099 ms
sepNew:595.435 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_16U  (needsBuf : 1)
sepOrg:601.671 ms
sepNew:599.812 ms
============================================
isInPlace:0     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_32F  (needsBuf : 1)
sepOrg:611.205 ms
sepNew:609.741 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_8U   (needsBuf : 0)
sepOrg:527.647 ms
sepNew:0.114411 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_16U  (needsBuf : 0)
sepOrg:529.202 ms
sepNew:95.1992 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_8U   dstType:CV_32F  (needsBuf : 0)
sepOrg:542.78 ms
sepNew:156.088 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_8U   (needsBuf : 0)
sepOrg:541.307 ms
sepNew:101.901 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_16U  (needsBuf : 0)
sepOrg:541.762 ms
sepNew:0.109288 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_16U  dstType:CV_32F  (needsBuf : 0)
sepOrg:555.668 ms
sepNew:184.811 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_8U   (needsBuf : 0)
sepOrg:556.58 ms
sepNew:147.596 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_16U  (needsBuf : 0)
sepOrg:557.355 ms
sepNew:179.582 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:0  srcType:CV_32F  dstType:CV_32F  (needsBuf : 0)
sepOrg:570.107 ms
sepNew:0.11555 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:539.966 ms
sepNew:456.471 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:544.997 ms
sepNew:463.23 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_8U   dstType:CV_32F  (needsBuf : 1)
sepOrg:554.855 ms
sepNew:476.561 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:552.478 ms
sepNew:484.727 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:557.832 ms
sepNew:490.721 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_16U  dstType:CV_32F  (needsBuf : 1)
sepOrg:568.171 ms
sepNew:505.864 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_8U   (needsBuf : 1)
sepOrg:568.683 ms
sepNew:307.768 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_16U  (needsBuf : 1)
sepOrg:573.375 ms
sepNew:313.126 ms
============================================
isInPlace:1     useRowKernel:0  useColKernel:1  srcType:CV_32F  dstType:CV_32F  (needsBuf : 1)
sepOrg:582.594 ms
sepNew:568.17 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:551.044 ms
sepNew:402.997 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:552.27 ms
sepNew:435.557 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_8U   dstType:CV_32F  (needsBuf : 1)
sepOrg:566.319 ms
sepNew:254.686 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:563.839 ms
sepNew:414.404 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:564.707 ms
sepNew:446.051 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_16U  dstType:CV_32F  (needsBuf : 1)
sepOrg:578.793 ms
sepNew:265.881 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_8U   (needsBuf : 1)
sepOrg:585.297 ms
sepNew:433.199 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_16U  (needsBuf : 1)
sepOrg:585.63 ms
sepNew:465.641 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:0  srcType:CV_32F  dstType:CV_32F  (needsBuf : 1)
sepOrg:598.339 ms
sepNew:531.541 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_8U   (needsBuf : 1)
sepOrg:562.944 ms
sepNew:562.755 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_16U  (needsBuf : 1)
sepOrg:568.359 ms!!!
sepNew:568.809 ms!!!
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_8U   dstType:CV_32F  (needsBuf : 1)
sepOrg:578.109 ms!!!
sepNew:578.299 ms!!!
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_8U   (needsBuf : 1)
sepOrg:575.254 ms
sepNew:574.528 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_16U  (needsBuf : 1)
sepOrg:580.567 ms
sepNew:579.043 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_16U  dstType:CV_32F  (needsBuf : 1)
sepOrg:591.259 ms
sepNew:589.886 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_8U   (needsBuf : 1)
sepOrg:597.409 ms
sepNew:595.706 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_16U  (needsBuf : 1)
sepOrg:601.568 ms
sepNew:599.995 ms
============================================
isInPlace:1     useRowKernel:1  useColKernel:1  srcType:CV_32F  dstType:CV_32F  (needsBuf : 1)
sepOrg:610.982 ms
sepNew:609.207 ms
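
To read the log as speedups rather than raw timings, a small helper like the following sketch can be used. It assumes lines of the form `sepOrg:<value> ms` / `sepNew:<value> ms`, optionally followed by the `!!!` marker; `parseMs` and `speedup` are hypothetical names.

```cpp
#include <cassert>
#include <string>

// Extract the millisecond value from a log line such as "sepOrg:529.13 ms".
double parseMs(const std::string& line) {
    const std::size_t colon = line.find(':');
    assert(colon != std::string::npos);
    // std::stod stops at the first character that is not part of the number,
    // so the trailing " ms" or " ms!!!" is ignored.
    return std::stod(line.substr(colon + 1));
}

// Ratio of original time to new time; values above 1 mean the new code is faster.
double speedup(const std::string& orgLine, const std::string& newLine) {
    return parseMs(orgLine) / parseMs(newLine);
}
```

Applied to the first entry (`sepOrg:529.13 ms`, `sepNew:64.8515 ms`) this gives a ratio a little above 8, while the entries marked `!!!` give a ratio just below 1, i.e. a difference well under 0.1%.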
