Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add interleaved versions of phase/cartToPolar/polarToCart #3607

Open
wants to merge 8 commits into
base: 4.x
Choose a base branch
from

Conversation

chacha21
Copy link

This PR is for performance only (at the cost of more template code and increased GPU code size) The additional variants can help the caller skip the creation of temporary GPU mats (where memory is more likely to be a critical resource), and can even allow in-place processing. magnitude/angles/x/y are often already interleaved when dealing with DFTs.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

This PR is for performance only (at the cost of more template code and increased GPU code size)
The additional variants can help the caller skip the creation of temporary GPU mats (where memory is more likely to be a critical resource), and can even allow in-place processing.
magnitude/angles/x/y are often already interleaved when dealing with DFTs.
@asmorkalov
Copy link
Contributor

@cudawarped could you take a look?

@cudawarped
Copy link
Contributor

@cudawarped could you take a look?

Of course, but I may not have time before the release of 4.9.0.

additional "typename" disambiguifiers  are required by some compilers

GpuMat dst = getOutputMat(_dst, xy.size(), CV_32FC1, stream);

GpuMat_<float2> xyc(xy.reshape(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the existing functions use reshape to convert a GpuMat to a GpuMat_ before being passed to gridTransformxxx but I find this confusing because nothing is being reshaped, would it not be better to use globPtr<> directly on the GpuMats. e.g.

    if (angleInDegrees)
        gridTransformUnary(globPtr<float2>(xy), globPtr<float>(dst), direction_interleaved_func<float2, true>(), stream);

If so the existing routines could be updated to remove the bloat.

GpuMat mag = getOutputMat(_mag, xy.size(), CV_32FC1, stream);
GpuMat angle = getOutputMat(_angle, xy.size(), CV_32FC1, stream);

GpuMat_<float2> xyc(xy.reshape(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.


GpuMat magAngle = getOutputMat(_magAngle, xy.size(), CV_32FC2, stream);

GpuMat_<float2> xyc(xy.reshape(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

@@ -192,6 +276,49 @@ namespace
ymat(y, x) = mag_val * sin_a;
}

template <typename T, bool useMag>
__global__ void polarToCartDstInterleavedImpl_(const GlobPtr<T> mag, const GlobPtr<T> angle, GlobPtr<typename MakeVec<T, 2>::type > xymat, const T scale, const int rows, const int cols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use PtrStep<T> for mag, angle and xymat then you pass them directly to this function instead of using the intermediate GpuMat_<T> with reshape. Additionaly if angle is a PtrStepSz<T> then you don't need to pass cols and rows seperately. e.g.

    __global__ void polarToCartDstInterleavedImpl_(const PtrStep<T> mag, const PtrStepSz<T> angle, PtrStep<typename MakeVec<T, 2>::type > xymat, const T scale, const int rows, const int cols)
    {
        typedef typename MakeVec<T, 2>::type T2;
        const int x = blockDim.x * blockIdx.x + threadIdx.x;
        const int y = blockDim.y * blockIdx.y + threadIdx.y;

        if (x >= angle.cols || y >= angle.rows)
            return;

You can also make this adjustment to all the other polarToCart kernel calls including the existing one.

void polarToCartDstInterleavedImpl(const GpuMat& mag, const GpuMat& angle, GpuMat& xy, bool angleInDegrees, cudaStream_t& stream)
{
typedef typename MakeVec<T, 2>::type T2;
GpuMat_<T2> xyc(xy.reshape(2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you switch to PtrStep and PtrStepSz inside polarToCartDstInterleavedImpl_ then this can be simplified to

    template <typename T>
    void polarToCartDstInterleavedImpl(const GpuMat& mag, const GpuMat& angle, GpuMat& xy, bool angleInDegrees, cudaStream_t& stream)
    {
        const dim3 block(32, 8);
        const dim3 grid(divUp(angle.cols, block.x), divUp(angle.rows, block.y));

        const T scale = angleInDegrees ? static_cast<T>(CV_PI / 180.0) : static_cast<T>(1.0);

        if (mag.empty())
            polarToCartDstInterleavedImpl_<T, false> << <grid, block, 0, stream >> >(mag, angle, xy, scale, angle.rows, angle.cols);
        else
            polarToCartDstInterleavedImpl_<T, true> << <grid, block, 0, stream >> >(mag, angle, xy, scale, angle.rows, angle.cols);
    }


cv::cuda::GpuMat dstX1Y1 = createMat(size, CV_32FC1, useRoi);
cv::cuda::GpuMat dstXY2 = createMat(size, CV_32FC1, useRoi);
cv::cuda::phase(loadMat(x, useRoi), loadMat(y, useRoi), dstX1Y1, angleInDegrees);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a test case per function and compare the results to the CPU version it will make it easier to maintain going forward.

angle = randomMat(size, type);
cv::Mat magnitudeAngle;
cv::merge(magnitudeAngleChannels, 2, magnitudeAngle);
const double tol = (type == CV_32FC1 ? 1.6e-4 : 1e-4) * (angleInDegrees ? 1.0 : 19.47);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I would suggest a test case per function, comparing to cv::polarToCart especially if you are going to use these tolerances. In this funciton you could now be 2*tol away from the CPU result.

use globPtr() and PtrStepSz<> to bypass confusing reshape()
refactor tests
the "empty mag" feature is useless for interleaved case
get row/col size from angle mat rather than mag mat than could be empty in other cases
const int x = blockDim.x * blockIdx.x + threadIdx.x;
const int y = blockDim.y * blockIdx.y + threadIdx.y;

if (x >= xymat.cols || y >= xymat.rows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to keep the out of range check consistent. I realize mag can be empty but your using xymat, angle and magAngle. Maybe stick with angle and magAngle.

Then you only need to use PtrStepSz for angle/magAngle the other inputs can be PtrStep.

GpuMat mag = getOutputMat(_mag, xy.size(), CV_32FC1, stream);
GpuMat angle = getOutputMat(_angle, xy.size(), CV_32FC1, stream);

GpuMat_<float> magc(mag.reshape(1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the reshape completely, looking at it again it doesn't do anything? i.e.
GpuMat_ magc(mag);

@@ -2809,6 +2850,97 @@ INSTANTIATE_TEST_CASE_P(CUDA_Arithm, CartToPolar, testing::Combine(
testing::Values(AngleInDegrees(false), AngleInDegrees(true)),
WHOLE_SUBMAT));

PARAM_TEST_CASE(CartToPolarInterleaved1, cv::cuda::DeviceInfo, cv::Size, AngleInDegrees, UseRoi)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give these test cases more informative names. Worst case scenario you could use CartToPolarInputInterleaved, CartToPolarInputOutputInterleaved, PolarToCartOutputInterleaved, PolarToCartInputOutputInterleaved if you can't think of anything better.

code style and simplifications
Copy link
Contributor

@cudawarped cudawarped left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
Passing tests on Windows 11, VS 2022 with CUDA 12.3.
@asmorkalov can you take a look.

@cudawarped
Copy link
Contributor

cudawarped commented Jan 9, 2024

@chacha21 You will need to squash and rebase this onto the tip of the 4.x branch as the CUDA CMake configuration has changed in the main repo since you submited this PR so I think it will fail on the CI.

@chacha21
Copy link
Author

chacha21 commented Jan 9, 2024

@chacha21 You will need to squash and rebase this onto the tip of the 4.x branch as the CUDA CMake configuration has changed in the main repo since you submited this PR.

Is this OK after "Merge branch '4.x'" ? My brain has never accepted git terminology, I am not sure about the good operation (with GitHub Desktop)

@asmorkalov
Copy link
Contributor

I rebased the patch to current 4.x and got build error with Cuda 11.8 and Ubuntu 18.04:

[ 14%] Processing OpenCL kernels (core)
/home/alexander/Projects/OpenCV/opencv_contrib/modules/cudev/include/opencv2/cudev/functional/detail/../functional.hpp(625): error: a class or namespace qualified name is required

/home/alexander/Projects/OpenCV/opencv_contrib/modules/cudev/include/opencv2/cudev/functional/detail/../functional.hpp(643): error: a class or namespace qualified name is required

/home/alexander/Projects/OpenCV/opencv_contrib/modules/cudev/include/opencv2/cudev/functional/detail/../functional.hpp(625): error: a class or namespace qualified name is required

/home/alexander/Projects/OpenCV/opencv_contrib/modules/cudev/include/opencv2/cudev/functional/detail/../functional.hpp(643): error: a class or namespace qualified name is required

2 errors detected in the compilation of "/home/alexander/Projects/OpenCV/opencv-master/modules/core/src/cuda/gpu_mat_nd.cu".
CMake Error at cuda_compile_1_generated_gpu_mat_nd.cu.o.Release.cmake:280 (message):
  Error generating file
  /home/alexander/Projects/OpenCV/opencv_contrib_build/modules/core/CMakeFiles/cuda_compile_1.dir/src/cuda/./cuda_compile_1_generated_gpu_mat_nd.cu.o


modules/core/CMakeFiles/opencv_core.dir/build.make:82: recipe for target 'modules/core/CMakeFiles/cuda_compile_1.dir/src/cuda/cuda_compile_1_generated_gpu_mat_nd.cu.o' failed
make[3]: *** [modules/core/CMakeFiles/cuda_compile_1.dir/src/cuda/cuda_compile_1_generated_gpu_mat_nd.cu.o] Error 1
make[3]: *** Ожидание завершения заданий…
2 errors detected in the compilation of "/home/alexander/Projects/OpenCV/opencv-master/modules/core/src/cuda/gpu_mat.cu".
CMake Error at cuda_compile_1_generated_gpu_mat.cu.o.Release.cmake:280 (message):
  Error generating file
  /home/alexander/Projects/OpenCV/opencv_contrib_build/modules/core/CMakeFiles/cuda_compile_1.dir/src/cuda/./cuda_compile_1_generated_gpu_mat.cu.o


modules/core/CMakeFiles/opencv_core.dir/build.make:75: recipe for target 'modules/core/CMakeFiles/cuda_compile_1.dir/src/cuda/cuda_compile_1_generated_gpu_mat.cu.o' failed

@chacha21
Copy link
Author

chacha21 commented May 2, 2024

I don't have such a problem with CUDA 12.4 under Visual Studio 2022
Could it be related to b330b6c ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants