core (opencl): CLBlast integration via dynamic loading #25568

Status: Open, 7 commits to merge into base: 4.x

@fengyuentau (Member) commented May 10, 2024

The second commit consists entirely of auto-generated code.

Usage

Get CLBlast:

git clone https://github.com/CNugteren/CLBlast
cmake -B build -S CLBlast -DCMAKE_INSTALL_PREFIX=build/install
cmake --build build --target install -j8

Test with this patch:

git clone https://github.com/fengyuentau/opencv
cd opencv
git checkout clblast_integration

export CLBLAST_INSTALL_DIR=/abs/path/to/CLBLAST-build/install
cmake -B build -DWITH_OPENCL=ON .
cmake --build build --target opencv_test_core opencv_perf_core -j8

export LD_LIBRARY_PATH=/abs/path/to/CLBLAST-build/install/lib # Use DYLD_LIBRARY_PATH on macOS
./build/bin/opencv_test_core --gtest_filter="*OCL_*Gemm*"
./build/bin/opencv_perf_core --gtest_filter="*OCL_GemmFixture_Gemm*"

Performance

All perf results in a zip: perf.mali-g52+m1+uhd770+gtx1080ti.zip.

Usage example:

python opencv/modules/ts/misc/summary.py opencv_perf_core.gtx1080ti.xml opencv_perf_core.gtx1080ti.clblast.xml

Khadas VIM4 (8GB mem, 32GB disk space) with Mali G52 r1p0

Geometric mean (ms)

Columns: baseline = opencv_perf_core.mali-g52, CLBlast = opencv_perf_core.mali-g52.clblast, x-factor = CLBlast vs baseline.

Name of Test                                                    baseline          CLBlast                x-factor
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      40.127            24.210                 1.66
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      100.475           159.676                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               41.968            23.417                 1.79
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               103.093           100.620                1.02
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      43.569            24.015                 1.81
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      104.165           99.436                 1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               42.239            25.108                 1.68
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               102.858           155.517                0.66
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      37.940            21.861                 1.74
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      95.145            149.887                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               37.004            21.203                 1.75
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               93.874            153.838                0.61
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    288.005           147.788                1.95
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                    778.214           591.137                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)             294.309           144.776                2.03
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)             784.807           588.121                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)    296.440           147.466                2.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)    790.240           590.712                1.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)             294.904           149.621                1.97
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)             786.937           595.259                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)    281.212           137.360                2.05
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)    760.520           572.009                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             278.671           135.431                2.06
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)             755.883           568.079                1.33
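
A note on reading the x-factor column: judging from the rows above, it appears to be the baseline time divided by the CLBlast time, so values above 1 mean CLBlast is faster and values below 1 mean it is slower. The sketch below reproduces that arithmetic. The function names `geometricMean` and `xFactor` are hypothetical and are not taken from summary.py.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of per-run timings in milliseconds, computed in log space
// so that long sample lists do not overflow the running product.
inline double geometricMean(const std::vector<double>& samplesMs) {
    assert(!samplesMs.empty());
    double logSum = 0.0;
    for (double s : samplesMs) {
        assert(s > 0.0);  // timings must be strictly positive
        logSum += std::log(s);
    }
    return std::exp(logSum / static_cast<double>(samplesMs.size()));
}

// Assumed definition of the x-factor, inferred from the table rows:
// baseline time over candidate time, so > 1 means the candidate is faster.
inline double xFactor(double baselineMs, double candidateMs) {
    assert(candidateMs > 0.0);
    return baselineMs / candidateMs;
}
```

For the first Mali G52 row, `xFactor(40.127, 24.210)` is about 1.657, which matches the reported 1.66; for the second row, `xFactor(100.475, 159.676)` is about 0.629, matching the reported 0.63.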

Macbook Air M1 (16GB mem, 512GB disk space)

CLBlast has an accuracy problem with scale >= 1280, but it is OK with scale = 1024.

Geometric mean (ms)

Columns: baseline = opencv_perf_core.m1, CLBlast = opencv_perf_core.m1.clblast, x-factor = CLBlast vs baseline.

Name of Test                                                 baseline    CLBlast        x-factor
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                    2.756       3.033           0.91
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                   10.924      11.487           0.95
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)             4.238       3.738           1.13
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)            13.757      14.091           0.98
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)    4.396       3.320           1.32
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)   14.316      11.654           1.23
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)             4.287       3.525           1.22
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)            14.061      13.599           1.03
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)    4.502       3.955           1.14
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)   13.066      12.714           1.03
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)             3.938       3.896           1.01
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)            14.141      13.373           1.06
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                 34.337      failed             -
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                 128.817     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)          35.070      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)          131.373     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1) 35.882      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2) 132.787     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)          34.672      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)          131.903     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1) 35.527      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2) 132.244     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)          34.429      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)          131.407     failed             -

PC with i7-12700K (64GB mem, 1T disk space) with Intel(R) UHD Graphics 770

CLBlast has an accuracy problem with complex matrices (type CV_32FC2).

Geometric mean (ms)

Columns: baseline = opencv_perf_core.uhd770, CLBlast = opencv_perf_core.uhd770.clblast, x-factor = CLBlast vs baseline.

Name of Test                                                    baseline        CLBlast           x-factor
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      1.177           1.840               0.64
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      9.703          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               1.526           1.591               0.96
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               9.783          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      3.836           1.869               2.05
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      9.914          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               1.526           2.050               0.74
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               9.805          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      1.533           2.103               0.73
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      9.821          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               1.180           1.850               0.64
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               9.737          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    9.267          11.262               0.82
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                   77.391          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)            12.185          10.151               1.20
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)            78.431          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)   30.303          11.136               2.72
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)   78.971          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)            10.979          12.519               0.88
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)            78.100          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)   11.008          12.471               0.88
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)   78.144          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             9.310          11.254               0.83
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)            77.435          failed                 -

PC with GTX 1080 Ti (12GB gpu mem, CUDA 12.3)

Geometric mean (ms)

Columns: baseline = opencv_perf_core.gtx1080ti, CLBlast = opencv_perf_core.gtx1080ti.clblast, x-factor = CLBlast vs baseline.

Name of Test                                                    baseline           CLBlast                x-factor
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                       0.338              0.307                   1.10
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                       0.654              0.480                   1.36
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)                0.432              0.306                   1.41
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)                0.819              0.483                   1.70
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)       0.505              0.285                   1.77
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)       0.916              0.520                   1.76
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)                0.431              0.292                   1.48
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)                0.821              0.497                   1.65
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)       0.398              0.296                   1.34
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)       0.690              0.499                   1.38
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)                0.338              0.308                   1.10
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)                0.656              0.483                   1.36
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                     2.018              1.395                   1.45
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                     4.259              3.759                   1.13
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)              2.270              0.969                   2.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)              4.830              3.071                   1.57
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)     2.555              1.200                   2.13
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)     5.287              3.536                   1.50
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)              2.376              1.359                   1.75
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)              4.855              3.892                   1.25
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)     2.375              1.364                   1.74
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)     4.852              3.916                   1.24
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)              2.121              1.181                   1.80
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)              4.506              3.547                   1.27

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake
force_builders=Linux OpenCL

@fengyuentau (Member, Author):

Observed problems:

  1. On Intel i7-12700K with Intel(R) UHD Graphics 770: CLBlast has an accuracy problem with complex matrices (type CV_32FC2).
  2. On Apple M1: CLBlast has an accuracy problem if scale >= 1280, but it is OK with scale = 1024.

@opencv-alalek (Contributor) left a comment:

Dynamic loading makes sense if the project being used has strong API versioning.
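
To make the versioning concern concrete: with dynamic loading, the function signatures are fixed when OpenCV is built, while the library itself is only resolved at run time. The loader can check that a symbol exists, but not that its signature still matches. A minimal POSIX sketch follows; `BlasBackend` and `loadBlasBackend` are hypothetical names for illustration and are not the code in this patch.

```cpp
#include <cassert>
#include <dlfcn.h>   // POSIX dynamic loading: dlopen, dlsym, dlclose
#include <string>

// Hypothetical holder for a dynamically loaded BLAS library and one entry point.
struct BlasBackend {
    void* handle = nullptr;  // dlopen result, nullptr when the library is missing
    void* sgemm = nullptr;   // address of the GEMM symbol, nullptr when not exported
    bool usable() const { return handle != nullptr && sgemm != nullptr; }
};

// Try to load libName and resolve one symbol. On any failure the backend is
// returned unusable, so the caller can fall back to the built-in kernels.
inline BlasBackend loadBlasBackend(const std::string& libName,
                                   const std::string& symbol) {
    assert(!libName.empty() && !symbol.empty());
    BlasBackend backend;
    backend.handle = dlopen(libName.c_str(), RTLD_LAZY | RTLD_LOCAL);
    if (!backend.handle)
        return backend;  // library not found on the loader search path
    backend.sgemm = dlsym(backend.handle, symbol.c_str());
    if (!backend.sgemm) {
        // The library loaded but does not export the symbol: treat as absent.
        dlclose(backend.handle);
        backend.handle = nullptr;
    }
    return backend;
}
```

On Linux a call might look like `loadBlasBackend("libclblast.so", "CLBlastSgemm")`. A successful lookup only proves the symbol is exported; if a later library release changed the argument list, the call through the old function pointer type would still compile and then misbehave, which is why strong API versioning matters here.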

modules/core/include/opencv2/core/opencl/ocl_defs.hpp (outdated, resolved)
if(WITH_CLBLAST)
find_path(CLBLAST_INCLUDE_DIR
NAMES clblast_c.h
HINTS ENV CLBLAST_INSTALL_DIR
Contributor:

If we want to always use "dynamic loading" then it makes sense to place this header into 3rdparty/include

Member, Author:

Yes, it may work, but the CLBlast version is bumped occasionally, along with tuned results for different devices. If we put a fixed header in 3rdparty/include, I guess it won't block a newer library unless this file changes significantly, right?

Contributor:

There is just no reason to use DIFFERENT versions of:

  • the header clblast_c.h
  • the autogenerated files with functions from another version of clblast_c.h

modules/core/src/matmul.dispatch.cpp (three outdated, resolved threads)
modules/core/src/matmul.dispatch.cpp (resolved)
(const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
(float)beta,
(cl_mem)D.handle(ACCESS_RW), offsetC, ldc,
&queue, NULL);
Contributor:

What about async processing?

Member, Author:

Do you mean processing several calls asynchronously, for example several calls to Sgemm?

modules/core/src/ocl.cpp (outdated, resolved)
modules/core/src/opencl/runtime/opencl_clblast.cpp (outdated, resolved)
@@ -415,6 +415,9 @@ OCV_OPTION(WITH_OPENCLAMDFFT "Include AMD OpenCL FFT library support" ON
OCV_OPTION(WITH_OPENCLAMDBLAS "Include AMD OpenCL BLAS library support" ON
VISIBLE_IF NOT ANDROID AND NOT IOS AND NOT XROS AND NOT WINRT
VERIFY HAVE_CLAMDBLAS)
OCV_OPTION(WITH_CLBLAST "Include CLBlast library support" ON
VISIBLE_IF TRUE
VERIFY HAVE_CLBLAST)
Contributor:

VERIFY

This would break existing build configurations that use ENABLE_CONFIG_VERIFICATION.

/cc @mshabunin

Contributor:

I've checked a build with CLBlast and verification enabled; it works fine:

cmake \
        -DCMAKE_INSTALL_PREFIX=install \
        -DWITH_QT=ON \
        -DWITH_1394=OFF \
        -DWITH_JASPER=OFF \
        -DWITH_OPENCLAMDFFT=OFF \
        -DWITH_OPENCLAMDBLAS=OFF \
        -DWITH_LAPACK=OFF \
        -DWITH_CLBLAST=ON \
        -DENABLE_CONFIG_VERIFICATION=ON \
        ../opencv

...

--   OpenCL:                        YES (CLBlast INTELVA)
--     Include path:                /work/opencv/3rdparty/include/opencl/1.2 /usr/include
--     Link libraries:              Dynamic load

...

-- Verifying WITH_CLBLAST=ON => 'HAVE_CLBLAST'=TRUE

...

BTW, I observe several identical warnings in matmul.dispatch.cpp (GCC 11, Ubuntu 22):

/opencv/modules/core/src/matmul.dispatch.cpp:147:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  147 |                               (const cl_mem)A.handle(ACCESS_READ), offsetA, lda,

...

/opencv/modules/core/src/matmul.dispatch.cpp:184:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  184 |                               (const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
      |                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Member, Author:

"BTW, I observe several identical warnings in matmul.dispatch.cpp (GCC 11, Ubuntu 22)"

The warnings are resolved by removing const.
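
For context on why removing const is the right fix: cl_mem is a typedef for a pointer (struct _cl_mem*), so `(const cl_mem)` asks for a const pointer, not a pointer to const data, and a qualifier on the value produced by a cast has no effect. The sketch below shows this with stand-in types so that it compiles without OpenCL headers; `FakeMemObject`, `FakeMem` and `toMem` are hypothetical names.

```cpp
#include <cassert>
#include <type_traits>

struct FakeMemObject;            // hypothetical stand-in for OpenCL's opaque _cl_mem
using FakeMem = FakeMemObject*;  // mirrors cl_mem being a typedef for a pointer

// `const FakeMem` is a const pointer to a mutable object, not a pointer to
// const data, so it gives no read-only guarantee about the buffer.
static_assert(std::is_same<const FakeMem, FakeMemObject* const>::value,
              "const applies to the pointer itself");
static_assert(!std::is_same<const FakeMem, const FakeMemObject*>::value,
              "const does not apply to the pointed-to object");

// Convert a raw handle (UMat::handle returns void*) with an unqualified cast,
// which is what avoids -Wignored-qualifiers.
inline FakeMem toMem(void* raw) {
    assert(raw != nullptr);  // a null handle would mean an unallocated buffer
    return static_cast<FakeMem>(raw);
}
```

The read-only intent is already expressed by the ACCESS_READ flag passed to handle(), so nothing is lost by dropping the qualifier from the cast.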

@vpisarev (Contributor):

@fengyuentau, from the patch I can conclude that we need only a small portion of CLBlast. Can we extract a subset of CLBlast, put it in opencv/3rdparty, and link it to OpenCV (i.e. not use dynamic loading, which is much less convenient for end users)? Also, I believe we need to solve the problems on Mac and Intel somehow. I remember you said (and I also see it in the performance tables) that the current Intel version of gemm in OpenCV is faster than CLBlast, so maybe we should keep the Intel version.

@asmorkalov (Contributor):

@fengyuentau Thanks a lot for the effort! The PR was discussed at the OpenCV Core team meeting, and the conclusion is the following:

  • Do not implement dynamic dependency for now.
  • Use find_package or an alternative to find CLBlast as a dependency and build against an external library instance.
  • Do not put it in 3rdparty for now, since we have troubles with the most popular platforms: Intel and Apple ARM.

@fengyuentau (Member, Author):

"we have troubles with the most popular platforms: Intel and Apple ARM."

I have run several tests on the CLBlast accuracy problem. It turns out that CLBlast gives incorrect results when it uses its tuning results for these platforms, and gives correct results once those tuning results are reverted. See my repo for testing: https://github.com/fengyuentau/test-clblast.

5 participants