core (opencl): CLBlast integration via dynamic loading #25568

fengyuentau · 2024-05-10T09:37:56Z

Second commit is all about auto-generated code.

Usage

Get CLBlast:

git clone https://github.com/CNugteren/CLBlast
cmake -B build -S CLBlast -DCMAKE_INSTALL_PREFIX=build/install
cmake --build build --target install -j8

Test with this patch:

git clone https://github.com/fengyuentau/opencv
cd opencv
git checkout clblast_integration

export CLBLAST_INSTALL_DIR=/abs/path/to/CLBLAST-build/install
cmake -B build -DWITH_OPENCL=ON .
cmake --build build --target opencv_test_core opencv_perf_core -j8

export LD_LIBRARY_PATH=/abs/path/to/CLBLAST-build/install/lib # Use DYLD_LIBRARY_PATH on macOS
./build/bin/opencv_test_core --gtest_filter="*OCL_*Gemm*"
./build/bin/opencv_perf_core --gtest_filter="*OCL_GemmFixture_Gemm*"

Performance

All perf results in a zip: perf.mali-g52+m1+uhd770+gtx1080ti.zip.

Usage example:

python opencv/modules/ts/misc/summary.py opencv_perf_core.gtx1080ti.xml opencv_perf_core.gtx1080ti.clblast.xml

Khadas VIM4 (8GB mem, 32GB disk space) with Mali G52 r1p0

Geometric mean (ms)

                        Name of Test                            opencv            opencv                opencv
                                                                 perf              perf                  perf
                                                             core.mali-g52 core.mali-g52.clblast core.mali-g52.clblast
                                                                                                          vs
                                                                                                        opencv
                                                                                                         perf
                                                                                                     core.mali-g52
                                                                                                      (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      40.127            24.210                 1.66
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      100.475           159.676                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               41.968            23.417                 1.79
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               103.093           100.620                1.02
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      43.569            24.015                 1.81
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      104.165           99.436                 1.05
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               42.239            25.108                 1.68
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               102.858           155.517                0.66
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      37.940            21.861                 1.74
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      95.145            149.887                0.63
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               37.004            21.203                 1.75
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               93.874            153.838                0.61
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    288.005           147.788                1.95
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                    778.214           591.137                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)             294.309           144.776                2.03
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)             784.807           588.121                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)    296.440           147.466                2.01
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)    790.240           590.712                1.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)             294.904           149.621                1.97
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)             786.937           595.259                1.32
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)    281.212           137.360                2.05
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)    760.520           572.009                1.33
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             278.671           135.431                2.06
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)             755.883           568.079                1.33
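
For reading the last column: judging by the rows above, the x-factor appears to be the baseline time divided by the CLBlast time, so values above 1 mean CLBlast is faster. The sketch below reproduces that arithmetic for two rows of the Mali G52 table. The helper names `xFactor` and `roundTo2` are made up for illustration and are not part of summary.py.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the candidate run over the baseline run.
// Values above 1.0 mean the candidate (here: the CLBlast build) is faster.
inline double xFactor(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Round to two decimals, the precision shown in the tables.
inline double roundTo2(double v)
{
    return std::round(v * 100.0) / 100.0;
}
```

With the numbers from the first two rows, `roundTo2(xFactor(40.127, 24.210))` gives 1.66 and `roundTo2(xFactor(100.475, 159.676))` gives 0.63, matching the table.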

Macbook Air M1 (16GB mem, 512GB disk space)

Accuracy problem with scale >= 1280, but it is OK with scale = 1024.

Geometric mean (ms)

                        Name of Test                         opencv      opencv          opencv
                                                              perf        perf            perf
                                                             core.m1 core.m1.clblast core.m1.clblast
                                                                                           vs
                                                                                         opencv
                                                                                          perf
                                                                                         core.m1
                                                                                       (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                    2.756       3.033           0.91
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                   10.924      11.487           0.95
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)             4.238       3.738           1.13
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)            13.757      14.091           0.98
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)    4.396       3.320           1.32
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)   14.316      11.654           1.23
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)             4.287       3.525           1.22
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)            14.061      13.599           1.03
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)    4.502       3.955           1.14
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)   13.066      12.714           1.03
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)             3.938       3.896           1.01
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)            14.141      13.373           1.06
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                 34.337      failed             -
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                 128.817     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)          35.070      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)          131.373     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1) 35.882      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2) 132.787     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)          34.672      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)          131.903     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1) 35.527      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2) 132.244     failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)          34.429      failed             -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)          131.407     failed             -

PC with i7-12700K (64GB mem, 1T disk space) with Intel(R) UHD Graphics 770

Accuracy problem with complex (type CV_32FC2).

Geometric mean (ms)

                        Name of Test                           opencv          opencv              opencv
                                                                perf            perf                perf
                                                             core.uhd770 core.uhd770.clblast core.uhd770.clblast
                                                                                                     vs
                                                                                                   opencv
                                                                                                    perf
                                                                                                 core.uhd770
                                                                                                 (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                      1.177           1.840               0.64
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                      9.703          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)               1.526           1.591               0.96
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)               9.783          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)      3.836           1.869               2.05
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)      9.914          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)               1.526           2.050               0.74
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)               9.805          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)      1.533           2.103               0.73
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)      9.821          failed                 -
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)               1.180           1.850               0.64
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)               9.737          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                    9.267          11.262               0.82
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                   77.391          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)            12.185          10.151               1.20
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)            78.431          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)   30.303          11.136               2.72
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)   78.971          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)            10.979          12.519               0.88
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)            78.100          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)   11.008          12.471               0.88
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)   78.144          failed                 -
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)             9.310          11.254               0.83
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)            77.435          failed                 -

PC with GTX 1080 Ti (12GB gpu mem, CUDA 12.3)

Geometric mean (ms)

                        Name of Test                             opencv             opencv                 opencv
                                                                  perf               perf                   perf
                                                             core.gtx1080ti core.gtx1080ti.clblast core.gtx1080ti.clblast
                                                                                                             vs
                                                                                                           opencv
                                                                                                            perf
                                                                                                       core.gtx1080ti
                                                                                                         (x-factor)
Gemm::OCL_GemmFixture::(640x640, 0, 32FC1)                       0.338              0.307                   1.10
Gemm::OCL_GemmFixture::(640x640, 0, 32FC2)                       0.654              0.480                   1.36
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC1)                0.432              0.306                   1.41
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T, 32FC2)                0.819              0.483                   1.70
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC1)       0.505              0.285                   1.77
Gemm::OCL_GemmFixture::(640x640, GEMM_1_T|GEMM_2_T, 32FC2)       0.916              0.520                   1.76
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC1)                0.431              0.292                   1.48
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T, 32FC2)                0.821              0.497                   1.65
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC1)       0.398              0.296                   1.34
Gemm::OCL_GemmFixture::(640x640, GEMM_2_T|GEMM_3_T, 32FC2)       0.690              0.499                   1.38
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC1)                0.338              0.308                   1.10
Gemm::OCL_GemmFixture::(640x640, GEMM_3_T, 32FC2)                0.656              0.483                   1.36
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC1)                     2.018              1.395                   1.45
Gemm::OCL_GemmFixture::(1280x1280, 0, 32FC2)                     4.259              3.759                   1.13
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC1)              2.270              0.969                   2.34
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T, 32FC2)              4.830              3.071                   1.57
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC1)     2.555              1.200                   2.13
Gemm::OCL_GemmFixture::(1280x1280, GEMM_1_T|GEMM_2_T, 32FC2)     5.287              3.536                   1.50
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC1)              2.376              1.359                   1.75
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T, 32FC2)              4.855              3.892                   1.25
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC1)     2.375              1.364                   1.74
Gemm::OCL_GemmFixture::(1280x1280, GEMM_2_T|GEMM_3_T, 32FC2)     4.852              3.916                   1.24
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC1)              2.121              1.181                   1.80
Gemm::OCL_GemmFixture::(1280x1280, GEMM_3_T, 32FC2)              4.506              3.547                   1.27

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

force_builders=Linux OpenCL

fengyuentau · 2024-05-11T09:20:11Z

Observed problems:

On Intel i7-12700K with Intel(R) UHD Graphics 770: CLBlast has an accuracy problem with complex data (type CV_32FC2).
On Apple M1: CLBlast has an accuracy problem if scale >= 1280, but it is OK with scale = 1024.
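
An accuracy problem of this kind can be checked independently of OpenCV by comparing the GPU result against a naive CPU reference. The following is a minimal sketch under stated assumptions: square row-major float matrices, optional transposition of the first two operands (the role GEMM_1_T and GEMM_2_T play in the test names), and hypothetical helper names `referenceGemm` and `maxAbsDiff`. It is not the check used by opencv_test_core.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector.
using Mat = std::vector<float>;

// Reference D = alpha * op(A) * op(B) + beta * C for square matrices,
// where op() optionally transposes its argument.
Mat referenceGemm(const Mat& A, const Mat& B, const Mat& C, std::size_t n,
                  float alpha, float beta, bool transA, bool transB)
{
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    Mat D(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double acc = 0.0;  // accumulate in double to limit rounding error
            for (std::size_t k = 0; k < n; ++k)
            {
                const float a = transA ? A[k * n + i] : A[i * n + k];
                const float b = transB ? B[j * n + k] : B[k * n + j];
                acc += static_cast<double>(a) * b;
            }
            D[i * n + j] = static_cast<float>(alpha * acc + beta * C[i * n + j]);
        }
    return D;
}

// Largest element-wise absolute difference between two results.
float maxAbsDiff(const Mat& x, const Mat& y)
{
    assert(x.size() == y.size());
    float worst = 0.f;
    for (std::size_t i = 0; i < x.size(); ++i)
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    return worst;
}
```

A device result would be copied back to the host and passed to `maxAbsDiff` together with the reference. The acceptable threshold depends on matrix size and value range and is left to the caller.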

opencv-alalek

Dynamic loading makes sense if the project being loaded has strong API versioning.
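
To illustrate why versioning matters here: with dynamic loading, a missing library or a missing symbol is only discovered at run time, and a changed signature is not detected at all. The sketch below uses the POSIX `dlopen`/`dlsym` calls to resolve a single symbol. The helper name `loadSymbol` and the library file name in the usage note are assumptions for illustration, not the loader code from this patch.

```cpp
#include <dlfcn.h>   // POSIX dlopen / dlsym / dlclose / dlerror
#include <cassert>
#include <cstdio>

// Try to resolve one symbol from a shared library at run time.
// Returns nullptr if either the library or the symbol cannot be found.
void* loadSymbol(const char* library, const char* symbol)
{
    assert(library != nullptr && symbol != nullptr);
    void* handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
    if (!handle)
    {
        const char* err = dlerror();
        std::fprintf(stderr, "dlopen(%s) failed: %s\n", library, err ? err : "unknown error");
        return nullptr;
    }
    void* fn = dlsym(handle, symbol);
    if (!fn)
    {
        const char* err = dlerror();
        std::fprintf(stderr, "dlsym(%s) failed: %s\n", symbol, err ? err : "unknown error");
        dlclose(handle);
    }
    // On success the handle stays open so the returned pointer remains valid.
    return fn;
}
```

For example, `loadSymbol("libclblast.so", "CLBlastSgemm")` would return the address of the single-precision GEMM entry point declared in clblast_c.h if the library is on the loader path (the file name is an assumption and differs per platform). The caller must cast it to a function pointer type that matches the header it was compiled against, which is exactly where a version mismatch would go unnoticed.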

modules/core/include/opencv2/core/opencl/ocl_defs.hpp

opencv-alalek · 2024-05-13T11:12:14Z

cmake/OpenCVDetectOpenCL.cmake

+  if(WITH_CLBLAST)
+    find_path(CLBLAST_INCLUDE_DIR
+              NAMES clblast_c.h
+              HINTS ENV CLBLAST_INSTALL_DIR


If we want to always use "dynamic loading" then it makes sense to place this header into 3rdparty/include

Yes, it may work, but the CLBlast version is bumped occasionally, along with tuned results for different devices. If we put a fixed header in 3rdparty/include, I guess it won't block a newer library unless this file changes significantly, right?

There is just no reason to use DIFFERENT versions of the following:

header clblast_c.h

autogenerated files with functions from another version of clblast_c.h

modules/core/src/matmul.dispatch.cpp

opencv-alalek · 2024-05-13T11:21:25Z

modules/core/src/matmul.dispatch.cpp

+                              (const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
+                              (float)beta,
+                              (cl_mem)D.handle(ACCESS_RW), offsetC, ldc,
+                              &queue, NULL);


What about async processing?

Do you mean async processing several calls to Sgemm for example?

modules/core/src/ocl.cpp

modules/core/src/opencl/runtime/opencl_clblast.cpp

opencv-alalek · 2024-05-13T11:30:19Z

CMakeLists.txt

@@ -415,6 +415,9 @@ OCV_OPTION(WITH_OPENCLAMDFFT "Include AMD OpenCL FFT library support" ON
 OCV_OPTION(WITH_OPENCLAMDBLAS "Include AMD OpenCL BLAS library support" ON
  VISIBLE_IF NOT ANDROID AND NOT IOS AND NOT XROS AND NOT WINRT
  VERIFY HAVE_CLAMDBLAS)
+OCV_OPTION(WITH_CLBLAST "Include CLBlast library support" ON
+  VISIBLE_IF TRUE
+  VERIFY HAVE_CLBLAST)


VERIFY

This would break existing build configurations with ENABLE_CONFIG_VERIFICATION

/cc @mshabunin

I've checked build with clblast and verification, it works fine:

cmake \
    -DCMAKE_INSTALL_PREFIX=install \
    -DWITH_QT=ON \
    -DWITH_1394=OFF \
    -DWITH_JASPER=OFF \
    -DWITH_OPENCLAMDFFT=OFF \
    -DWITH_OPENCLAMDBLAS=OFF \
    -DWITH_LAPACK=OFF \
    -DWITH_CLBLAST=ON \
    -DENABLE_CONFIG_VERIFICATION=ON \
    ../opencv
...
-- OpenCL: YES (CLBlast INTELVA)
-- Include path: /work/opencv/3rdparty/include/opencl/1.2 /usr/include
-- Link libraries: Dynamic load
...
-- Verifying WITH_CLBLAST=ON => 'HAVE_CLBLAST'=TRUE
...

BTW, I observe several identical warnings in matmul.dispatch.cpp (GCC 11, Ubuntu 22):

/opencv/modules/core/src/matmul.dispatch.cpp:147:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  147 | (const cl_mem)A.handle(ACCESS_READ), offsetA, lda,
...
/opencv/modules/core/src/matmul.dispatch.cpp:184:31: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
  184 | (const cl_mem)B.handle(ACCESS_READ), offsetB, ldb,
      | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Warnings are resolved by removing const.
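
As background for that fix: in the OpenCL headers `cl_mem` is a typedef for a pointer to an opaque struct, so `(const cl_mem)` asks for a `_cl_mem* const` prvalue. The top-level `const` on a cast result of non-class type has no effect, which is what -Wignored-qualifiers reports. The sketch below is self-contained and uses a stand-in typedef rather than the real OpenCL header. `asClMem` is a hypothetical helper, not code from matmul.dispatch.cpp.

```cpp
#include <cassert>

// Stand-in for the OpenCL typedef: cl_mem is a pointer to an opaque struct.
struct _cl_mem;
typedef struct _cl_mem* cl_mem;

// Hypothetical helper mimicking a cast of a void* buffer handle.
inline cl_mem asClMem(void* handle)
{
    // Writing (const cl_mem)handle here would yield a prvalue of type
    // `_cl_mem* const`; the top-level const is dropped by the language,
    // so GCC warns. Casting to plain cl_mem is equivalent and warning-free.
    return static_cast<cl_mem>(handle);
}
```

Dropping the qualifier does not change what the callee receives, because the pointer is passed by value either way.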

vpisarev · 2024-05-14T18:27:42Z

@fengyuentau, from the patch I can conclude that we need only a small portion of clblast. Can we extract a subset of clblast and put it to opencv/3rdparty and link it to OpenCV? (i.e. don't use dynamic loading, which is much less convenient for end users). Also, I believe, we need to solve problems with mac and intel somehow. I remember you said (and also see it from the performance charts) that the current Intel version of gemm in OpenCV is faster than clblast, maybe we should keep Intel version.

asmorkalov · 2024-05-17T07:57:03Z

@fengyuentau Thanks a lot for the effort! The PR was discussed at the OpenCV Core team meeting, and the conclusion is the following:

Do not implement dynamic dependency for now.
Use find_package or alternative to find CLBlast as dependency and build against external library instance.
Do not put it into 3rdparty for now, as long as we have troubles with the most popular platforms: Intel and Apple ARM.

fengyuentau · 2024-05-17T10:24:18Z

we have troubles with the most popular platforms: Intel and Apple ARM.

I have run several tests on the CLBlast accuracy problem. It turns out that CLBlast with the tuning results for these platforms gives incorrect results, and after reverting those tuning results it gives correct results. See my repo for testing: https://github.com/fengyuentau/test-clblast.

fengyuentau added 2 commits May 10, 2024 17:35

initial commit

f06b90c

add auto-generated code

b5d03e3

fengyuentau added the category: core label May 10, 2024

fengyuentau requested review from vpisarev, asmorkalov and opencv-alalek May 10, 2024 09:37

fix ci: drop haveClblast

2720d5e

asmorkalov added the optimization label May 13, 2024

opencv-alalek added the category: ocl label May 13, 2024

opencv-alalek reviewed May 13, 2024

fengyuentau added 2 commits May 14, 2024 10:26

resolve review comments

28a5c50

resolve comment: improve logging

0a1e1c2

fengyuentau added 2 commits May 16, 2024 10:45

resolve warnings

3de29b5

skip clblast call if device vendor is apple or intel

f022945
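
The vendor-based skip named in the last commit title could look roughly like the sketch below. This is an assumption about the idea only, not the code from that commit: the function name `shouldSkipClblast` is hypothetical, and the vendor string is assumed to come from the OpenCL device query (for example the value reported for CL_DEVICE_VENDOR).

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate: true when the vendor string names Apple or Intel,
// the two platforms where accuracy problems were observed in this thread.
bool shouldSkipClblast(std::string vendor)
{
    // Lower-case the copy so the match is case-insensitive.
    std::transform(vendor.begin(), vendor.end(), vendor.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return vendor.find("apple") != std::string::npos ||
           vendor.find("intel") != std::string::npos;
}
```

A caller would fall back to the existing OpenCL GEMM path whenever this returns true. Substring matching is a deliberate simplification, since vendor strings vary between drivers.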
