
sync with intel #5
Open · wozna opened this issue Sep 30, 2022 · 79 comments
@wozna

wozna commented Sep 30, 2022

I get the same error as you described here when I run cmake -B build (https://github.com/zh794390558/paddle_build/blob/main/test/u2/patch/README.md). How can I resolve it?

@yaomichael

@zh794390558

@yeliang2258

You can try renaming this lib and then rebuilding: /workspace/Paddle/build_cuda/third_party/install/gflags/lib

@wozna
Author

wozna commented Sep 30, 2022

@yeliang2258 Which lib should be renamed?

@zh794390558
Owner

zh794390558 commented Sep 30, 2022

gflags.a conflicts; you need to remove third_party/install/gflags/lib from the Paddle build dir.
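For reference, a minimal sketch of that workaround; the build directory path below is just the example from the earlier comment, use your own Paddle build dir:

# remove the conflicting static gflags install from the Paddle build tree
rm -rf /workspace/Paddle/build_cuda/third_party/install/gflags/lib
# then rebuild the u2 test so it links against the remaining gflags copy
cd test/u2 && cmake -B build && cmake --build build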

@zh794390558
Owner

If you hit a gflags error while compiling, you need to delete the two gflags folders under the Paddle build directory.
[screenshot]

When running, change the export LD_LIBRARY_PATH path in test/u2/local/run.sh to the correct path for your environment.
If you hit a libpaddle.so problem at runtime, rename the soname of libpaddle.so in the fluid directory of the Paddle installation under Python: run patchelf --set-soname libpaddle.so libpaddle.so
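A minimal sketch of that rename; the way the fluid directory is located here is an assumption, adjust to your environment:

# locate the fluid dir inside the installed paddle package (assumed layout) and reset the soname
cd "$(python -c 'import os, paddle; print(os.path.dirname(paddle.__file__))')/fluid"
patchelf --set-soname libpaddle.so libpaddle.so
readelf -d libpaddle.so | grep SONAME   # verify the new soname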

@zh794390558
Owner

cd test/u2, build with cmake -B build and cmake --build build, then run sh ./local/run.sh to test.
If compilation fails because header files cannot be found, modify the CMakeLists.
[screenshot]

@zh794390558
Owner

Build Paddle following this script: https://github.com/zh794390558/paddle_build/blob/main/scripts/build_gpu_py_test.sh

@jakpiase

jakpiase commented Oct 9, 2022

Hi @zh794390558, I am taking over this task of optimizing the U2++ model from the Intel side. I need some clear steps for reproducing the performance of this model, since the repo currently has some absolute paths inside it and the cmake script enforces Python 3.7.0 and installs GPU packages even if -DWITH_GPU is set to OFF. It also seems like gflags and glog have some kind of conflict, and manually copying files is not a convenient way to work with that. If you are able to fix these issues, I will be glad to start working on optimizing this model.

@zh794390558
Owner

zh794390558 commented Oct 10, 2022

I need some clear steps for reproducing the performance of this model … If you are able to fix these issues, I will be glad to start working on optimizing this model.

  1. For the gflags conflict, you need to find gflags.a and rm it.
  2. You can set USE_PROFILING off like:
    option(USE_PROFILING "whether to do profiling" OFF)
    , then the absolute paths will not be needed.
  3. How to build:
  • create and source a python venv, then build
PYTHON=python3.7
virtualenv -p ${PYTHON} venv
. venv/bin/activate
cmake -B build
cmake --build build -j
  4. Prepare test data
    like this: https://github.com/zh794390558/paddle_build/tree/main/test/u2#test-data

  5. Download the model
    https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/wenetspeech/asr1/README.md#u2-chunk-conformer-1

  6. Run the test

./local/run.sh

My workspace looks like this:

total 1.9G
drwxr-xr-x 17 1008 1009  12K Oct 10 05:37 ./
drwxr-xr-x  7 1008 1009 4.0K Sep 29 11:06 ../
drwxr-xr-x  2 1008 1009 4.0K Sep 21 08:02 asr1_chunk_conformer_u2pp_wenetspeech_static_1.1.0.model/
-rw-r--r--  1 1008 1009 461M Sep 21 11:24 asr1_chunk_conformer_u2pp_wenetspeech_static_1.1.0.model.tar.gz
drwxr-xr-x  2 root root 4.0K Oct  9 12:21 asr1_chunk_conformer_u2pp_wenetspeech_static_quant_1.1.0.model/
-rw-r--r--  1 root root 523M Oct  8 09:44 asr1_chunk_conformer_u2pp_wenetspeech_static_quant_1.1.0.model.tar.gz
drwxr-xr-x  2 1008 1009 4.0K Sep 15 07:10 asr1_chunk_conformer_u2_wenetspeech_static_1.1.0.model/
-rw-r--r--  1 1008 1009 432M Sep 21 11:23 asr1_chunk_conformer_u2_wenetspeech_static_1.1.0.model.tar.gz
drwxr-xr-x  9 root root 4.0K Oct  9 12:14 build/
-rw-r--r--  1 1008 1009  897 Aug 17 03:31 .clang-format
drwxr-xr-x  2 1008 1009 4.0K Oct  9 07:02 cmake/
-rw-r--r--  1 root root 3.7K Oct  9 07:03 CMakeLists.txt
drwxr-xr-x  2 1008 1009 4.0K Sep 27 02:34 data/
drwxr-xr-x  2 1008 1009 4.0K Sep 21 07:02 decoder/
-rw-r--r--  1 1008 1009 6.6K Sep 19 07:46 decoder_main.cc
-rw-r--r--  1 1008 1009 190M Sep 30 10:14 decoder.main.prof
drwxr-xr-x  3 1008 1009 4.0K Sep 22 02:18 exp/
drwxr-xr-x 20 1008 1009 4.0K Oct  9 07:01 fc_base/
drwxr-xr-x  2 1008 1009 4.0K Sep 19 07:49 frontend/
-rw-r--r--  1 1008 1009  142 Sep 21 07:13 .gitignore
drwxr-xr-x  2 1008 1009 4.0K Oct  9 03:29 local/
-rw-r--r--  1 1008 1009 1.3K Sep 21 07:13 main.cc
-rw-r--r--  1 1008 1009 5.0K Sep 21 07:13 main_test.cc
-rw-r--r--  1 1008 1009 305M Sep 26 09:22 paddlepaddle_gpu-0.0.0-cp37-cp37m-linux_x86_64.whl
drwxr-xr-x  3 1008 1009 4.0K Aug 17 03:31 patch/
-rwxr-xr-x  1 1008 1009 9.5M Dec  7  2021 process_perf_linux*
-rw-r--r--  1 1008 1009 2.6K Sep 28 02:26 README.md
-rw-r--r--  1 1008 1009 2.4K Sep 27 06:09 sysconfig.py.bak
drwxr-xr-x  2 1008 1009 4.0K Aug 19 11:45 test/
drwxr-xr-x  2 1008 1009 4.0K Aug 17 03:31 utils/
drwxr-xr-x  2 1008 1009 4.0K Sep 21 12:00 wenet/
-rw-r--r--  1 1008 1009 157K Nov  5  2021 zh.wav

You can use data/wav.20.scp for a quick test, and data/wav.aishell.test.scp for the whole test.

@zh794390558
Owner

Using Paddle Inference for int8: PaddlePaddle/Paddle#46821

@wozna
Author

wozna commented Oct 10, 2022

I just managed to fully build paddle_build. @jakpiase has now taken over the task from me and will take care of data preparation and running the model.

@zh794390558
Owner

zh794390558 commented Oct 11, 2022

I am using wenet at commit 638deb7ca859fb5eccdf696c48534ea8949d9a9e, which uses torch 1.10.0. The working dir is wenet/runtime/server/x86.

I created a PR with my modifications: https://github.com/zh794390558/wenet/tree/u2_runtime_rtf.

The static model is from https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md#model-list (the wenetspeech one).

If using torch 1.10.0 with the non-quantized model, you need to download the checkpoint and use wenet/wenet/bin/export_jit.py to export the non-quantized static model.

torch 1.10 needs Ubuntu 18.04.

@zh794390558
Owner

zh794390558 commented Oct 11, 2022

u2pp wenetspeech (20220506_u2pp_conformer_libtorch) model RTF

PyTorch quantizes linear layers only, using dynamic quantization.
Benchmark with one thread, 100% CPU.
ctc_weight 0.5, reverse_weight 0.3, rescoring_weight 1.0

Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz (w/ avx512_vnni)

  • pytorch

    • noquant: 0.2657
    • quant: 0.1334
  • paddle inference

    • not quant: 0.3787
    • quant: 0.7670 (20 utt)

Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (w/o avx512_vnni)

  • pytorch

    • noquant 0.283
    • quant: 0.1582
  • paddle pe

    • noquant: 0.8541 (20 utts) 100%CPU

    0.3512 (all) 6000%CPU; 0.3562 (20 utts) 6000%CPU;

    • quant: -
  • paddle inference

    • noquant: 0.4237 (20 utts) 100 %CPU
    • quant: -

To benchmark with one thread, oneDNN needs OMP_NUM_THREADS=1 set.
ref: https://oneapi-src.github.io/oneDNN/dev_guide_performance_settings.html
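For reference, a minimal single-thread setup for this benchmark (decoder_main arguments are omitted here):

export OMP_NUM_THREADS=1      # keep oneDNN / OpenMP on a single thread
./build/decoder_main <args>   # then measure RTF as above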

@zh794390558
Owner

zh794390558 commented Oct 12, 2022

With export DNNL_VERBOSE=1 set, you can see that libtorch also uses mkldnn for its CPU backend.

I have a question: what is the difference between ONEDNN_VERBOSE and DNNL_VERBOSE?

@bin1guo

bin1guo commented Oct 12, 2022

In the old version, DNNL_VERBOSE was used to control whether oneDNN is used.
In the latest version, DNNL_VERBOSE becomes a cmake option: when you build with cmake -DDNNL_VERBOSE=1, the runtime parameter ONEDNN_VERBOSE is used to turn oneDNN on/off at runtime.
So if you want to use oneDNN, build with cmake -DDNNL_VERBOSE=1 by default, and set ONEDNN_VERBOSE=1 at runtime.
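A minimal sketch of checking the runtime verbose log (decoder_main arguments omitted; DNNL_VERBOSE is the older name of the same runtime switch):

export ONEDNN_VERBOSE=1
./build/decoder_main <args> 2>&1 | grep onednn_verbose | head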

@yeliang2258

After profiling, I found that the quantized FC OP is almost 2x slower than before quantization.
This is the fp32 time:
[screenshot]
This is the int8 time:
[screenshot]

@yeliang2258

yeliang2258 commented Oct 12, 2022

Hi @wozna @jakpiase @yaomichael
In my opinion, we now have two issues that need to be addressed:

  1. Non-quantized model: Paddle is about 40% slower than torch.
  2. Paddle's quantized model is 2 times slower than the non-quantized one, mainly due to the FC OP.

Is there any solution to these two problems? Or how should I work with you to solve them together?

@zh794390558
Owner

zh794390558 commented Oct 12, 2022

FYI, with DNNL_VERBOSE=1 enabled, the libtorch output can be summarized with: cat log | grep dnnl_verbose | grep exec | grep cpu | awk -F, '{ print $4}' | sort -u

the output for the quant model is:

convolution
reorder

the output of non-quant model is :

convolution
reorder

@yeliang2258

yeliang2258 commented Oct 12, 2022

I am using wenet at commit 638deb7ca859fb5eccdf696c48534ea8949d9a9e, which uses torch 1.10.0 … torch 1.10 needs Ubuntu 18.04.

Hi, @wozna @jakpiase @yaomichael You can reproduce the inference results of libtorch by following the steps below.

The whole process of inference with libtorch:

  1. install torch
python -m pip install torch==1.10.0 torchvision torchaudio==0.10.0 cudatoolkit==11.1
  2. git clone the wenet repo and build
git clone https://github.com/zh794390558/wenet.git
cd wenet
git checkout u2_runtime_rtf
cd runtime/server/x86
mkdir build && cd build && cmake .. && cmake --build .
cd ..
  3. run the int8 model
# run the run.sh script in the wenet/runtime/server/x86 folder
bash run.sh
  4. run the fp32 model
# generate fp32 model first
cd wenet/wenet/bin
wget https://wenet-1256283475.cos.ap-shanghai.myqcloud.com/models/wenetspeech/20220506_u2pp_conformer_exp.tar.gz

tar -xf 20220506_u2pp_conformer_exp.tar.gz

python export_jit.py --config 20220506_u2pp_conformer_exp/train.yaml --checkpoint 20220506_u2pp_conformer_exp/final.pt --output_file 20220506_u2pp_conformer_exp/final.zip

cd -

# change the model_dir in run.sh to the path of 20220506_u2pp_conformer_exp, then run it
bash run.sh

@jakpiase

I was able to reproduce the fp32 performance of PaddlePaddle's U2++ model by using the ./local/download.sh script, but I cannot reproduce the int8 performance since I have no access to the quantized model. There is no ./local/export.sh script in the paddle_build repo, and if I run ./local/export.sh under the PaddleSpeech/examples/wenetspeech/asr1 directory I get the following error:

using 0 gpus...
python3: can't open file '/export.py': [Errno 2] No such file or directory
Failed in export!

From what I can see, ./local/export.sh contains the following line:
[screenshot]
and the problem is present because ${BIN_DIR} is not set to anything, and I have no idea what value it should take.

@yeliang2258

yeliang2258 commented Oct 13, 2022

@jakpiase hello,

  1. You can download the quantized model from here, no need to export it yourself:
    https://drive.google.com/file/d/1YxZf_Xag2ar5YQUmzekrV2FZYwZ98iRo/view?usp=sharing

  2. To run the quantized model, you need to modify the Paddle/paddle/fluid/jit/engine/predictor_engine.h file as follows, then recompile Paddle and reinstall it.
    [screenshot]

  3. Finally, modify the model_dir in the run.sh script to the path of the quantized model, and execute the script (see the sketch below).
    [screenshot]
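A minimal sketch of step 3, assuming run.sh assigns the model path through a model_dir= variable (the variable format and the model path are assumptions, edit the script manually if it differs):

sed -i 's#^model_dir=.*#model_dir=/path/to/asr1_chunk_conformer_u2pp_wenetspeech_static_quant_1.1.0.model#' run.sh
bash run.sh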

@zh794390558 zh794390558 changed the title Error with gflags_nothreads_static sync with intel Oct 13, 2022
@zh794390558
Owner

zh794390558 commented Oct 14, 2022

u2pp wenetspeech (20220506_u2pp_conformer_libtorch) model RTF

PyTorch quantizes linear layers only, using dynamic quantization.
Benchmark with one thread, 100% CPU.
ctc_weight 0.5, reverse_weight 0.3, rescoring_weight 1.0

Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz (w/ avx512_vnni)

  • pytorch

    • FP32: 0.2657
    • INT8: 0.1334
  • paddle inference

    • FP32: 0.3425
    • INT8: 0.64

The FP32 FC uses the naive kernel; the mkldnn FC is slower than it.

Some optimizations are still needed to improve the Paddle Inference RTF.

@jakpiase

jakpiase commented Oct 14, 2022

Hi, I have successfully reproduced the U2++ model both on PaddlePaddle and PyTorch, however in my case oneDNN's int8 is just a little bit slower than oneDNN's fp32. I reproduced it on an Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz, because that is the only machine that has Ubuntu 18.04 and supports the VNNI instruction set. I have also analyzed the model's graph and there are some strange, non-optimal scenarios in which unnecessary copying exists, e.g.:
[screenshot]
In that case unsqueeze's X could just be stored as a persistent tensor inside conv2d, which would speed up the model because there would be less copying. How did you get this model? Did some automatic tool construct it from another model? Is there any way to change the parts of the model that look like the one in the picture, or should we just write a pass that will fuse these two operators? In terms of other optimizations, we have spawned some tasks internally and we will start working on them on Monday.

@zh794390558
Owner

I reproduced it on an Intel(R) Xeon(R) Gold 6348H CPU @ 2.30GHz, because that is the only machine that has Ubuntu 18.04 and supports the VNNI instruction set

Only libtorch needs 18.04; Paddle does not.

@zh794390558
Owner

zh794390558 commented Oct 17, 2022

I have also analyzed the model's graph and there are some strange and non-optimal scenarios in which unnecessary copying exists

I think it's because Paddle does not have a conv1d op, so conv2d is used to implement it.

@yaomichael

yaomichael commented Oct 17, 2022

MicrosoftTeams-image (5)

The above is in fp32, which looks good because it has a pass handling the weights; that pass is missing in the INT8 case. Maybe we should do the same for INT8. @jakpiase

@zh794390558
Owner

Now I am integrating the u2 model into PaddleSpeech, and I encountered the problem below:

[screenshot]

The PR is here: PaddlePaddle/PaddleSpeech#2524. I am using the registry.baidubce.com/paddlepaddle/paddle:2.0.0-gpu-cuda10.2-cudnn7 docker image as the environment; please follow these commands to reproduce the problem.

cd speechx
bash tools/venv.sh
. venv/bin/activate
./build.sh
pushd examples/u2pp_ol/wenetspeech
./run.sh --stop_stage 2

The log is in data/split40/xx/decoder.fbank.wolm.log

You can edit nj in local/decode.sh to configure how many processes are used for decoding (see the sketch below).
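A minimal sketch of that edit, assuming decode.sh sets the job count through an nj= variable near the top of the script:

# use 8 parallel decoding processes; pick a value that fits your machine
sed -i 's/^nj=.*/nj=8/' local/decode.sh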

@yeliang2258

yeliang2258 commented Oct 17, 2022

In the matmul_elementwise_add_mkldnn_fuse_pass.cc pass, because the axis of elementwise_add is 2, some matmul and elementwise_add pairs are not fused by the original pass. When I make the following modifications to the pass, matmul and elementwise_add do get fused, but the speed is much slower.

I modified the pass like this:
[screenshot]

and the fused op is:
[screenshot]

But the fused OP runs very slowly:
[screenshot]

@jakpiase

jakpiase commented Oct 25, 2022

The elementwise_add workaround is merged. After all changes, I ran int8 profiling of U2++ on CLX (Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz) on the full dataset for PaddlePaddle and I got RTF: 0.20.

@yeliang2258

yeliang2258 commented Oct 26, 2022

Cherry pick:

  1. [ cherrypick] Construct exec and ctx only once in cond op to speed up PaddlePaddle/Paddle#47012 for Construct exec and ctx only once in cond op to speed up PaddlePaddle/Paddle#47009
  2. [Cherry-pick] Added workaround for elementwise oneDNN kernel PaddlePaddle/Paddle#47342 for Added workaround for elementwise oneDNN kernel PaddlePaddle/Paddle#47080
  3. [cherry-pick] FC/matmul(v2) + scale fuse pass (#47127) PaddlePaddle/Paddle#47420 for FC/matmul(v2) + scale fuse pass PaddlePaddle/Paddle#47127
  4. [Cherry pick] Fix ComputePropagateScalesMkldnnPass of MKLDNN PaddlePaddle/Paddle#47639 for Fix ComputePropagateScalesMkldnnPass of MKLDNN PaddlePaddle/Paddle#47574
  5. [CHERRY-PICK] Added caching to oneDNN FC and op+unsqueeze2 and op+reshape2 fuse passes PaddlePaddle/Paddle#47690 for Optimized oneDNN FC and added operator+unsqueeze2 and operator+reshape2 oneDNN fuse passes PaddlePaddle/Paddle#47391 and Fix for bias caching and scales computation optimization for oneDNN FC PaddlePaddle/Paddle#47234
  6. [cherry-pick] Squeeze2 and transpose2 fuse using oneDNN PaddlePaddle/Paddle#47712 for suqeeze2 + transpose2 fuse onednn PaddlePaddle/Paddle#47592
  7. [cherry-pick] updating mul and matmul with set_mem_desc and fix squeeze_transpose for MKLDNN PaddlePaddle/Paddle#47951 for updating mul and matmul with set_mem_desc PaddlePaddle/Paddle#45624 and Fix squeeze_transpose fuse pass for MKLDNN PaddlePaddle/Paddle#47911
  8. [Cherry-pick] Fix slice bugs in MKLDNN when input dims are zeros PaddlePaddle/Paddle#47887 for Fix slice bugs in MKLDNN when input dims are zeros PaddlePaddle/Paddle#46671

Missing cherry-pick:
PaddlePaddle/Paddle#47780

@yeliang2258

The elementwise_add workaround is merged. After all changes, I ran int8 profiling of U2++ on CLX (Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz) on the full dataset for PaddlePaddle and I got RTF: 0.20.

Hi @jakpiase Can I reproduce your results with this paddle? https://github.com/jakpiase/Paddle/tree/temporary_u2_perf

@jakpiase

Hi @yeliang2258, I will prepare the next branch with my newest changes this evening.

@jakpiase

@yeliang2258 @zh794390558 This branch: https://github.com/jakpiase/Paddle/tree/temporary_u2_perf contains the newest optimizations, featuring a (FC/TRANSPOSE2) + RESHAPE2 fuse pass, a (TRANSPOSE2/ELEMENTWISE_MUL) + UNSQUEEZE2 fuse pass, and general oneDNN FC optimizations. Unfortunately PaddlePaddle's develop branch is broken and my PR #47391 cannot pass CI. Also, our 6271C machine is having some firewall problems, so I cannot download Eigen, build the newest version, and prepare any measurements. Could you please test this branch for three options: oneDNN int8, oneDNN FP32 (without oneDNN FC), and oneDNN FP32 (with oneDNN FC), and send the measurements here?

@zh794390558
Owner

zh794390558 commented Oct 27, 2022

@jakpiase @yaomichael I tested the quant model CER on the 6148, but the result is poor:

CER

  • int8: 19.74%
  • only FC quantized, int8: 19.78%
  • all optimizations disabled, fp32: 5.81%

This script runs the quant model:
https://github.com/zh794390558/paddle_build/blob/main/test/u2/local/run_quant.sh

The quantized model:
http://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/static/asr1_chunk_conformer_u2pp_wenetspeech_static_quant_1.3.0.model.tar.gz

Use this script to compute CER:

./local/compute-wer.py --char=1 --v=1 data/text exp/wav.aishell.test.chunk16.quant.hyp > exp/wav.aishell.test.chunk16.quant.hyp.err  

On the 6271C, the CER is also poor: 19.75% (only FC quantized, int8).

When I check the onednn log, I find these entries. Should binary be using a jit kernel?

onednn_verbose,exec,cpu,binary,ref:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,alg:binary_add,1x8x16x64:1x8x1x64,0.906982
onednn_verbose,exec,cpu,binary,ref:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,alg:binary_add,1x8x16x64:1x8x1x64,0.883057

inner_product logs: what is the difference between gemm_s8u8s32 and gemm_s8s8s32?

onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic2048oc512,2.99878
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc512,1.07812
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ab:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc1024,2.37793
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc512,0.828857
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc512,1.37207
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb1230ic512oc1024,17.696
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc512,0.819092
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ab:f0 bia_f32::blocked:a:f0 dst_u8::blocked:ab:f0,attr-oscale:2 attr-post-ops:eltwise_relu:0:0:3.21562 ,,mb210ic512oc2048,3.24609
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic2048oc512,2.99683
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc5538,13.698
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb210ic512oc5538,9.19092
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ab:f0 bia_f32::blocked:a:f0 dst_s8::blocked:ab:f0,attr-oscale:2 attr-post-ops:eltwise_swish:1 ,,mb16ic512oc2048,9.52295
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb200ic512oc1024,2.79004
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ab:f0 bia_f32::blocked:a:f0 dst_u8::blocked:ab:f0,attr-oscale:2 attr-post-ops:eltwise_relu:0:0:2.86064 ,,mb80ic512oc2048,2.45386
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb80ic2048oc512,1.3999
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,,,mb200ic512oc1024,2.75879
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ab:f0 bia_f32::blocked:a:f0 dst_u8::blocked:ab:f0,attr-oscale:2 attr-post-ops:eltwise_relu:0:0:10.0521 ,,mb80ic512oc2048,1.99316
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb80ic2048oc512,1.34985

onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb80ic2048oc512,1.42603
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb80ic512oc5538,5.5061
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8s8s32:jit,forward_inference,src_s8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb80ic512oc5538,5.49902
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb16ic9728oc512,2.76099

onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core,forward_inference,src_f32::blocked:ab:f0 wei_f32:p:blocked:AB16b64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-post-ops:eltwise_linear:1 ,,mb16ic512oc5538,2.30811
onednn_verbose,exec,cpu,inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0,attr-oscale:2 ,,mb16ic9728oc512,2.37915

@yeliang2258

yeliang2258 commented Oct 27, 2022

I found a new 6271C machine last night and the test results are as follows:

libtorch:
        FP32: RTF 0.242
        INT8: RTF 0.1207

intel temporary_u2_perf paddle:
        FP32 with oneDNN FC: RTF 0.26
        FP32 without oneDNN FC: RTF 0.2899
        INT8: RTF 0.1863

Paddle develop:
       FP32: RTF 0.2473
       INT8: RTF 0.2006 -> 0.1802

@jakpiase

@yeliang2258, I have uploaded a fix for that 5th case without oneDNN FC to the https://github.com/jakpiase/Paddle/tree/temporary_u2_perf branch.

@zh794390558 the difference between s8s8f32 and s8u8f32 is the data type of the weights in the inner_product (fully connected) oneDNN primitive. The three parts of s8u8f32 describe:
s8 - int8 source memory
u8 - uint8 weights memory
f32 - float32 destination memory
On my branch https://github.com/jakpiase/Paddle/tree/temporary_u2_perf I was not able to reproduce the ref binary kernel. I'll look into these CER issues tomorrow morning.

@zh794390558
Owner

zh794390558 commented Oct 28, 2022

@jakpiase

When will it choose u8 or s8 for the weights? Is there some hint? Should it be u8 for src and s8 for weights?

inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0

How can we debug the int8 kernels to find which one causes this precision error?
I also find that some FC ops still run in fp32, so not all FCs are quantized?

@yeliang2258

Hi, @yaomichael @jczaja The following PR was meant to fix the accuracy of picodet. After the PR was merged, we found that the accuracy of picodet was only 29%, indicating that the accuracy problem has not been completely fixed.
I'm not sure if the accuracy issue of the U2++ int8 model is related to this PR.
PaddlePaddle/Paddle#46378

@wozna
Author

wozna commented Oct 28, 2022

@jakpiase

When will it choose u8 or s8 for the weights? Is there some hint? Should it be u8 for src and s8 for weights?

inner_product,x64:gemm_s8u8s32:jit,forward_inference,src_u8::blocked:ab:f0 wei_s8::blocked:ba:f0 bia_f32::blocked:a:f0 dst_f32::blocked:ab:f0

How can we debug the int8 kernels to find which one causes this precision error? I also find that some FC ops still run in fp32, so not all FCs are quantized?

[screenshot]

In oneDNN, weights can only be s8, and the input can be s8 or u8. A u8 input is chosen when the op before the FC is an activation that returns positive values, like ReLU. So the u8 in gemm_s8u8s32 shows the input data type. But maybe we should verify that the data types used for the weights are correct.
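For reference, a quick way to tally which source/weight data types the inner_product kernels actually ran with, assuming the verbose output was captured to a file named onednn.log:

grep inner_product onednn.log | grep -oE 'src_(u8|s8|f32)|wei_(u8|s8|f32)' | sort | uniq -c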

@zh794390558
Owner

zh794390558 commented Oct 31, 2022

@wozna

I looked through compute_propagate_scales_mkldnn_pass.cc. Is UpdateScaleOpInScale implemented correctly? What is the purpose of this method, and why does it not divide by the scale? Should we remove the scale op?

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/ir/mkldnn/compute_propagate_scales_mkldnn_pass.cc#L354

Should it look like this?

void ComputePropagateScalesMkldnnPass::UpdateScaleOpInScale(
    Node* op_node,
    const std::string& input_name,
    const std::string& output_name,
    StringPairMap* var_quant_scales) const {
  auto iter = var_quant_scales->find(output_name);
  if (iter != var_quant_scales->end()) {
    auto pair = iter->second;
    const auto tensor = pair.second;

    const auto scale = PADDLE_GET_CONST(float, op_node->Op()->GetAttr("scale"));
    phi::DenseTensor tmp_tensor;
    tmp_tensor.Resize(tensor.dims());
    auto* data = tmp_tensor.mutable_data<float>(platform::CPUPlace());
    auto* src_data = tensor.data<float>();
    for (int i = 0; i < tensor.numel(); i++) {
      data[i] = src_data[i] * scale;
    }

    auto new_pair = std::make_pair(pair.first, tmp_tensor);
    var_quant_scales->insert(std::make_pair(input_name, new_pair));
  }
}

@zh794390558
Owner

zh794390558 commented Oct 31, 2022

For the CER of the quantized model, I tested under develop at commit 2953b708a03d023b6b6b1fecde7ac431f8f48a94, and it is 5.83%.

If compute_propagate_scales_mkldnn_pass is removed from https://github.com/jakpiase/Paddle/tree/temporary_u2_perf , the CER can be 5.7%.

@zh794390558
Owner

zh794390558 commented Nov 1, 2022

With PR PaddlePaddle/Paddle#47391 plus constant folding merged into develop, the CER can be 5.8%.

@yeliang2258

yeliang2258 commented Nov 1, 2022

@jakpiase Please update your Paddle branch, because we found that the accuracy of the quantized model is as expected when using the Paddle develop branch.

@yeliang2258

yeliang2258 commented Nov 2, 2022

Fixed by this: PaddlePaddle/Paddle#47574

@wozna

I looked through compute_propagate_scales_mkldnn_pass.cc. Is UpdateScaleOpInScale implemented correctly? What is the purpose of this method, and why does it not divide by the scale? Should we remove the scale op? … [quoted code omitted]

@zh794390558
Owner

zh794390558 commented Nov 3, 2022

@jczaja @wozna @yeliang2258 I added a squeeze2+transpose2 fuse for oneDNN in PaddlePaddle/Paddle#47592, please have a look.

Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0

On the 6271C machine:

FP32
w/o this pass: RTF 0.2509
w/ this pass: RTF 0.2465
relative improvement: 1.75%

Int8:
w/o this pass: RTF 0.2194
w/ this pass: RTF 0.2042
relative improvement: 6.93%

Cherry-pick to 2.4: PaddlePaddle/Paddle#47712

@jczaja

jczaja commented Nov 8, 2022

@zh794390558 , @yeliang2258 I tried to compare the pytorch U2++ model vs the PaddlePaddle U2++ model to check whether the models are the same. For example, I checked the number and shapes of the convolutions and the number and shapes of the matrix multiplication operations.
There is one element that I cannot see in pytorch: the PaddlePaddle model has 12 instances of the MUL operator that are not present in pytorch. In other words, the pytorch U2++ model has 195 matrix multiplication operations while PaddlePaddle has 207; these 12 extra matrix multiplications in PaddlePaddle U2++ are MUL operations.
u2pp_mul
The input to MUL comes from a slice op:
u2pp_slice

  1. Could you please explain whether those MUL operations are part of the original U2++ as present in pytorch, or something specific to PaddlePaddle?
  2. Could you point me to the code where the PaddlePaddle U2++ model is defined?

@zh794390558
Owner

2. Could you point me to code where PaddlePaddle u2++ model is defined?

https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/s2t/models/u2/u2.py#L856

@zh794390558
Owner

u2pp_slice

I think this part is for positional embedding.

@wozna
Author

wozna commented Nov 8, 2022

I think I found the main int8 problem: not all fc ops were quantized, even though they all have scales.

It improves the U2++ int8 model:

  • in the encoder model, before only 61 fc ops were quantized and now 85 are quantized.
  • in the decoder model, before only 39 were quantized and now 44 are quantized.

I prepared PR PaddlePaddle/Paddle#47780 with an explanation of what was wrong. Please check the results with this PR (see the sketch below).

I tested it on my CPX machine:

  • before: RTF 0.1693
  • now: RTF 0.1481
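A minimal sketch of pulling that PR in for testing (the local branch name pr-47780 is arbitrary):

# fetch the PR head from GitHub into a local branch and switch to it
git fetch https://github.com/PaddlePaddle/Paddle.git pull/47780/head:pr-47780
git checkout pr-47780
# rebuild Paddle as before, then re-run the RTF benchmark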

@jczaja

jczaja commented Nov 14, 2022

@zh794390558 , @yeliang2258 , @yaomichael

This is one of the recent profilings of U2++ / PaddlePaddle on CLX. You can see that right now the second operator taking the most time is conditional_block_infer. This operator has an Executor inside that executes its child operators (the indented rows in the profile).

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       Min.        Max.        Ave.        Ratio.      
fc                                         71012       24000.5     0.064178    22.4144     0.337978    0.27802     
conditional_block_infer                    39240       11382.6     0.002709    22.0537     0.290077    0.131856    
  Executor::RunPartialPreparedContext      19620       11088.2     0.070964    22.012      0.565145    0.128445    
    elementwise_add                        6060        3969.06     0.019537    3.89904     0.654961    0.0459774   
    assign                                 47232       1944.56     0.015345    2.59505     0.0411703   0.0225256   
    concat                                 23976       1207.47     0.029246    1.44262     0.0503618   0.0139873   
    softmax                                1212        854.235     0.051912    2.59794     0.704815    0.00989542  
    split                                  7992        754.78      0.041836    0.295232    0.0944419   0.00874333  
    fill_any_like                          6060        309.669     0.013966    0.677781    0.0511005   0.00358719  
    where                                  2424        245.602     0.019999    0.595704    0.101321    0.00284504  
    slice                                  8484        233.224     0.019615    0.062798    0.0274899   0.00270166  
    pad3d                                  1212        198.697     0.152893    0.558197    0.163942    0.0023017   
    cast                                   3636        126.302     0.01494     0.504419    0.0347366   0.00146308  
    fill_constant                          4848        112.083     0.016298    0.045141    0.0231195   0.00129837  
    unsqueeze2                             2424        106.538     0.036955    0.116693    0.0439512   0.00123413  
    expand_v2                              2424        97.7336     0.029475    0.087808    0.0403192   0.00113214  
    shape                                  3636        86.7803     0.014969    0.048187    0.023867    0.00100526  
    equal                                  1212        34.5323     0.023708    0.057693    0.028492    0.000400021 
    squeeze2                               1212        33.9919     0.022754    0.560127    0.0280461   0.000393761 

conv2d                                     29146       11084.1     0.081788    9.02908     0.380295    0.128397    
layer_norm                                 58011       4069.56     0.054913    2.19999     0.0701514   0.0471415   

The problem is that the ops that belong to the conditional_block_infer operator are not subject to IR passes, so they can neither be executed by oneDNN kernels nor take part in fuses. Baidu engineers were working on that problem in PaddlePaddle/Paddle#17003, but it was never solved. To speed up U2++ a bit more it would be good to have this issue fixed. Could you please resume work on enabling IR passes for operators that are part of conditional_block_infer?

@jczaja

jczaja commented Nov 21, 2022

Overhead analysis (paddle int8 vs pytorch int8)

One of the elements that contributes to the int8 performance gap is the bigger framework overhead on the Paddle side.

U2++ Paddle int8 Overhead is around : 23.5%
U2++ Pytorch int8 overhead is around: 9.8%

Details of comparison

Flamegraphs were generated for both Paddle and PyTorch while processing the full data set (as given in the instructions).

Paddle U2++ int8

u2++-paddle-int8-CPX

Pytorch U2++ int8

u2++-libtorch-int8-CPX

From flamegraphs we can get overhead :
overhead[%] = 100% - decode(spectral analysis, handling input data etc.) - operators

pytorch decode: 2.3%        # This was done by searching(CTRL+f) and typing "AcceptWav|PrefixBeam"
pytorch operators take: 87.8%  # "at::native"

paddle decode: 10.3%      #  "AcceptWav|PrefixBeam"
paddle operators: 68.9%  #  "Kernel<|primitive::execute|ConditionalBlockInfer|Eltwise|Handler<"
pytorch framework overhead: 100% - 87.8% - 1.59% - 0.7% = 9.8%   # "at::native|PrefixBeam|AcceptWav"
paddle framework overhead: 100% - 68.9% - 7.6% = 23.5%
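The same arithmetic spelled out, using the percentages read off the flamegraphs above:

echo "100 - 68.9 - 7.6" | bc          # paddle framework overhead  -> 23.5
echo "100 - 87.8 - 1.59 - 0.7" | bc   # pytorch framework overhead -> 9.91 (quoted as ~9.8 above)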

For Paddle we also have profiling from the same machine and the same experiment, which shows that the Paddle overhead is 22.28%:

-------------------------     Overhead Summary      -------------------------

Total time: 62192.4 

  Computation time       Total: 48333.3     Ratio: 77.7157%
  Framework overhead     Total: 13859.1     Ratio: 22.2843%

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 0           Total: 0           Ratio: 0%

-------------------------       Event Summary       -------------------------

Event                                      Calls       Total       Min.        Max.        Ave.        Ratio.      
fc                                         71012       18369.5     0.048279    22.2501     0.258681    0.295365    

Conclusion:

One of the elements responsible for the poorer performance of U2++ Paddle int8 is a bigger framework overhead (~23%) than in the case of pytorch (9.8%). So further work on reducing overhead is important and needed.

@zh794390558
Owner

@zh794390558 , @yeliang2258 , @yaomichael

This is one of the recent profilings of U2++ / PaddlePaddle on CLX … Could you please resume work on enabling IR passes for operators that are part of conditional_block_infer? [quoted profiling table omitted]

@yeliang2258 will focus on this problem.

@zh794390558
Owner

Overhead analysis (paddle int8 vs pytorch int8) … One of the elements responsible for the poorer performance of U2++ Paddle int8 is a bigger framework overhead (~23%) than in the case of pytorch (9.8%). [full analysis quoted above]

Another question: how do you generate a flamegraph?

@jczaja

jczaja commented Nov 25, 2022

@zh794390558
Flamegraphs (explained here: https://www.brendangregg.com/flamegraphs.html) are quite popular and supported by a number of tools. We have tested two methods of generating flamegraphs:

  1. Use Intel VTune. This method just requires preparing the workload (building with symbols) and, of course, having Intel VTune installed.
  2. Use the Linux kernel profiler perf plus a flamegraph generator. This method requires more steps, and I will focus on it since I'm more familiar with it.

Basic example:

To profile the Linux command sleep 1 we can use the flamegraph program.

Command to produce flamegraph:

flamegraph -- sleep 1
And output is:
flamegraph

Now, since GitHub markup blocks some SVG scripts, you should download this file and then open it in your web browser (Firefox and Chrome both work) for inspection.

The produced example has "unknown" blocks because we are missing debug symbols for the profiled workload.

Preparing workload

For PaddlePaddle we need an optimized build with debug symbols. For example:
cmake ../ -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_TESTING=OFF -DWITH_GPU=OFF -DWITH_PYTHON=ON -DPY_VERSION=3.9 -DWITH_LITE=OFF -DON_INFER=ON

Now we need the perf profiler plus software to turn a perf profile into a flamegraph.

Installing perf and flamegraph

There are two methods here:

  1. Install the flamegraph tool as described here. Note that this software is written in the Rust programming language, and installing it requires the Rust compiler (rustc) and package manager (cargo), installed as described here.
  2. Use the original repository of Brendan Gregg (the inventor of flamegraphs) to manually generate flamegraphs from existing perf data: https://github.com/brendangregg/FlameGraph . You still need to have perf installed.

The next thing is that oneDNN generates assembly code at runtime, i.e. JIT code. Profiling of this kind of code was added to perf / the Linux kernel a bit later, so we need a recent Linux kernel.

Operating system requirements

It works fine on CentOS 8+ and Ubuntu 18.04+. For other OSes, anything providing Linux kernel 5.0+ should be fine.
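A quick environment check before profiling (a sketch):

uname -r         # kernel 5.0+ recommended for profiling oneDNN JIT code
perf --version   # perf must be installed and working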

We also need to configure perf to annotate JIT code, and to let oneDNN know that the annotations should be in perf format.

Running U2++ pytorch to get flamegraph

DNNL_JIT_PROFILE=6 flamegraph -c 'record -F 997 -k 1 --call-graph lbr -D 5000 ' -v -- ./build/decoder_main <rest of arguments>

Legend:

  • -D 5000 instructs perf to start profiling 5 seconds after the U2++ inference is started. I'm not interested in profiling the fuses and initialization, which is why the 5 s delay is introduced.
  • DNNL_JIT_PROFILE=6 is an env var that instructs oneDNN to annotate JIT code for the perf profiler
  • -c overrides flamegraph's default perf settings
  • -k 1 is required by oneDNN
  • --call-graph lbr uses the Intel HW stack walker (limited to 32 levels)

Running perf introduces 5-10% overhead to the execution, and generating the flamegraph may take a very long time if the workload ran for an hour or so, so in general it is good to limit the experiment to 5-10 minutes.

Generating flamegraph using scripts

This is a method of flamegraph generation using scripts rather than the Rust-based flamegraph crate:

 DNNL_JIT_PROFILE=6 perf record -k1 --call-graph lbr -D 5000 ./build/decoder_main .......
perf inject -j -i perf.data -o perf.data.j
echo "Folding callstacks.."
perf script -i perf.data.j | /<my path>/FlameGraph/stackcollapse-perf.pl > out.perf-folded
echo "Generating flamegraph..."
/<my path>/FlameGraph/flamegraph.pl out.perf-folded > u2++-pytorch-flamegraph.svg

@zh794390558
Owner

@zh794390558 Flamegraphs (explained here: https://www.brendangregg.com/flamegraphs.html) are quite popular and supported by a number of tools. … [full flamegraph guide quoted above]

Thank you very much for the guide.

@zh794390558
Owner

Latest results from yeliang:

  • Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz (w/ avx512_vnni)
    • pytorch
      • noquant: 0.242
      • quant: 0.1207
    • paddle inference
      • not quant: 0.3371 -> 0.2473
      • quant: 0.64 -> 0.1802
