sync with intel #5
Comments
You can try renaming this lib and rebuilding: /workspace/Paddle/build_cuda/third_party/install/gflags/lib
@yeliang2258 Which lib should be renamed?
Build Paddle like this:
Hi @zh794390558, I am taking over the task of optimizing the U2++ model from the Intel side. I need clear steps for reproducing the performance of this model, since this repo currently contains some absolute paths, and the CMake script enforces Python 3.7.0 and installs GPU packages even when -DWITH_GPU is set to OFF. It also seems that gflags and glog have some kind of conflict, and manually copying files is not a convenient way to work around that. If you can fix these issues, I will be glad to start working on optimizing this model.
My workspace looks like below:
You can use:
Use Paddle Inference for int8: PaddlePaddle/Paddle#46821
I just managed to fully build paddle_build. @jakpiase has now taken over the task from me and will take care of data preparation and launching the model.
I am using WeNet at commit 638deb7ca859fb5eccdf696c48534ea8949d9a9e, which uses torch 1.10.0. The working dir is wenet/runtime/server/x86. I created a branch with my modifications: https://github.com/zh794390558/wenet/tree/u2_runtime_rtf. The static model is from https://github.com/wenet-e2e/wenet/blob/main/docs/pretrained_models.en.md#model-list (wenetspeech). If using torch 1.10.0 with a non-quantized model, you need to download the checkpoint and use wenet/wenet/bin/export_jit.py to export the non-quantized static model. torch 1.10 requires Ubuntu 18.04.
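For reference, a minimal sketch of the export step (the config/checkpoint paths are placeholders; verify the flags against wenet/bin/export_jit.py in your checkout):

```bash
# Export TorchScript static models (fp32 and quantized) from a WeNet checkpoint
python wenet/bin/export_jit.py \
  --config 20220506_u2pp_conformer_exp/train.yaml \
  --checkpoint 20220506_u2pp_conformer_exp/final.pt \
  --output_file final.zip \
  --output_quant_file final_quant.zip
```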
u2pp wenetspeech (20220506_u2pp_conformer_libtorch) model RTF
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz (w/ avx512_vnni)
Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (w/o avx512_vnni)
I have a question: what is the difference between them?
In old versions, DNNL_VERBOSE was used to control whether oneDNN is used.
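For reference, a minimal sketch of enabling oneDNN verbose logging (the inference binary name is a placeholder):

```bash
# Each executed oneDNN primitive prints a dnnl_verbose line
# (implementation, data types, shapes, execution time)
DNNL_VERBOSE=1 ./your_inference_binary 2>&1 | tee dnnl.log
```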
Hi @wozna @jakpiase @yaomichael
Is there any solution to these two problems? Or how should I work with you to solve these problems together?
FYI, with DNNL_VERBOSE=1 enabled, the libtorch log output of the quant model is:
The output of the non-quant model is:
Hi, @wozna @jakpiase @yaomichael You can reproduce the inference results of libtorch by following the steps below. The whole process of inference with libtorch:
@jakpiase hello,
u2pp wenetspeech (20220506_u2pp_conformer_libtorch) model RTF
Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz (w/ avx512_vnni)
Some optimizations for improving Paddle inference RTF:
Only libtorch needs Ubuntu 18.04; Paddle does not.
I think it's because Paddle does not have a conv1d op, so conv2d is used to implement it.
The above is in FP32, which looks good because it has a pass handling weights that is missing in the INT8 case. Maybe we should do the same for INT8. @jakpiase
Now I am integrating the u2 model into PaddleSpeech, and I encountered the problem below. The PR is here: PaddlePaddle/PaddleSpeech#2524. I am using
The log is in … You can edit …
In the matmul_elementwise_add_mkldnn_fuse_pass.cc pass, because the axis of elementwise_add is 2, some matmul and elementwise_add ops are not fused by the original pass. When I made the following modifications to the pass, matmul and elementwise_add were fused, but the speed became much slower.
The elementwise_add workaround is merged. After all the changes, I ran int8 profiling of U2++ for PaddlePaddle on CLX (Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz) on the full dataset and got RTF: 0.20
Hi @jakpiase, can I reproduce your results with this Paddle branch? https://github.com/jakpiase/Paddle/tree/temporary_u2_perf
Hi @yeliang2258, I will prepare the next branch with my newest changes this evening.
@yeliang2258 @zh794390558 At this branch: https://github.com/jakpiase/Paddle/tree/temporary_u2_perf there are the newest optimizations featuring: a (FC/TRANSPOSE2) + RESHAPE2 fuse pass, a (TRANSPOSE2/ELEMENTWISE_MUL) + UNSQUEEZE2 fuse pass, and general oneDNN FC optimizations. Unfortunately, PaddlePaddle's develop branch is broken and my PR #47391 cannot pass CI. Also, our 6271C machine is having firewall problems, so I cannot download Eigen, build the newest version, or prepare any measurements. Could you please test this branch for three options (oneDNN int8, oneDNN FP32 without oneDNN FC, oneDNN FP32 with oneDNN FC) and post the measurements here?
@jakpiase @yaomichael I tested the quant model CER on the 6148, but the result is poor. CER:
This script runs the quant model: … The quantized model: … This script computes the CER: …
Under the 6271C, the CER is also poor: 19.75% (only using FC quant, int8). When I check the oneDNN log, I find these entries. Should the binary be using jit?
inner_product logs: what is the difference between gemm_s8u8s32 and gemm_s8s8s32?
I found a new 6271C machine last night and the test results are as follows:
@yeliang2258, I have uploaded a fix for that 5th case without oneDNN FC to the https://github.com/jakpiase/Paddle/tree/temporary_u2_perf branch. @zh794390558, the difference between …
When does it choose u8 or s8 for the weights? Is there some hint? Should it be u8 for src and s8 for weights?
How can we debug the int8 kernels to find which one causes this precision error?
Hi, @yaomichael @jczaja. The following PR was meant to fix the accuracy of PicoDet. After the PR was merged, we found that the accuracy of PicoDet was only 29%, indicating that the accuracy problem has not been completely fixed.
In oneDNN, weights can only be s8, and input can be s8 or u8. The input is u8 when the FC is preceded by an activation that returns positive values, such as ReLU. So the `u8` in gemm_s8u8s32 shows the input data type. But maybe it should be verified whether the data types for the weights are correct.
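To check which data types an FC actually runs with, one can filter the verbose log for inner_product primitives (a sketch; `dnnl.log` follows the logging example above):

```bash
# src_u8/src_s8 and wei_s8 markers in these lines show the chosen input/weight types
grep inner_product dnnl.log
```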
I looked through compute_propagate_scales_mkldnn_pass.cc. Is UpdateScaleOpInScale implemented correctly? What is the purpose of this method, and why not divide by …? Should it look like this?
For the CER of the quantized model: I tested on develop with compute_propagate_scales_mkldnn_pass removed from https://github.com/jakpiase/Paddle/tree/temporary_u2_perf, and the CER can reach 5.7%.
With PR PaddlePaddle/Paddle#47391 (constant folding) merged into develop, the CER can reach 5.8%.
@jakpiase Please update your Paddle branch, because we found that the accuracy of the quantized model is as expected when using the Paddle develop branch.
Fixed by this: PaddlePaddle/Paddle#47574
@jczaja @wozna @yeliang2258 I added a squeeze2+transpose2 fuse for oneDNN in PaddlePaddle/Paddle#47592; please have a look.
6271C machine, FP32 and Int8: cherry-picked to 2.4 in PaddlePaddle/Paddle#47712
@zh794390558, @yeliang2258 I tried to compare the PyTorch U2++ model vs the PaddlePaddle U2++ model to check whether the models are the same. For example, I checked the number and shapes of convolutions and the number and shapes of matrix multiplication operations.
https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/s2t/models/u2/u2.py#L856 |
I think I found the main int8 problem: not all FC ops were quantized, even though they all have scales. Fixing this improves the U2++ int8 model:
I prepared PR PaddlePaddle/Paddle#47780 with an explanation of what was wrong. Please check the results with this PR. I tested it on my CPX machine.
@zh794390558, @yeliang2258, @yaomichael This is one of the recent profilings of U2++ / PaddlePaddle on CLX. You can see that right now the second operator taking the most time is conditional_block_infer. This operator has an Executor inside that executes its child operators (the indented rows in the profiling).
The problem is that the ops belonging to the conditional_block_infer operator are not subject to IR passes, so they can neither be executed by oneDNN kernels nor take part in fuses. Baidu engineers were working on that problem in PaddlePaddle/Paddle#17003, but it was never solved. To speed up U2++ a bit more, it would be good to have this issue fixed. Could you please resume work on enabling IR passes for operators that are part of conditional_block_infer?
Overhead analysis (Paddle int8 vs PyTorch int8)
One of the elements that contributes to the int8 performance gap is the bigger framework overhead in Paddle's case: U2++ Paddle int8 overhead is around 23.5%.
Details of comparison
Flamegraphs were generated for both Paddle and PyTorch when processing the full data set (as given in the instructions): Paddle U2++ int8 and PyTorch U2++ int8. From the flamegraphs we can derive the overhead:
For Paddle we also have profiling from the same machine and the same experiment, which shows that the Paddle overhead is 22.28%:
Conclusion
One of the elements responsible for the poorer performance of U2++ Paddle int8 is a bigger framework overhead (~23%) than in PyTorch's case (9.8%), so further work on reducing overhead is important and needed.
@yeliang2258 will focus on this problem. |
Another question: how do you generate a flame graph?
@zh794390558
Basic example
To profile a Linux command and produce a flamegraph:
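As a minimal sketch using the Rust-based flamegraph crate (the profiled command is a placeholder):

```bash
# Install the Rust-based flamegraph tool (cargo-flamegraph)
cargo install flamegraph
# Profile an arbitrary command; writes flamegraph.svg to the current directory
flamegraph -- ./your_linux_command
```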
Now, as GitHub Markup blocks some SVG scripts, you should download this file and then open it in your web browser (Firefox and Chrome both work) for inspection. The produced example does have "unknown" blocks, which is because we are missing the debug symbols of the profiled workload.
Preparing workload
For PaddlePaddle we need to build as optimized but with debug symbols, for example as in the sketch below. Next we need the perf profiler and software to turn perf's profiling output into a flamegraph.
Installing perf and flamegraph
There are two methods here:
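Referring to the "Preparing workload" step above, a minimal sketch of an optimized build with debug symbols (flags are illustrative; adapt to your usual Paddle build configuration):

```bash
# RelWithDebInfo keeps -O2 optimizations but adds -g debug symbols,
# so flamegraph frames can be resolved to function names
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_GPU=OFF -DWITH_MKLDNN=ON
make -j"$(nproc)"
```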
The next thing is that oneDNN generates assembly code at runtime, i.e. JIT code, and profiling of this kind of code was added to perf / the Linux kernel a bit later, so we need a recent Linux kernel.
Operating system requirements
It works fine on CentOS 8+ and Ubuntu 18.04+. For other OSes, anything providing Linux kernel 5.0+ should be fine. We also need to customize perf to annotate JIT code and to let oneDNN know that the annotations are for the perf format (a sketch follows).
Running U2++ pytorch to get flamegraph
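A sketch of the oneDNN-side setting just mentioned, assuming the jitdump-based profiling mode from oneDNN's documentation (verify the value for your oneDNN version):

```bash
# Ask oneDNN to emit jitdump files in a format perf can consume;
# set this in the environment of the profiled run
export DNNL_JIT_PROFILE=6
```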
Legend:
Running perf introduces 5-10% overhead to execution, and generation of the flamegraph may also take a very long time if …
Generating flamegraph using scripts
This is a method of flamegraph generation using scripts rather than the Rust-based flamegraph crate; a sketch follows below.
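A minimal sketch of the script-based method, assuming Brendan Gregg's FlameGraph scripts and a perf build with jitdump support (the profiled binary is a placeholder):

```bash
# Get the classic FlameGraph scripts
git clone https://github.com/brendangregg/FlameGraph
# Record with call graphs; -k 1 (CLOCK_MONOTONIC) is required for JIT injection
perf record -g -k 1 -- ./your_inference_binary
# Fold oneDNN jitdump annotations into the profile (run with DNNL_JIT_PROFILE=6)
perf inject -j -i perf.data -o perf.jit.data
# Collapse stacks and render the SVG
perf script -i perf.jit.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
```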
Thank you very much for the guide.
Latest result from yeliang:
I have the same error as you described here when I run
cmake -B build
following https://github.com/zh794390558/paddle_build/blob/main/test/u2/patch/README.md. How can I resolve that?