MatMulNBits + Add fusion #20587

edgchen1 · 2024-05-07T00:13:42Z

Description

Add an optional bias input to MatMulNBits.
Add a graph transformer that fuses MatMulNBits + Add into a single MatMulNBits node, passing the Add operand as the bias input.
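
A minimal sketch, in plain Python, of why folding the Add into the matrix multiply preserves the result. It assumes the Add operand is a per-output-column bias vector broadcast over rows. Quantization, block-wise scales, and zero points are left out entirely, and the function names are hypothetical illustrations, not ONNX Runtime APIs.

```python
def matmul(a, w):
    """Plain (M x K) @ (K x N) product on nested lists."""
    k_dim, n_dim = len(w), len(w[0])
    return [
        [sum(row[k] * w[k][j] for k in range(k_dim)) for j in range(n_dim)]
        for row in a
    ]


def matmul_then_add(a, w, bias):
    """Unfused form: a matmul node followed by a separate Add node."""
    y = matmul(a, w)
    return [[v + bias[j] for j, v in enumerate(row)] for row in y]


def matmul_with_bias(a, w, bias):
    """Fused form: each output accumulator starts at bias[j]."""
    k_dim, n_dim = len(w), len(w[0])
    out = []
    for row in a:
        out_row = []
        for j in range(n_dim):
            acc = bias[j]  # bias is applied inside the matmul loop
            for k in range(k_dim):
                acc += row[k] * w[k][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Small integer inputs so the comparison is exact.
a = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0], [0, 1], [2, 2]]
bias = [10, -10]

unfused = matmul_then_add(a, w, bias)
fused = matmul_with_bias(a, w, bias)
print(unfused == fused)
```

The fused form writes each output element once instead of writing it in the matmul and then reading and rewriting it in a second kernel, which is the kind of saving the measurements below look for.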

Motivation and Context

Improve performance: fusing the Add into MatMulNBits removes most standalone Add kernel invocations (see Measurements).

Measurements

Measured with a Phi-2 int4 model using onnxruntime_perf_test on an x64 machine.

With ORT profiling enabled:

Baseline:

------ Top CPU Kernel Times ------
                  name  duration   pct  count  cumulative_pct  cumulative_dur
           MatMulNBits   3458639 74.28  19300           74.28         3458639
                   Add    479686 10.30  19600           84.59         3938325
    MultiHeadAttention    207680  4.46   3200           89.05         4146005
       RotaryEmbedding    190082  4.08   6400           93.13         4336087
              FastGelu    125999  2.71   3200           95.84         4462086
SkipLayerNormalization    105812  2.27   3200           98.11         4567898
             Unsqueeze     17475  0.38    900           98.48         4585373
                Gather     13065  0.28    600           98.76         4598438
                Concat     10644  0.23    400           98.99         4609082
                 Where      9325  0.20    400           99.19         4618407
                 Shape      7823  0.17    400           99.36         4626230
                Expand      4330  0.09    200           99.45         4630560
                  Cast      4079  0.09    200           99.54         4634639
                 Equal      3962  0.09    200           99.63         4638601
    LayerNormalization      3344  0.07    100           99.70         4641945
                 Slice      2482  0.05    100           99.75         4644427
                   Sub      2253  0.05    100           99.80         4646680
                 Range      1914  0.04    100           99.84         4648594
               Reshape      1892  0.04    100           99.88         4650486
                  Less      1851  0.04    100           99.92         4652337
               Squeeze      1836  0.04    100           99.96         4654173
       ConstantOfShape      1817  0.04    100          100.00         4655990

Updated:

------ Top CPU Kernel Times ------
                  name  duration   pct  count  cumulative_pct  cumulative_dur
           MatMulNBits   3446419 81.21  19300           81.21         3446419
    MultiHeadAttention    203500  4.80   3200           86.00         3649919
       RotaryEmbedding    194088  4.57   6400           90.58         3844007
              FastGelu    123723  2.92   3200           93.49         3967730
SkipLayerNormalization    104530  2.46   3200           95.96         4072260
                   Add     83163  1.96   3500           97.92         4155423
             Unsqueeze     17609  0.41    900           98.33         4173032
                Gather     13474  0.32    600           98.65         4186506
                Concat     10648  0.25    400           98.90         4197154
                 Where      9162  0.22    400           99.12         4206316
                 Shape      7687  0.18    400           99.30         4214003
                Expand      4265  0.10    200           99.40         4218268
                  Cast      4213  0.10    200           99.50         4222481
                 Equal      4040  0.10    200           99.59         4226521
    LayerNormalization      3324  0.08    100           99.67         4229845
                 Slice      2497  0.06    100           99.73         4232342
                   Sub      2228  0.05    100           99.78         4234570
               Reshape      1909  0.04    100           99.83         4236479
                 Range      1880  0.04    100           99.87         4238359
                  Less      1855  0.04    100           99.91         4240214
               Squeeze      1844  0.04    100           99.96         4242058
       ConstantOfShape      1804  0.04    100          100.00         4243862

Average inferences/sec without profiling enabled:
Baseline: 29.6533
Updated: 30.521
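
The figures above can be summarized with a little arithmetic. The sketch below only re-uses numbers copied from the two profiles and the throughput lines; the variable names are illustrative, and the duration unit is left as whatever the profiler reports. Note that the kernel totals come from runs with profiling enabled and the throughput from runs without it, so the two percentages are not directly comparable.

```python
# Throughput without profiling (inferences/sec), copied from above.
baseline_ips = 29.6533
updated_ips = 30.521
throughput_gain_pct = (updated_ips / baseline_ips - 1) * 100

# Add kernel rows from the two profiles.
baseline_add = {"duration": 479686, "count": 19600}
updated_add = {"duration": 83163, "count": 3500}
add_calls_removed = baseline_add["count"] - updated_add["count"]
add_time_saved = baseline_add["duration"] - updated_add["duration"]

# Final cumulative_dur values (total profiled CPU kernel time).
baseline_total = 4655990
updated_total = 4243862
total_reduction_pct = (baseline_total - updated_total) / baseline_total * 100

print(f"throughput gain: {throughput_gain_pct:.2f}%")
print(f"Add calls removed: {add_calls_removed}")
print(f"Add time saved: {add_time_saved}")
print(f"profiled kernel time reduction: {total_reduction_pct:.2f}%")
```

By this calculation, nearly all of the drop in profiled kernel time comes from the Add row, while the MatMulNBits duration stays roughly flat with the same invocation count.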

onnxruntime/test/testdata/transform/runtime_optimization/matmulnbits_add_gen.py

+import sys
+
+import numpy as np
+import onnx


onnxruntime/core/optimizer/graph_transformer_utils.cc

onnxruntime/test/optimizer/graph_transform_test.cc

skottmckay

edgchen1 added 25 commits April 29, 2024 19:38

save work

350f3e2

handle bias in fallback impl

752b7c9

reorder includes

4121ff4

save work - initial impl of optimizer

96fa18b

Merge remote-tracking branch 'origin/main' into edgchen1/matmul_nbits_bias

11c3b92

Add test, refine impl.

baab944

Merge remote-tracking branch 'origin/main' into edgchen1/matmul_nbits_bias

1568758

fix fusion, update test

6402ea5

Enable bias in SQNBitGemm benchmark.

6a1feb2

clean up

a6c42e4

move test functions around to avoid unused function warnings

7fa290b

add /bigobj for graph_transform_test.cc

b5057eb

lint

823c9db

Add Neural Speed build to post-merge-jobs.yml.

7555f92

move graph_utils::IsSupportedProvider() in header so it is available in minimal extended build

10ba1db

put MatMulNBitsBiasFusion test in contrib ops ifdef

471429c

update ContribOperators.md

7f07d93

call RunWithConfig instead of Run

97a15a1

update OperatorKernels.md

905cbd2

add matmul_nbits_fusion files to extended minimal build source

0abc2cd

Merge remote-tracking branch 'origin/main' into edgchen1/matmul_nbits_bias

1a4b5a0

Make DML MatMulNBits input check accommodate new inputs.

50cb266

expect input arg counts of 0 for missing optional inputs

131141f

Merge branch 'edgchen1/matmul_nbits_bias' of github.com:microsoft/onnxruntime into edgchen1/matmul_nbits_bias

2417b80

add runtime optimization test for matmulnbits and add fusion

1d9c2f5

edgchen1 marked this pull request as ready for review May 10, 2024 00:56

edgchen1 requested a review from a team as a code owner May 10, 2024 00:56

github-advanced-security bot found potential problems May 10, 2024

onnxruntime/test/testdata/transform/runtime_optimization/matmulnbits_add_gen.py (fixed)

github-advanced-security bot found potential problems May 10, 2024

onnxruntime/test/testdata/transform/runtime_optimization/matmulnbits_add_gen.py (fixed)

edgchen1 requested a review from skottmckay May 10, 2024 01:03

edgchen1 requested review from liqunfu and yufenglee May 10, 2024 01:03

edgchen1 added 2 commits May 9, 2024 18:48

add test onnx file

afacb57

lint

55851e3

github-advanced-security bot found potential problems May 10, 2024

yufenglee reviewed May 14, 2024

onnxruntime/core/optimizer/graph_transformer_utils.cc (resolved)

yufenglee reviewed May 14, 2024

onnxruntime/test/optimizer/graph_transform_test.cc (resolved)

edgchen1 added 2 commits May 14, 2024 18:21

Add !ORT_NEURAL_SPEED ifdefs around adding MatMulNBitsFusion transformer.

badcd0f

add ifdef around test

b90c8ca

yufenglee previously approved these changes May 15, 2024

Merge branch 'main' into edgchen1/matmul_nbits_bias

92ce6eb

edgchen1 dismissed yufenglee’s stale review via 92ce6eb May 15, 2024 23:54

skottmckay approved these changes May 16, 2024

edgchen1 merged commit e81c867 into main May 16, 2024
93 of 96 checks passed

edgchen1 deleted the edgchen1/matmul_nbits_bias branch May 16, 2024 18:01
