Object reuse for Convolution #1113

Closed
cfRod opened this issue Jul 8, 2021 · 13 comments
@cfRod (Contributor) commented Jul 8, 2021

Hi,

I wanted to know if it is possible to disable object reuse for the convolution operator, as described in the primitive cache documentation (https://docs.oneapi.com/versions/latest/onednn/dev_guide_primitive_cache.html).

I built the oneDNN reference implementation on AArch64 to demonstrate this. Here, a convolution run with two different input slices but the same configuration (filter sizes, input sizes, etc.) reuses the same primitive, regardless of the user setting DNNL_ENABLE_PRIMITIVE_CACHE to OFF at build time or setting the runtime control DNNL_PRIMITIVE_CACHE_CAPACITY to 0.

dnnl_verbose,create:cache_miss,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.0358887
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,8.51099
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,8.41187

I set

set(DNNL_ENABLE_PRIMITIVE_CACHE OFF CACHE INTERNAL "" FORCE)
  message(STATUS "${DNNL_ENABLE_PRIMITIVE_CACHE}")

This should set DNNL_PRIMITIVE_CACHE_CAPACITY internally to 0. However, for the convolution operator, the code path that checks for cache hits and misses is never called again; is that expected?

static status_t create_primitive_common(
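
For reference, the cache capacity can also be controlled at run time through the library API rather than the build option; a minimal sketch, assuming the oneDNN 2.x C++ API:

    #include "oneapi/dnnl/dnnl.hpp"

    int main() {
        // Disable the primitive cache at run time; this has the same effect as
        // setting the DNNL_PRIMITIVE_CACHE_CAPACITY environment variable to 0.
        dnnl::set_primitive_cache_capacity(0);

        // Should report 0 once the cache is disabled.
        return dnnl::get_primitive_cache_capacity() == 0 ? 0 : 1;
    }
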
cfRod added the question label Jul 8, 2021
@igorsafo (Contributor) commented Jul 9, 2021

Hi @cfRod ,
Could you please share the example?
From what I see in the logs you shared, the convolution primitive object is created once and then executed twice:

    void app::execute_2_convs() {
        ...
        auto conv_prim = convolution_forward(conv_pd);
        ...
        conv_prim.execute(engine_stream, conv_args1);
        conv_prim.execute(engine_stream, conv_args2);
    }

Such a case is not relevant to the primitive cache, because the same convolution primitive object is reused and there is no need for the cache. The cache is needed when an application or framework can't keep oneDNN primitive objects alive between executions and has to recreate them before each execution:

    void app::execute_conv() { // conv will be destroyed once execute_conv is finished
        ...
        auto conv_prim = convolution_forward(conv_pd);
        ...
        conv_prim.execute(engine_stream, conv_args);
    }
    void app::execute_2_convs() {
        execute_conv();
        execute_conv(); // if cache is enabled conv object will be recreated much faster due to cached information
    }
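
To make the distinction concrete, here is a minimal sketch of the recreation pattern, assuming the oneDNN 2.x C++ API and the 1x2x9x9 convolution shape from the verbose log above; with the cache enabled, the second creation shows up as create:cache_hit in the verbose output, while with the capacity set to 0 both creations are cache_miss:

    #include "oneapi/dnnl/dnnl.hpp"
    using namespace dnnl;

    // Build the same convolution primitive from scratch each time it is needed.
    static convolution_forward make_conv(const engine &eng) {
        memory::desc src_md({1, 2, 9, 9}, memory::data_type::f32, memory::format_tag::nchw);
        memory::desc wei_md({2, 2, 3, 3}, memory::data_type::f32, memory::format_tag::oihw);
        memory::desc bia_md({2}, memory::data_type::f32, memory::format_tag::x);
        memory::desc dst_md({1, 2, 7, 7}, memory::data_type::f32, memory::format_tag::nchw);
        convolution_forward::desc d(prop_kind::forward_training,
                algorithm::convolution_direct, src_md, wei_md, bia_md, dst_md,
                /*strides*/ {1, 1}, /*padding_l*/ {0, 0}, /*padding_r*/ {0, 0});
        return convolution_forward(convolution_forward::primitive_desc(d, eng));
    }

    int main() {
        engine eng(engine::kind::cpu, 0);
        auto conv1 = make_conv(eng); // dnnl_verbose: create:cache_miss
        auto conv2 = make_conv(eng); // dnnl_verbose: create:cache_hit (if the cache is enabled)
        (void)conv1; (void)conv2;
        return 0;
    }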

Regards,
Igor Safonov

igorsafo self-assigned this Jul 9, 2021
cfRod changed the title from "Primitive caching for Convolution" to "Object reuse for Convolution" Jul 9, 2021
@cfRod (Contributor, Author) commented Jul 9, 2021

Hi @igorsafo ,
Thanks for the explanation. We've seen an issue with a test case in MXNet for oneDNN+ACL: apache/mxnet#20265

The specific test case is https://github.com/apache/incubator-mxnet/blob/4e3a4c903f8befd81a0e90c29484c17ced35c519/tests/python/unittest/test_operator.py#L1707
Update: I have edited the code snippet below to show that I am running num_group = 2.

def test_convolution_grouping():
    for dim in [2]:
        num_filter = 4
        for num_group in [2]:
            kernel = (3,) * dim
            shape = (1, 4) + (9,) * dim

            x = mx.sym.Variable('x')
            w = mx.sym.Variable('w')
            b = mx.sym.Variable('b')
            y1 = mx.sym.Convolution(data=x, weight=w, bias=b, num_filter=num_filter, num_group=num_group, kernel=kernel)
            xslice = mx.sym.SliceChannel(data=x, num_outputs=num_group, axis=1)
            wslice = mx.sym.SliceChannel(data=w, num_outputs=num_group, axis=0)
            bslice = mx.sym.SliceChannel(data=b, num_outputs=num_group, axis=0)
            y2 = mx.sym.Concat(*[mx.sym.Convolution(data=xslice[i], weight=wslice[i], bias=bslice[i],
                                                    num_filter=num_filter//num_group, kernel=kernel)
                            for i in range(num_group)])

            exe1 = y1._simple_bind(default_context(), x=shape)
            exe2 = y2._simple_bind(default_context(), x=shape, w=(num_filter, shape[1]//num_group) + kernel, b=(num_filter,))
            for arr1, arr2 in zip(exe1.arg_arrays, exe2.arg_arrays):
                arr1[:] = np.float32(np.random.normal(size=arr1.shape))
                arr2[:] = arr1
            exe1.forward(is_train=True)
            exe1.backward(exe1.outputs[0])
            exe2.forward(is_train=True)
            exe2.backward(exe2.outputs[0])

            for arr1, arr2 in zip(exe1.outputs + exe1.grad_arrays, exe2.outputs + exe2.grad_arrays):
                np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)

Here, the first convolution is done by the gemm reference implementation, as it is not supported by ACL, and the two subsequent convolutions with different input and weight slices are done by ACL. The results from ACL are incorrect for the second slice, and this is down to a limitation in ACL, as explained in that issue.

Is it possible to prevent object reuse in oneDNN (one create followed by two executes) and create a new primitive every time, or, judging from your comment, is this controlled by the framework/application?

@TaoLv (Contributor) commented Jul 9, 2021

I remember there is a primitive object cache on the MXNet side, which was implemented before the feature was officially supported in the library.

@igorsafo (Contributor) commented Jul 9, 2021

Interesting, thanks for the details. According to your code snippet, the convolution should be recreated at the second iteration of for num_group in [1, 2]:, but in that case oneDNN verbose should print one more line, dnnl_verbose,create:cache_hit,cpu,convolution,... Is it possible that MXNet has its own cache for layers, so the oneDNN convolution stays alive between iterations? If that is the case, then the cache should be disabled on the framework side. As far as I understand, the issue is not related to the oneDNN cache mechanism, because the convolution is not recreated with cache_hit (which is the sign that a cached primitive is being reused). Please be aware that disabling the cache on the framework side will affect both Intel and Arm CPUs, so it might bring performance degradation.
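
To illustrate the suspected failure mode with such a framework-side layer cache, here is a hypothetical sketch (not the actual MXNet code; all names are assumed): if the cache key captures only shapes and attributes, the same convolution object is handed back even when the weights buffer changes, and an implementation that prepacks its weights on first use never sees the second set of weights:

    // Hypothetical framework-side op cache, keyed on shapes/attributes only.
    #include <map>
    #include <memory>
    #include <string>

    struct ConvPrimitive {
        const void *prepacked_weights = nullptr; // set once, on first execution
        void execute(const void *weights) {
            // "Prepack" happens only the first time; a new weights buffer passed
            // on later calls is silently ignored.
            if (!prepacked_weights) prepacked_weights = weights;
            // ... compute using prepacked_weights ...
        }
    };

    std::map<std::string, std::shared_ptr<ConvPrimitive>> op_cache;

    std::shared_ptr<ConvPrimitive> get_conv(const std::string &shape_key) {
        auto it = op_cache.find(shape_key);
        if (it != op_cache.end()) return it->second; // reused: no new prepack
        return op_cache[shape_key] = std::make_shared<ConvPrimitive>();
    }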

Another strange behavior you described in the MXNet issue is that the weights are treated as immutable and don't get changed between conv::execute() calls:

However, the Compute Library assumes weights are immutable, because it does some preprocessing of the weights internally, once per new primitive creation; in this case, the second time around, the new set of weights is not used by ACL.

In the source code I see that the gemm-based implementation sets the appropriate weights memory buffer on each execution call:

acl_obj.wei_tensor.allocator()->import_memory(

Anyway, regardless of the workaround, it seems we need to find the issue in the ACL integration into oneDNN and fix it there, to avoid workarounds in every framework. Tagging @nSircombe and @alenik01 as developers of the ACL integration.

@igorsafo (Contributor) commented Jul 9, 2021

@cfRod @TaoLv On second thought, does this mean that MXNet has a bug in the hashing function of its cache? How is it possible that an object for a convolution with the number of groups equal to 1 is reused for a convolution with the number of groups equal to 2? These convolutions should have different memory descriptors for the weights and different implementations.

@cfRod (Contributor, Author) commented Jul 9, 2021

@igorsafo, I don't think MXNet has a bug in its hashing function, as you can see from my logs in the MXNet issue (posting them here again for reference). I should have amended the code to show that I had fixed num_group to 2. Adding it here:

def test_convolution_grouping():
    for dim in [2]:
        num_filter = 4
        for num_group in [2]:
            kernel = (3,) * dim
            shape = (1, 4) + (9,) * dim

            x = mx.sym.Variable('x')
            w = mx.sym.Variable('w')
            b = mx.sym.Variable('b')
            y1 = mx.sym.Convolution(data=x, weight=w, bias=b, num_filter=num_filter, num_group=num_group, kernel=kernel)
            xslice = mx.sym.SliceChannel(data=x, num_outputs=num_group, axis=1)
            wslice = mx.sym.SliceChannel(data=w, num_outputs=num_group, axis=0)
            bslice = mx.sym.SliceChannel(data=b, num_outputs=num_group, axis=0)
            y2 = mx.sym.Concat(*[mx.sym.Convolution(data=xslice[i], weight=wslice[i], bias=bslice[i],
                                                    num_filter=num_filter//num_group, kernel=kernel)
                            for i in range(num_group)])

            exe1 = y1.simple_bind(default_context(), x=shape)
            exe2 = y2.simple_bind(default_context(), x=shape, w=(num_filter, shape[1]//num_group) + kernel, b=(num_filter,))
            for arr1, arr2 in zip(exe1.arg_arrays, exe2.arg_arrays):
                arr1[:] = np.float32(np.random.normal(size=arr1.shape))
                arr2[:] = arr1
                print("GEMM Ref input")
                print (arr1)
                print("ACL input")
                print (arr2)

            exe1.forward(is_train=True)
            exe1.backward(exe1.outputs[0])
            exe2.forward(is_train=True)
            exe2.backward(exe2.outputs[0])
            print("GEMM OUTPUT")
            print(exe1.outputs)
            print("ACL OUTPUT")
            print(exe2.outputs)

            for arr1, arr2 in zip(exe1.outputs, exe2.outputs):
                np.testing.assert_allclose(arr1.asnumpy(), arr2.asnumpy(), rtol=1e-3, atol=1e-3)



/home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
[14:52:42] /home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
dnnl_verbose,info,oneDNN v2.0.0 (commit 83ebc40d86bc54f0f23e947235e53570eeacf254)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Generic
dnnl_verbose,info,gpu,runtime:none

# THIS IS GEMM REFERENCE

dnnl_verbose,create:cache_miss,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.130859
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,19.759

# THIS IS GEMM ACL with sliced inputs
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.0969238
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.415039
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.321045

# THIS IS THE CONCAT STAGE
dnnl_verbose,create:cache_miss,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.0700684
dnnl_verbose,exec,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.468994

It creates two objects: one for the grouped convolution (num_group = 2), where the gemm reference is called because ACL does not support it, and a second one for the non-grouped case (num_group = 1), where a single convolution object is created and then passed two different slices of inputs and weights.

@nSircombe and @alenik01 are in the same team as me :D.

@cfRod (Contributor, Author) commented Jul 9, 2021

Is it possible that MXNet has its own cache for layers, so the oneDNN convolution stays alive between iterations?

Yes, I think this is the case.

@igorsafo (Contributor) commented Jul 9, 2021

@nSircombe and @alenik01 are in the same team as me :D.

My bad, got it :)

It creates two objects: one for the grouped convolution (num_group = 2), where the gemm reference is called because ACL does not support it, and a second one for the non-grouped case (num_group = 1), where a single convolution object is created and then passed two different slices of inputs and weights.

Thanks, this helps. Is slicing an in-place operation in this case? Does the convolution work with dense slices, or does it work with strided memory that points into the original buffer of a variable (x, w, b)? In the first case the same convolution can be reused; in the second case the convolutions should be different.
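
For context on the dense-versus-strided distinction in oneDNN terms: a dense slice gets its own packed memory descriptor, while a strided view of the original buffer is described by a sub-memory descriptor that inherits the parent's strides, and the two descriptors are not interchangeable. A minimal sketch, assuming the oneDNN 2.x C++ API:

    #include "oneapi/dnnl/dnnl.hpp"
    using namespace dnnl;

    int main() {
        // Dense 1x2x9x9 slice: contiguous nchw layout with its own strides.
        memory::desc dense_slice({1, 2, 9, 9}, memory::data_type::f32,
                memory::format_tag::nchw);

        // Strided view: the same logical 1x2x9x9 block carved out of the original
        // 1x4x9x9 buffer, so it inherits the parent's (larger) channel stride.
        memory::desc parent({1, 4, 9, 9}, memory::data_type::f32,
                memory::format_tag::nchw);
        memory::desc strided_view = parent.submemory_desc({1, 2, 9, 9}, {0, 2, 0, 0});

        // Same shape, different layouts: a primitive created for one descriptor
        // is not necessarily valid for the other.
        return dense_slice == strided_view ? 1 : 0; // expected: not equal, returns 0
    }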

Regards,
Igor Safonov

@cfRod (Contributor, Author) commented Jul 9, 2021

Another strange behavior you described in the MXNet issue is that the weights are treated as immutable and don't get changed between conv::execute() calls:

From what I understand, ACL internally calls a weights reshape function (https://github.com/ARM-software/ComputeLibrary/blob/f7399fd616cc8ccedfae668da3023e589a7ef558/src/runtime/NEON/functions/NEGEMMConvolutionLayer.cpp#L618), and it does this step only the first time around. On the second call to the same convolution it does not, and this causes the issue. We had a similar problem with TensorFlow caching (tensorflow/tensorflow#47415).

@cfRod (Contributor, Author) commented Jul 9, 2021

...we need to find the issue in the ACL integration into oneDNN and fix it there, to avoid workarounds in every framework

Agreed. I initially thought it was on the oneDNN side, but thanks for your explanation; it made things clearer.

@igorsafo (Contributor) commented Jul 9, 2021

Another strange behavior you described in the MXNet issue is that the weights are treated as immutable and don't get changed between conv::execute() calls:

From what I understand, ACL internally calls a weights reshape function (https://github.com/ARM-software/ComputeLibrary/blob/f7399fd616cc8ccedfae668da3023e589a7ef558/src/runtime/NEON/functions/NEGEMMConvolutionLayer.cpp#L618), and it does this step only the first time around. On the second call to the same convolution it does not, and this causes the issue.

I see, so the convolution object is tied to a particular weights memory buffer. I don't think we can control that using the oneDNN API: in oneDNN a primitive is immutable during execution and all buffers are passed to it as arguments, but it seems ACL keeps some internal context.

  • As a workaround, it can be disabled on the framework side by creating a new oneDNN convolution every time for the MXNet convolution operation (for the Arm architecture only, to avoid regressions on the Intel side).

  • As a long-term solution, it would be helpful if ACL provided an external API to prepack weights. We could then handle that reorder on the oneDNN implementation side, invoking it for the convolution weights whenever they change.

// pseudo code of src/cpu/aarch64/acl_gemm_convolution.cpp
status_t acl_gemm_convolution_fwd_t<src_type, wei_type, dst_type,
        bia_type>::execute_forward(const exec_ctx_t &ctx) const {
    auto new_weights = CTX_IN_MEM(const wei_data_t *, DNNL_ARG_WEIGHTS);
    auto prev_weights_storage =  ctx.get_resource_mapper()->get<acl_resource_t>(this)->prev_weights;
    auto prev_weights = *prev_weights_storage;
    if (new_weights != prev_weights) {
        ACL->prepare_weights(new_weights);
        *prev_weights_storage = new_weights;
    }
    ACL->convolution();
}

@igorsafo (Contributor) commented Jul 9, 2021

Actually, it can be fixed on the ACL side as well, by checking the incoming weights pointer and redoing the reordering inside ACL if it has changed. This would be easier, since it would not require a new ACL API or changes on the oneDNN side.

@cfRod (Contributor, Author) commented Jul 13, 2021

@igorsafo, thanks for the information. We do have a plan to fix this in ACL, but until then we will work on a workaround as a temporary solution. I think fixing the MXNet hashing to avoid reuse on Arm will have to do in this case. I think this issue can be closed now.
Thanks!
