
Change warning for placeholder size to exception #1143

Open · wants to merge 27 commits into base: master

Conversation

@Dark-Rock commented Oct 13, 2022

Fix #1141.

Introduces a new behavior version (#508).

@Dark-Rock requested review from albertz and a team as code owners October 13, 2022 09:34
@albertz (Member) commented Oct 13, 2022

Thanks for the PR.

There are some issues with it. As a starting point for contributing, see here: https://github.com/rwth-i6/returnn/blob/master/CONTRIBUTING.md

  • The syntax style does not follow PEP8 and other standard Python conventions, which are a requirement here in the repo.
  • I'm not sure we really need a dedicated exception type here. This seems like over-engineering and not really useful or required to me.
  • This must introduce a new behavior version (see issue #508, "Ideas for RETURNN behaviour changes"), which is lacking here. See similar previous issues and PRs which introduced a new behavior version (same label), e.g. #1144 (recover relaxed behavior, strict with new behavior version), #899 (batch norm, new defaults), #864 (fix #666, broadcast no longer matches dims).
  • It also changes the behavior more than what was said in the issue, which was just to make the warning an exception and nothing else. This likely causes other things to break; see the failing test cases.
  • Fix the failing test cases. The tests run with the latest behavior version anyway. We need to investigate whether these are problems which should be fixed, or whether some failures are valid; in that case, we need to discuss what to do about them.
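For context on the behavior-version gating discussed here: the BehaviorVersion.require(condition, message, version) call that appears later in the stack traces can be modeled with a small self-contained sketch. This is a simplified stand-in, not RETURNN's actual returnn.util.basic.BehaviorVersion; only the require signature and the error-message format are taken from the traces in this thread.

```python
import warnings


class BehaviorVersion:
    """Simplified stand-in illustrating behavior-version gating:
    a failed check raises only once the configured behavior version
    reaches the threshold; older setups still only get a warning.
    """

    _behavior_version = 0  # in RETURNN this would come from the config

    class RequirementNotSatisfied(Exception):
        """Raised when a check fails under a new-enough behavior version."""

    @classmethod
    def require(cls, condition, message, version):
        """Warn or raise depending on the configured behavior version."""
        if condition:
            return
        if cls._behavior_version >= version:
            raise cls.RequirementNotSatisfied(
                "%s (required since behavior_version >= %i)" % (message, version))
        warnings.warn(message)
```

With _behavior_version below the threshold a failing check only warns; at or above it, the same check raises. That switch is exactly what introducing a new behavior version changes for existing setups.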

@albertz (Member) commented Oct 13, 2022

Can you see the output of the failing tests?

For example:

  returnn/tf/util/data.py:21: WEAK WARNING Pep8CodeStyle: E303: too many blank lines (3)
  returnn/tf/util/data.py:990: WEAK WARNING Pep8CodeStyle: E501: line too long (171 > 120 characters)
  returnn/exceptions/dimension_exception.py:1: WEAK WARNING PyMissingOrEmptyDocstringInspection: Missing docstring
  returnn/exceptions/dimension_exception.py:2: WEAK WARNING Pep8CodeStyle: W291: trailing whitespace
  returnn/exceptions/dimension_exception.py:3: WEAK WARNING Pep8CodeStyle: W291: trailing whitespace
  returnn/exceptions/dimension_exception.py:12: WEAK WARNING Pep8CodeStyle: E231: missing whitespace after ','
  returnn/exceptions/dimension_exception.py:12: WEAK WARNING PyPep8NamingInspection: Argument name should be lowercase
  returnn/exceptions/dimension_exception.py:12: WEAK WARNING PyPep8NamingInspection: Argument name should be lowercase
  returnn/exceptions/dimension_exception.py:16: WARNING PyCompatibilityInspection: Python version 2.7 does not support this syntax. super() should have arguments in Python 2
  returnn/exceptions/dimension_exception.py:17: WEAK WARNING Pep8CodeStyle: W292: no newline at end of file
  returnn/exceptions/dimension_exception.py:17: WEAK WARNING Pep8CodeStyle: W293: blank line contains whitespace

@Dark-Rock (Author) commented:
Thanks for the feedback, working on it.

@albertz (Member) commented Oct 13, 2022 via email

@albertz (Member) commented Oct 13, 2022

See my updated comment above.

As an example on the test cases:

def test_Data_get_common_data_extra_static_spatial():
  d1 = Data(name='t', shape=(None, 32, 128), dtype='float32', auto_create_placeholders=True)
  d2 = Data(name='r', shape=(None, 32, 128), dtype='float32', auto_create_placeholders=True)
  d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
  common = Data.get_common_data([d1, d2])
  assert d1.shape == common.shape

So it makes sense that this fails now because d1 and d2 both have placeholders for their dynamic spatial dim.

It might work if you just remove the auto_create_placeholders=True. I'm not sure. You need to try.

This is a tricky example. We need to understand what it was actually testing. I think it was specifically testing the behavior for some older setups. If my suggestion above does not work, you could also wrap it all up in something like:

orig_behavior_version = BehaviorVersion._behavior_version
try:
  BehaviorVersion._behavior_version = 0
  ...  # now the test
finally:
  BehaviorVersion._behavior_version = orig_behavior_version

However, we should be hesitant to use this too often. If possible, we should always update the test case properly. Only if we are pretty sure the test exists to ensure that some old behavior or old setup keeps working, and updating it would miss some crucial part of the test, do I think it is valid to use this.
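The try/finally pattern above can be wrapped once in a context manager so each legacy test stays a one-liner. A minimal sketch, assuming only that the class exposes a _behavior_version attribute as in the snippet above; the DummyBehaviorVersion class here is purely illustrative, not RETURNN code.

```python
from contextlib import contextmanager


@contextmanager
def behavior_version(cls, version):
    """Temporarily override cls._behavior_version, restoring it on exit
    even if the wrapped test raises."""
    orig = cls._behavior_version
    cls._behavior_version = version
    try:
        yield
    finally:
        cls._behavior_version = orig


class DummyBehaviorVersion:
    """Illustrative stand-in so the sketch is self-contained."""
    _behavior_version = 14


with behavior_version(DummyBehaviorVersion, 0):
    assert DummyBehaviorVersion._behavior_version == 0
    # ... run the legacy test here ...
assert DummyBehaviorVersion._behavior_version == 14  # restored
```

The finally clause guarantees the original version is restored even when the legacy test fails, so one broken test cannot leak an old behavior version into the rest of the test run.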

@Dark-Rock (Author) commented Oct 14, 2022

I'm running into this issue when trying to execute tests. I'm working on a Mac M1; any idea how I can fix it?

Executing: test_FastBaumWelch
Make op...
WARNING: OpMaker: no BLAS lib found
OpCodeCompiler: <OpCodeCompiler 'FastBaumWelchOp' in '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5'>
OpCodeCompiler: No need to recompile: /var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5/FastBaumWelchOp.so
EXCEPTION
Traceback (most recent call last):
  File "/Users/darkrock/returnn/tests/test_TFNativeOp.py", line 3191, in <module>
    line: v()
    locals:
      v = <local> <function test_FastBaumWelch at 0x15ebc9c60>
  File "/Users/darkrock/returnn/tests/test_TFNativeOp.py", line 1541, in test_FastBaumWelch
    line: op = make_fast_baum_welch_op(compiler_opts=dict(verbose=True))  # will be cached, used inside :func:`fast_baum_welch`
    locals:
      op = <not found>
      make_fast_baum_welch_op = <global> <function make_fast_baum_welch_op at 0x15e68e7a0>
      compiler_opts = <not found>
      dict = <builtin> <class 'dict'>
      verbose = <not found>
  File "/Users/darkrock/returnn/returnn/tf/native_op.py", line 1237, in make_fast_baum_welch_op
    line: return maker.make_op()
    locals:
      maker = <local> <returnn.tf.native_op.OpMaker object at 0x15eb6e650>
      maker.make_op = <local> <bound method OpMaker.make_op of <returnn.tf.native_op.OpMaker object at 0x15eb6e650>>
  File "/Users/darkrock/returnn/returnn/tf/native_op.py", line 561, in OpMaker.make_op
    line: mod = self._make_mod()
    locals:
      mod = <not found>
      self = <local> <returnn.tf.native_op.OpMaker object at 0x15eb6e650>
      self._make_mod = <local> <bound method OpMaker._make_mod of <returnn.tf.native_op.OpMaker object at 0x15eb6e650>>
  File "/Users/darkrock/returnn/returnn/tf/native_op.py", line 548, in OpMaker._make_mod
    line: mod = comp.load_tf_module()
    locals:
      mod = <not found>
      comp = <local> <OpCodeCompiler 'FastBaumWelchOp' in '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5'>
      comp.load_tf_module = <local> <bound method OpCodeCompiler.load_tf_module of <OpCodeCompiler 'FastBaumWelchOp' in '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5'>>
  File "/Users/darkrock/returnn/returnn/tf/util/basic.py", line 2916, in OpCodeCompiler.load_tf_module
    line: self._tf_mod = tf.load_op_library(self._so_filename)
    locals:
      self = <local> <OpCodeCompiler 'FastBaumWelchOp' in '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5'>
      self._tf_mod = <local> None
      tf = <global> <module 'tensorflow' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tensorflow/__init__.py'>
      tf.load_op_library = <global> <function load_op_library at 0x15beb45e0>
      self._so_filename = <local> '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5/FastBaumWelchOp.so', len = 129
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    line: lib_handle = py_tf.TF_LoadLibrary(library_filename)
    locals:
      lib_handle = <not found>
      py_tf = <global> <module 'tensorflow.python.client.pywrap_tf_session' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/tensorflow/python/client/pywrap_tf_session.py'>
      py_tf.TF_LoadLibrary = <global> <built-in method TF_LoadLibrary of PyCapsule object at 0x120c08a20>
      library_filename = <local> '/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5/FastBaumWelchOp.so', len = 129
NotFoundError: dlopen(/var/folders/tp/cmf8n7117pvg392c0mg3ycn40000gn/T//darkrock/returnn_tf_cache/ops/FastBaumWelchOp/07d8c452e5/FastBaumWelchOp.so, 0x0006): symbol not found in flat namespace (__ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl12lts_2022062311string_viewENSt3__110unique_ptrINS0_15OpKernelFactoryENS8_14default_deleteISA_EEEE

@albertz (Member) commented Oct 14, 2022

You don't need to run all the tests. You can also just check them here in the issue, or otherwise only run test_TFUtil.py, test_TFEngine.py, test_TFNetworkLayer.py, test_TFNetworkRecLayer.py.

@Dark-Rock (Author) commented:

It seems like we have to go for that implementation:

orig_behavior_version = BehaviorVersion._behavior_version
try:
  BehaviorVersion._behavior_version = 0
  ...  # now the test
finally:
  BehaviorVersion._behavior_version = orig_behavior_version

Though, I'm not familiar with the code base: should I add this section to every failing test?

@albertz (Member) commented Oct 14, 2022

> It seems like we have to go for that implementation

Why? As I said, I really would try to avoid this. Did you try my other suggestion with removing auto_create_placeholders=True? I think that should also work. Or if not, what is the error you get?

@Dark-Rock (Author) commented:

When I try to run the code with the changes you suggested, this is the error I get when running test_TFUtil.py:

Executing: test_Data_copy_template_excluding_time_dim_two_time_dims
EXCEPTION
Traceback (most recent call last):
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3941, in <module>
    line: v()
    locals:
      v = <local> <function test_Data_copy_template_excluding_time_dim_two_time_dims at 0x165fe1c60>
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 202, in test_Data_copy_template_excluding_time_dim_two_time_dims
    line: assert set(data.size_placeholder.keys()) == {0, 1}
    locals:
      set = <builtin> <class 'set'>
      data = <local> Data{'ref_att_weights_output', [B?,T|'time:var-unk:ref_att_weights_output'[?],'spatial1:var-unk:ref_att_weights_output'[?],F|F'feature:ref_att_weights_output'(1)]}
      data.size_placeholder = <local> {}, len = 0
      data.size_placeholder.keys = <local> <bound method _SizePlaceholderProxy.keys of {}>
AssertionError

@albertz (Member) commented Oct 14, 2022

My suggested changes were only for the function test_Data_get_common_data_extra_static_spatial. You are now talking about a different function. You should not change it there.

@Dark-Rock (Author) commented:

If I do not change it there, then the newly implemented exception is raised when running the tests:

RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14)

@albertz (Member) commented Oct 15, 2022 via email

@Dark-Rock (Author) commented:

Executing: test_Data_get_common_data_extra2_static_spatial
EXCEPTION
Traceback (most recent call last):
  File "/Users/dark-rock/returnn/tests/test_TFUtil.py", line 3941, in <module>
    line: v()
    locals:
      v = <local> <function test_Data_get_common_data_extra2_static_spatial at 0x1530c1ea0>
  File "/Users/dark-rock/returnn/tests/test_TFUtil.py", line 696, in test_Data_get_common_data_extra2_static_spatial
    line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
    locals:
      d2 = <local> Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),'spatial2:static32:r'(32),F|F'feature:r'(128)]}
      d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),'spatial2:static32:r'(32),F|F'feature:r'(128)]}>
      declare_same_as = <not found>
      d1 = <local> Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),'spatial2:static32:t'(32),F|F'feature:t'(128)]}
      d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),'spatial2:static32:t'(32),F|F'feature:t'(128)]}>
  File "/Users/dark-rock/returnn/returnn/tf/util/data.py", line 987, in Dim.declare_same_as
    line: BehaviorVersion.require(False,"Dim tags are same with different size placeholders: %r vs %r please check external_data" % (self.dyn_size, other_same_base.dyn_size),14)
    locals:
      BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
      BehaviorVersion.require = <global> <bound method BehaviorVersion.require of <class 'returnn.util.basic.BehaviorVersion'>>
      self = <local> Dim{'time:var:extern_data:t'[B?]}
      self.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32>
      other_same_base = <local> Dim{'time:var:extern_data:t'[B?]}
      other_same_base.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32>
  File "/Users/dark-rock/returnn/returnn/util/basic.py", line 302, in BehaviorVersion.require
    line: raise BehaviorVersion.RequirementNotSatisfied(
            "%s (required since behavior_version >= %i)" % (message, version))
    locals:
      BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
      BehaviorVersion.RequirementNotSatisfied = <global> <class 'returnn.util.basic.BehaviorVersion.RequirementNotSatisfied'>
      message = <local> "Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data", len = 239
      version = <local> 14
RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14)

@albertz (Member) commented Oct 15, 2022

But I was only speaking about test_Data_get_common_data_extra_static_spatial.

@Dark-Rock (Author) commented:

Here's the stacktrace:

EXCEPTION
Traceback (most recent call last):
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3932, in <module>
    line: v()
    locals:
      v = <local> <function test_Data_get_common_data_extra_static_spatial at 0x1688c1630>
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 672, in test_Data_get_common_data_extra_static_spatial
    line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
    locals:
      d2 = <local> Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
      d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}>
      declare_same_as = <not found>
      d1 = <local> Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}
      d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var:extern_data:t'[B?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}>
  File "/Users/darkrock/returnn/returnn/tf/util/data.py", line 987, in Dim.declare_same_as
    line: BehaviorVersion.require(False,"Dim tags are same with different size placeholders: %r vs %r please check external_data" % (self.dyn_size, other_same_base.dyn_size),14)
    locals:
      BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
      BehaviorVersion.require = <global> <bound method BehaviorVersion.require of <class 'returnn.util.basic.BehaviorVersion'>>
      self = <local> Dim{'time:var:extern_data:t'[B?]}
      self.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32>
      other_same_base = <local> Dim{'time:var:extern_data:t'[B?]}
      other_same_base.dyn_size = <local> <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32>
  File "/Users/darkrock/returnn/returnn/util/basic.py", line 302, in BehaviorVersion.require
    line: raise BehaviorVersion.RequirementNotSatisfied(
            "%s (required since behavior_version >= %i)" % (message, version))
    locals:
      BehaviorVersion = <global> <class 'returnn.util.basic.BehaviorVersion'>
      BehaviorVersion.RequirementNotSatisfied = <global> <class 'returnn.util.basic.BehaviorVersion.RequirementNotSatisfied'>
      message = <local> "Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data", len = 239
      version = <local> 14
RequirementNotSatisfied: Dim tags are same with different size placeholders: <tf.Tensor 'extern_data/placeholders/r/r_dim0_size:0' shape=(?,) dtype=int32> vs <tf.Tensor 'extern_data/placeholders/t/t_dim0_size_1:0' shape=(?,) dtype=int32> please check external_data (required since behavior_version >= 14)

@albertz (Member) commented Oct 15, 2022

My suggestion was: in test_Data_get_common_data_extra_static_spatial (and only there), remove both auto_create_placeholders=True. What happens then?

@Dark-Rock (Author) commented:

Here's the stacktrace:

Executing: test_Data_get_common_data_extra_static_spatial
EXCEPTION
Traceback (most recent call last):
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 3933, in <module>
    line: v()
    locals:
      v = <local> <function test_Data_get_common_data_extra_static_spatial at 0x1551c1cf0>
  File "/Users/darkrock/returnn/tests/test_TFUtil.py", line 672, in test_Data_get_common_data_extra_static_spatial
    line: d2.get_size_dim_tag(0).declare_same_as(d1.get_size_dim_tag(0))
    locals:
      d2 = <local> Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
      d2.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}>
      declare_same_as = <not found>
      d1 = <local> Data{'t', [B?,T|'time:var-unk:t'[?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}
      d1.get_size_dim_tag = <local> <bound method Data.get_size_dim_tag of Data{'t', [B?,T|'time:var-unk:t'[?],'spatial1:static32:t'(32),F|F'feature:t'(128)]}>
  File "/Users/darkrock/returnn/returnn/tf/util/data.py", line 5415, in Data.get_size_dim_tag
    line: axis_wo_batch = sorted(self.size_placeholder.keys())[number]
    locals:
      axis_wo_batch = <not found>
      sorted = <builtin> <built-in function sorted>
      self = <local> Data{'r', [B?,T|'time:var-unk:r'[?],'spatial1:static32:r'(32),F|F'feature:r'(128)]}
      self.size_placeholder = <local> {}, len = 0
      self.size_placeholder.keys = <local> <bound method _SizePlaceholderProxy.keys of {}>
      number = <local> 0
IndexError: list index out of range

albertz added a commit that referenced this pull request Oct 15, 2022
Also fixes Dim.is_dynamic. Related: #1139

Also needed for #1143.
@albertz (Member) commented Oct 15, 2022

Sorry, that was an unrelated bug in Dim.get_size_dim_tag. I just fixed this via #1148. Can you rebase or merge to current master, and then try again?

@albertz (Member) commented Oct 19, 2022

I think you did not correctly merge or rebase. GitHub still complains: "This branch cannot be rebased due to conflicts"

albertz added a commit that referenced this pull request Oct 19, 2022
Don't first create a new size placeholder
and then later call declare_same_as.

Esp this is required
when declare_same_as becomes stricter (#1143).
albertz added a commit that referenced this pull request Oct 19, 2022
Don't first create a new size placeholder and then later call declare_same_as.

Esp this is required when declare_same_as becomes stricter (#1143).

Fix wrong batch info:

The dim tag could have an old invalid batch info.
E.g. the global batch_dim when it comes from an old run.

If we really need this, we should validate the dim tag first.

But probably it's better to remove it and clean it up.

Engine reset global batch dim.
@albertz (Member) commented Oct 19, 2022

> Should I also remove the auto_create_placeholders in the data.__init__?

No, this will be a slightly more complicated change. I will probably do that in a separate PR. You can wait, or try it yourself if you like, but this is more difficult.

I now merged #1159 which should fix this.

Can you rebase?

@albertz (Member) commented Oct 19, 2022

So the remaining failing tests are:

  • test_CombineLayer_two_time_dims
  • test_CombineLayer_two_time_dims_first_not_most_generic
  • test_CombineLayer_two_time_dims_first_not_most_generic_with_n_out
  • test_search_multi_choice_hdf_dump
  • test_reclayer_optimize_out_accum_loop_dyn_size
  • test_extra_scatter_nd_search_train

@albertz (Member) commented Oct 19, 2022

Probably those need to be adapted. E.g. in test_CombineLayer_two_time_dims:

    in0 = Data(
      name="in0", shape=(None, None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
    in1 = Data(
      # same time as first in in0
      name="in1", shape=(None, n_dim), auto_create_placeholders=True)
    in2 = Data(
      # same time as in second in in0
      name="in2", shape=(None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
    extern_data.register_data(in0)
    extern_data.register_data(in1)
    extern_data.register_data(in2)
    in1.get_size_dim_tag(0).declare_same_as(in0.get_size_dim_tag(0))
    in2.get_size_dim_tag(0).declare_same_as(in0.get_size_dim_tag(1))

Change this to:

    in0 = Data(
      name="in0", shape=(None, None, n_dim), batch_dim_axis=1, auto_create_placeholders=True)
    in1 = Data(
      # same time as first in in0
      name="in1", dim_tags=[in0.dim_tags[i] for i in (1, 0, 3)], auto_create_placeholders=True)
    in2 = Data(
      # same time as in second in in0
      name="in2", dim_tags=[in0.dim_tags[i] for i in (2, 1, 3)], auto_create_placeholders=True)
    extern_data.register_data(in0)
    extern_data.register_data(in1)
    extern_data.register_data(in2)
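The essence of the rewrite above is that dim identity is established at construction time by reusing the same dim tag objects, instead of creating independent placeholders and merging them afterwards with declare_same_as. Outside RETURNN, the idea can be sketched with toy classes; all names here are illustrative, not RETURNN's API.

```python
class Dim:
    """Toy dim tag: object identity encodes sameness; size None means dynamic."""

    def __init__(self, size, name):
        self.size = size
        self.name = name


class Tensor:
    """Toy tensor template described purely by its dim tags."""

    def __init__(self, name, dim_tags):
        self.name = name
        self.dim_tags = list(dim_tags)


batch = Dim(None, "batch")
time1 = Dim(None, "time1")
time2 = Dim(None, "time2")
feat = Dim(128, "feature")

# Mirrors in0 with shape (None, None, n_dim) and batch_dim_axis=1:
in0 = Tensor("in0", [time1, batch, time2, feat])
# Instead of giving in1/in2 fresh dynamic dims and later calling
# declare_same_as, reuse in0's tags directly, mirroring the
# dim_tags=[in0.dim_tags[i] for i in (1, 0, 3)] pattern above:
in1 = Tensor("in1", [in0.dim_tags[i] for i in (1, 0, 3)])
in2 = Tensor("in2", [in0.dim_tags[i] for i in (2, 1, 3)])

assert in1.dim_tags[1] is time1  # same object, no post-hoc merging needed
assert in2.dim_tags[0] is time2
```

Because the tags are shared objects from the start, there is never a pair of distinct size placeholders to reconcile, which is exactly the situation the new behavior-version check rejects.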

@albertz (Member) commented Oct 19, 2022

In test_CombineLayer_two_time_dims_first_not_most_generic and test_CombineLayer_two_time_dims_first_not_most_generic_with_n_out it should be the same change.

@albertz (Member) commented Oct 20, 2022

I haven't looked much into the other test cases yet, but at first glance I think they are a bit more difficult. It might be that this now even detects some error which we had not noticed before. E.g. in test_search_multi_choice_hdf_dump, we get:

returnn.util.basic.BehaviorVersion.RequirementNotSatisfied: Dim tags are same with different size placeholders (<tf.Tensor '0_t_target/add_1:0' shape=(?,) dtype=int32> vs <tf.Tensor 't_linear/add:0' shape=(?,) dtype=int32>), please check external_data (required since behavior_version >= 15)

The size placeholders really look different. They should not become the same.
This is called via:

  File "/home/runner/work/returnn/returnn/returnn/tf/layers/base.py", line 475, in LayerBase._post_init_output
    line: output.size_placeholder = cls._static_get_target_value(
            target=size_target,
            _target_layers=_target_layers,
            search_choices=_src_common_search_choices,
            network=network, mark_data_key_as_used=network.eval_flag).size_placeholder.copy()
    locals:
      output = <local> Data{'t_linear_output', [B,T|'time:var:extern_data:classes+1'[B]], dtype='int32', sparse_dim=Dim{F't_linear_output:sparse-dim'[?]}}
      output.size_placeholder = <local> {0: <tf.Tensor '0_t_target/add_1:0' shape=(?,) dtype=int32>}, len = 1
      size_target = <local> 'classes0', len = 8

So size_target was specified. But why is output.size_placeholder already set? (Btw, the log output is misleading, because the dim tags are already merged at that point. Maybe we should change that as well.)

@@ -998,12 +998,11 @@ def declare_same_as(self, other):
if self.dyn_size is not None and other_same_base.dyn_size is not None:
albertz (Member):
Can you move this whole block of code (if self.dyn_size is not None and other_same_base.dyn_size is not None: ...) a few lines up, just before the line if self_same_as is not self: ...?

This should make the error message and stacktrace clearer, because then you still see the dim tags separated and not yet merged.

Dark-Rock (Author):
The trace is pretty big, but I think this is the most relevant part:

Finished training in epoch 2.
Run 3: Train results:
{1: EpochData(learningRate=0.01, error={
'dev_error': 0.6000000089406967,
'dev_score': 1.0964271708790534,
'train_error': 0.700000010430813,
'train_score': 1.0993032428518419,
}),
 2: EpochData(learningRate=0.01, error={
'dev_error': 0.6000000089406967,
'dev_score': 1.0963989421188671,
'train_error': 0.6600000098347665,
'train_score': 1.0985237666744554,
})}
Run 3: Forward cv seq 0:
[1 2 0 1 2]
Epoch 1, error key 'dev_error', current value 0.600000 vs prev value 0.600000, equal?
Epoch 1, error key 'dev_score', current value 1.096427 vs prev value 1.096427, equal?
Epoch 1, error key 'train_error', current value 0.700000 vs prev value 0.680000, equal?
EXCEPTION
Traceback (most recent call last):
  File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 4406, in <module>
    line: v()
    locals:
      v = <local> <function test_rec_subnet_auto_optimize at 0x1658a9120>
  File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 2603, in test_rec_subnet_auto_optimize
    line: run(run_idx=3, optimize_move_layers_out=True)
    locals:
      run = <local> <function test_rec_subnet_auto_optimize.<locals>.run at 0x168c6df30>
      run_idx = <not found>
      optimize_move_layers_out = <not found>
  File "/Users/darkrock/returnn/tests/test_TFEngine.py", line 2594, in test_rec_subnet_auto_optimize.<locals>.run
    line: numpy.testing.assert_almost_equal(error_value, prev_error_value, decimal=3)
    locals:
      numpy = <global> <module 'numpy' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/__init__.py'>
      numpy.testing = <global> <module 'numpy.testing' from '/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/testing/__init__.py'>
      numpy.testing.assert_almost_equal = <global> <function assert_almost_equal at 0x165877ac0>
      error_value = <local> 0.700000010430813
      prev_error_value = <local> 0.6800000101327897
      decimal = <not found>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 599, in assert_almost_equal
    line: raise AssertionError(_build_err_msg())
    locals:
      AssertionError = <builtin> <class 'AssertionError'>
      _build_err_msg = <local> <function assert_almost_equal.<locals>._build_err_msg at 0x168c6de10>
AssertionError: 
Arrays are not almost equal to 3 decimals
 ACTUAL: 0.700000010430813
 DESIRED: 0.6800000101327897

albertz (Member):
I don't exactly understand your comment. Did you do this change? I don't see it here.

Dark-Rock (Author):
I did the change, didn't commit it yet. I'll push it so you can see the stacktrace in the pipeline.

albertz (Member):
But I don't see this stacktrace/error you posted here in the pipeline?

@@ -983,6 +983,14 @@ def declare_same_as(self, other):
if self_derived_bases.issubset(other_derived_bases):
# Avoid cycles on derived_from_tag. https://github.com/rwth-i6/returnn/issues/1054
return other.declare_same_as(self)
if self.dyn_size is not None and other_same_base.dyn_size is not None:
albertz (Member):
It's very strange that moving this up now causes different behavior, namely that some tests are now passing which were not passing before.

I now wonder whether the check here is actually not so correct, or maybe just incomplete.

The _maybe_update call below might update self.dyn_size. But self.dyn_size is also only set if self.batch is properly set. In case self.batch is None, which is totally valid, self.dyn_size is likely also not set.

What we actually want to test is: if there is any combination of batch and ctx where both self and other are defined (via the get_for_batch_ctx logic) and their dyn_size differs, then we raise this exception.

I will come up with some code for this.

albertz (Member):
Strange, my first update also did not restore the previous behavior. I am now trying something more.
