
[TF - Fix] Fix imports from tensorflow.python.keras with tf.__version__ >= 2.6.0 #3403

Merged: 9 commits merged on Feb 26, 2022

Conversation

@chongxiaoc (Collaborator) commented Feb 7, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

See the TensorFlow 2.6.0 release notes for the Keras package change:

tf.keras:

Keras has been split into a separate PIP package (keras), and its code has been moved to the GitHub repository keras-team/keras. The API endpoints for tf.keras stay unchanged, but are now backed by the keras PIP package. The existing code in tensorflow/python/keras is a stale copy and will be removed in a future release (2.7). Please remove any imports of tensorflow.python.keras and replace them with the public tf.keras API instead.
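
For illustration, a minimal sketch of the kind of import change this implies (the backend module is just one example of a private path; the exact imports this PR fixes may differ):

    # Before: private path that only exists while TF still ships the stale Keras copy
    # from tensorflow.python.keras import backend as K

    # After: public tf.keras API, backed by the keras PIP package from TF 2.6 onward
    import tensorflow as tf
    K = tf.keras.backend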

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@github-actions (bot) commented Feb 8, 2022

Unit Test Results

766 files (-7)      766 suites (-7)      9h 0m 7s ⏱️ (-6m 52s)
722 tests (±0)      674 ✔️ (-1)      47 💤 (±0)      1 failed (+1)
16 599 runs (-94)      11 802 ✔️ (+26)      4 796 💤 (-121)      1 failed (+1)

For more details on these failures, see this check.

Results for commit 201c0a9. ± Comparison against base commit 046c071.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Feb 8, 2022

Unit Test Results (with flaky tests)

936 files (+61)      936 suites (+61)      9h 48m 48s ⏱️ (-14m 57s)
722 tests (±0)      670 ✔️ (-3)      47 💤 (±0)      5 failed (+3)
19 985 runs (+921)      14 117 ✔️ (+793)      5 858 💤 (+121)      10 failed (+7)

For more details on these failures, see this check.

Results for commit 201c0a9. ± Comparison against base commit 046c071.

♻️ This comment has been updated with latest results.

@chongxiaoc (Collaborator Author) commented Feb 8, 2022

I will check the failed tests locally, for example (test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8):

test_spark_keras.py::SparkKerasTests::test_model_serialization FAILED

@chongxiaoc chongxiaoc marked this pull request as draft February 8, 2022 00:46
@chongxiaoc (Collaborator Author)

I can reproduce the failure locally and fix it by pinning the h5py version as below:

pip install 'h5py==2.10.0' --force-reinstall

SO post: https://stackoverflow.com/questions/53740577/does-any-one-got-attributeerror-str-object-has-no-attribute-decode-whi
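
For context, a minimal sketch of the failure mode described in that post (assumptions: with h5py >= 3.0, HDF5 string attributes come back as str instead of bytes, so an older Keras HDF5 reader that still calls .decode('utf8') raises AttributeError: 'str' object has no attribute 'decode'; the exact code path hit by test_model_serialization may differ):

    import tensorflow as tf

    # Save and reload a trivial model via the HDF5 path; with h5py >= 3.0 and an
    # older Keras HDF5 reader, load_model() raises the decode AttributeError,
    # while pinning h5py==2.10.0 avoids it.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')
    model.save('tiny_model.h5')
    reloaded = tf.keras.models.load_model('tiny_model.h5')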

@chongxiaoc chongxiaoc marked this pull request as ready for review February 18, 2022 22:14
@chongxiaoc (Collaborator Author)

It looks like CI should increase the timeout for SparkTests.test_spark_task_service_abort_command.

GRACEFUL_TERMINATION_TIME_S defaults to 5.0.

=================================== FAILURES ===================================
_______________ SparkTests.test_spark_task_service_abort_command _______________
self = <test_spark.SparkTests testMethod=test_spark_task_service_abort_command>

    def test_spark_task_service_abort_command(self):
        with spark_task_service(index=0) as (service, client, _):
            with tempdir() as d:
                file = os.path.sep.join([d, 'command_executed'])
                sleep = safe_shell_exec.GRACEFUL_TERMINATION_TIME_S * 2 + 2.0
    
                start = time.time()
                client.run_command('set -x; sleep {} && touch {}'.format(sleep, file), {})
                client.abort_command()
                client.wait_for_command_termination(delay=0.1)
                duration = time.time() - start
    
>               self.assertLess(duration, safe_shell_exec.GRACEFUL_TERMINATION_TIME_S + 1.0)
E               AssertionError: 10.33277177810669 not less than 6.0

test_spark.py:1680: AssertionError
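
For reference, the arithmetic behind the failure (assuming the default GRACEFUL_TERMINATION_TIME_S of 5.0): the command sleeps 2 * 5.0 + 2.0 = 12.0s, the abort is asserted to complete within 5.0 + 1.0 = 6.0s, and on this CI run it took about 10.3s, so the graceful-termination window was exceeded by roughly 4.3s.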

@chongxiaoc chongxiaoc force-pushed the test_3157 branch 3 times, most recently from 554e7ab to 0cfdfe3 Compare February 23, 2022 01:59
@chongxiaoc chongxiaoc marked this pull request as draft February 23, 2022 01:59
@chongxiaoc chongxiaoc marked this pull request as ready for review February 23, 2022 17:42
@chongxiaoc (Collaborator Author)

Created issue #3417 to track the known failing tests on CI, though I cannot reproduce them locally.

@EnricoMi (Collaborator)

@chongxiaoc looks good, you are planning to fix the Spark 2.4.x issues in this PR, right?

@chongxiaoc (Collaborator Author)

@chongxiaoc looks good, you are planning to fix the Spark 2.4.x issues in this PR, right?

@EnricoMi
I'm kind of blocked at the moment; I found a Linux instance, but the test on Spark 2.4.8 is not failing there.
Is there a way I can just download the CI Docker image to test?

@EnricoMi (Collaborator)

Run this to build the docker image locally:

docker-compose -f docker-compose.test.yml build test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8

@chongxiaoc (Collaborator Author)

Finally got the reproducer in the docker container:
Full log: https://gist.github.com/chongxiaoc/a43f8e1b7c02728bac6b73f52c60646c

Error stack:

[1]<stderr>:Traceback (most recent call last):
[1]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1]<stderr>:    "__main__", mod_spec)
[1]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[1]<stderr>:    exec(code, run_globals)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 30, in <module>
[1]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 23, in main
[1]<stderr>:    task_exec(driver_addresses, settings, 'HOROVOD_RANK', 'HOROVOD_LOCAL_RANK')
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[1]<stderr>:    fn, args, kwargs = driver_client.code()
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/driver/driver_service.py", line 245, in code
[1]<stderr>:    resp = self._send(CodeRequest())
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 303, in _send
[1]<stderr>:    return self._send_one(addr, req, stream)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[1]<stderr>:    resp = self._wire.read(rfile)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[1]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[1]<stderr>:struct.error: unpack requires a buffer of 4 bytes
[0]<stderr>:Traceback (most recent call last):
[0]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[0]<stderr>:    "__main__", mod_spec)
[0]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[0]<stderr>:    exec(code, run_globals)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 30, in <module>
[0]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 23, in main
[0]<stderr>:    task_exec(driver_addresses, settings, 'HOROVOD_RANK', 'HOROVOD_LOCAL_RANK')
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[0]<stderr>:    fn, args, kwargs = driver_client.code()
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/driver/driver_service.py", line 245, in code
[0]<stderr>:    resp = self._send(CodeRequest())
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 303, in _send
[0]<stderr>:    return self._send_one(addr, req, stream)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[0]<stderr>:    resp = self._wire.read(rfile)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

@chongxiaoc (Collaborator Author)

@EnricoMi @tgaddair
Any thoughts about the unpack error in horovod/runner/common/util/network.py above?

I can reproduce the error in the CI Docker container, but outside the container, using PySpark 2.4.8, the tests didn't fail.

The error is unfamiliar to me and seems related to encoding/decoding between Horovod and Spark 2.

I suggest landing this PR to unblock TF 2.6+ support, since Spark 2 is a minor configuration in our CI. What do you think?

@EnricoMi (Collaborator) commented Feb 24, 2022

[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

That only tells you that the driver failed to read from the workers. The workers all fail with:

  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/dist-packages/dill/_dill.py", line 1226, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty
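
For context on that last error: an "empty" cell is a closure cell whose variable was never assigned, and dill's save_cell reads obj.cell_contents, which is exactly what raises this ValueError. A minimal standalone sketch (illustrative only, not the Horovod code path):

    def outer():
        if False:
            value = None      # never executed, so the cell for `value` stays empty
        def inner():
            return value      # free variable `value` creates a closure cell on inner
        return inner

    cell = outer().__closure__[0]   # the (empty) cell object
    cell.cell_contents              # ValueError: Cell is empty, as in dill's save_cell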

@chongxiaoc (Collaborator Author)

[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

That only tells you that the driver failed to read from the workers. The workers all fail with:

  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/dist-packages/dill/_dill.py", line 1226, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty

So the worker failed while pickling and wiring objects? I'm not sure why it only happens in the Spark 2 config.

  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 131, in handle
    server._wire.write(resp, self.wfile)

@chongxiaoc (Collaborator Author) commented Feb 24, 2022

Okay, I got it fixed in the local Docker CI image (at least for TF 2.6.2).
cloudpickle conflicts with dill, and I have to pin cloudpickle to 1.3.0.
A similar issue was reported here: cloudpipe/cloudpickle#393

(hvd-py3.7) chongxiaoc@chongxiaoc-C02DL7WPMD6R horovod % git diff .
diff --git a/Dockerfile.test.cpu b/Dockerfile.test.cpu
index ffa8854f..bbe616ed 100644
--- a/Dockerfile.test.cpu
+++ b/Dockerfile.test.cpu
@@ -84,6 +84,13 @@ RUN if [[ ${SPARK_PACKAGE} != *"-preview"* ]]; then \
         (cd /spark/python && python setup.py sdist && pip install --no-cache-dir dist/pyspark-*.tar.gz && rm dist/pyspark-*); \
     fi

+# Pin cloudpickle to 1.3.0
+# Dill breaks cloudpickle > 1.3.0 when using Spark 2
+# https://github.com/cloudpipe/cloudpickle/issues/393
+RUN if [[ ${PYSPARK_PACKAGE} == "pyspark==2."* ]]; then \
+        pip install --no-cache-dir cloudpickle==1.3.0; \
+    fi
+
 # Install Ray.
 # Updating to 1.7.0 to pass ray tests
 RUN pip install --no-cache-dir ray==1.7.0
diff --git a/docker-compose.test.yml b/docker-compose.test.yml
index 83121b8a..36e50204 100644
--- a/docker-compose.test.yml
+++ b/docker-compose.test.yml
@@ -104,6 +104,8 @@ services:
       args:
         # Tensorflow 1.15.5 is only available for Python 3.7
         # Python 3.7 is only available on Ubuntu 18.04
+        TENSORFLOW_PACKAGE: tensorflow-cpu==2.6.2
+        KERAS_PACKAGE: keras==2.6.0
         UBUNTU_VERSION: 18.04
         PYTHON_VERSION: 3.7
         PYSPARK_PACKAGE: pyspark==2.4.8
diff --git a/test/integration/test_spark_keras.py b/test/integration/test_spark_keras.py
index 2bfad1c8..8be04c1a 100644
--- a/test/integration/test_spark_keras.py
+++ b/test/integration/test_spark_keras.py
@@ -23,7 +23,6 @@ from distutils.version import LooseVersion

 import mock
 import numpy as np
-import pyspark
 import pytest
 import tensorflow as tf

@@ -79,9 +78,6 @@ class SparkKerasTests(tf.test.TestCase):
         logging.getLogger('py4j.java_gateway').setLevel(logging.INFO)
         warnings.simplefilter('module')

-    @pytest.mark.skipif(LooseVersion(tf.__version__) >= LooseVersion('2.6.0') and
-                        LooseVersion(pyspark.__version__) < LooseVersion('3.0.0'),
-                        reason="https://github.com/horovod/horovod/issues/3417")
     def test_fit_model(self):
         model = create_xor_model()
         optimizer = tf.keras.optimizers.SGD(lr=0.1)
@@ -110,9 +106,6 @@ class SparkKerasTests(tf.test.TestCase):
                 assert len(pred) == 1
                 assert pred.dtype == np.float32

-    @pytest.mark.skipif(LooseVersion(tf.__version__) >= LooseVersion('2.6.0') and
-                        LooseVersion(pyspark.__version__) < LooseVersion('3.0.0'),
-                        reason="https://github.com/horovod/horovod/issues/3417")
     def test_fit_model_multiclass(self):
         model = create_mnist_model()
         optimizer = tf.keras.optimizers.Adadelta(1.0)

@chongxiaoc (Collaborator Author)

It looks like the TF head and Lightning head are breaking the CI, but I think that's unrelated to this PR.

@chongxiaoc (Collaborator Author)

@tgaddair Added pytorch_lightning==1.3.8 to setup.py.

@chongxiaoc (Collaborator Author) commented Feb 25, 2022

TF nightly failures are tracked here: #3422

@EnricoMi (Collaborator) left a review comment

Minor comment, but good!

test/parallel/test_tensorflow.py (review thread resolved)
@JoostvDoorn

@chongxiaoc was pinning pytorch_lightning to 1.3.8 really necessary here? It would be great if the setup were a bit more flexible so that we can use a more recent version of Lightning.

@EnricoMi (Collaborator)

Let's see what works in our CI: #3480

@chongxiaoc (Collaborator Author) commented Mar 21, 2022

@JoostvDoorn When testing this PR, I remember the incompatibility only showed up for the Lightning head, around 1.5.10 and later.
Since the Horovod plugin in Lightning hasn't been updated for a long time, we decided to pin a long-stable version here.
I think Lightning 1.4 should work as well.

@EnricoMi (Collaborator) commented Mar 21, 2022

@chongxiaoc any chance we can get this working for 1.5.10? We have to move beyond 1.5.10 at some point anyway.

@chongxiaoc (Collaborator Author)

@EnricoMi 1.4.0 or 1.5.0 should be good.

@EnricoMi (Collaborator)

@JoostvDoorn Horovod works fine with pytorch_lightning up to 1.5.9, see #3480. In what way does the 1.3.8 pin affect you? Are you referring to the released images on Docker Hub?

@JoostvDoorn

@EnricoMi I was installing Horovod through PyPI, and this downgrades pytorch_lightning because of the line pytorch_require_list = ['torch', 'pytorch_lightning==1.3.8']. I can install Horovod in develop mode to avoid this restriction, but it seems unusual to take such strict measures. It may be better to update the requirements list to ['torch', 'pytorch_lightning>=1.3.8,!=1.5.10']. On the PyTorch Lightning side, I suspect the issues will be fixed in the next version; there are some open PRs fixing them now.

@EnricoMi (Collaborator)

@JoostvDoorn I see, what about this: 58bc4bb

I suspect pytorch_lightning>1.5.10 won't work either.
