
[TF - Fix] Fix imports from tensorflow.python.keras with tf.__version__ >= 2.6.0 #3403

Merged: 9 commits merged on Feb 26, 2022

Conversation

@chongxiaoc (Collaborator) commented Feb 7, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

See the TensorFlow 2.6.0 release notes for the Keras package change:

tf.keras:

Keras has been split into a separate PIP package (keras), and its code has been moved to the GitHub repository keras-team/keras. The API endpoints for tf.keras stay unchanged, but are now backed by the keras PIP package. The existing code in tensorflow/python/keras is a stale copy and will be removed in a future release (2.7). Please remove any imports of tensorflow.python.keras and replace them with the public tf.keras API instead.
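
For illustration, a minimal sketch of the kind of import change this implies (the backend module is just one example of a private path; the exact imports this PR fixes may differ):

    # Before: private path that only exists while TF still ships the stale Keras copy
    # from tensorflow.python.keras import backend as K

    # After: public tf.keras API, backed by the keras PIP package from TF 2.6 onward
    import tensorflow as tf
    K = tf.keras.backend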

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@github-actions (bot) commented Feb 8, 2022

Unit Test Results

766 files (-7)      766 suites (-7)      9h 0m 7s ⏱️ (-6m 52s)
722 tests (±0)      674 ✔️ (-1)      47 💤 (±0)      1 failed (+1)
16 599 runs (-94)      11 802 ✔️ (+26)      4 796 💤 (-121)      1 failed (+1)

For more details on these failures, see this check.

Results for commit 201c0a9. ± Comparison against base commit 046c071.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Feb 8, 2022

Unit Test Results (with flaky tests)

936 files (+61)      936 suites (+61)      9h 48m 48s ⏱️ (-14m 57s)
722 tests (±0)      670 ✔️ (-3)      47 💤 (±0)      5 failed (+3)
19 985 runs (+921)      14 117 ✔️ (+793)      5 858 💤 (+121)      10 failed (+7)

For more details on these failures, see this check.

Results for commit 201c0a9. ± Comparison against base commit 046c071.

♻️ This comment has been updated with latest results.

@chongxiaoc (Collaborator Author) commented Feb 8, 2022

I will check the failed tests locally, for example (test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8):

test_spark_keras.py::SparkKerasTests::test_model_serialization FAILED

@chongxiaoc chongxiaoc marked this pull request as draft February 8, 2022 00:46
@chongxiaoc (Collaborator Author)

I can reproduce the failure locally and fix it by pinning the h5py version as below:

pip install 'h5py==2.10.0' --force-reinstall

SO post: https://stackoverflow.com/questions/53740577/does-any-one-got-attributeerror-str-object-has-no-attribute-decode-whi
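
For context, a minimal sketch of the failure mode described in that post (assumptions: with h5py >= 3.0, HDF5 string attributes come back as str instead of bytes, so an older Keras HDF5 reader that still calls .decode('utf8') raises AttributeError: 'str' object has no attribute 'decode'; the exact code path hit by test_model_serialization may differ):

    import tensorflow as tf

    # Save and reload a trivial model via the HDF5 path; with h5py >= 3.0 and an
    # older Keras HDF5 reader, load_model() raises the decode AttributeError,
    # while pinning h5py==2.10.0 avoids it.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='sgd', loss='mse')
    model.save('tiny_model.h5')
    reloaded = tf.keras.models.load_model('tiny_model.h5')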

@chongxiaoc chongxiaoc marked this pull request as ready for review February 18, 2022 22:14
@chongxiaoc (Collaborator Author)

It looks like CI should increase the timeout for SparkTests.test_spark_task_service_abort_command.

GRACEFUL_TERMINATION_TIME_S defaults to 5.0.

=================================== FAILURES ===================================
_______________ SparkTests.test_spark_task_service_abort_command _______________
self = <test_spark.SparkTests testMethod=test_spark_task_service_abort_command>

    def test_spark_task_service_abort_command(self):
        with spark_task_service(index=0) as (service, client, _):
            with tempdir() as d:
                file = os.path.sep.join([d, 'command_executed'])
                sleep = safe_shell_exec.GRACEFUL_TERMINATION_TIME_S * 2 + 2.0
    
                start = time.time()
                client.run_command('set -x; sleep {} && touch {}'.format(sleep, file), {})
                client.abort_command()
                client.wait_for_command_termination(delay=0.1)
                duration = time.time() - start
    
>               self.assertLess(duration, safe_shell_exec.GRACEFUL_TERMINATION_TIME_S + 1.0)
E               AssertionError: 10.33277177810669 not less than 6.0

test_spark.py:1680: AssertionError
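
For reference, the arithmetic behind the failure (assuming the default GRACEFUL_TERMINATION_TIME_S of 5.0): the command sleeps 2 * 5.0 + 2.0 = 12.0s, the abort is asserted to complete within 5.0 + 1.0 = 6.0s, and on this CI run it took about 10.3s, so the graceful-termination window was exceeded by roughly 4.3s.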

@chongxiaoc chongxiaoc force-pushed the test_3157 branch 3 times, most recently from 554e7ab to 0cfdfe3 Compare February 23, 2022 01:59
@chongxiaoc chongxiaoc marked this pull request as draft February 23, 2022 01:59
@chongxiaoc chongxiaoc marked this pull request as ready for review February 23, 2022 17:42
@chongxiaoc (Collaborator Author)

Created issue #3417 to track the known failing tests on CI, though I cannot reproduce them locally.

@EnricoMi (Collaborator)

@chongxiaoc looks good, you are planning to fix the Spark 2.4.x issues in this PR, right?

@chongxiaoc (Collaborator Author)

@chongxiaoc looks good, you are planning to fix the Spark 2.4.x issues in this PR, right?

@EnricoMi
I'm kind of blocked at the moment; I found a Linux instance, but the test on Spark 2.4.8 is not failing there.
Is there a way I can just download the CI Docker image to test?

@EnricoMi (Collaborator)

Run this to build the docker image locally:

docker-compose -f docker-compose.test.yml build test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8

@chongxiaoc (Collaborator Author)

Finally got the reproducer in the docker container:
Full log: https://gist.github.com/chongxiaoc/a43f8e1b7c02728bac6b73f52c60646c

Error stack:

[1]<stderr>:Traceback (most recent call last):
[1]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[1]<stderr>:    "__main__", mod_spec)
[1]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[1]<stderr>:    exec(code, run_globals)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 30, in <module>
[1]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 23, in main
[1]<stderr>:    task_exec(driver_addresses, settings, 'HOROVOD_RANK', 'HOROVOD_LOCAL_RANK')
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[1]<stderr>:    fn, args, kwargs = driver_client.code()
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/driver/driver_service.py", line 245, in code
[1]<stderr>:    resp = self._send(CodeRequest())
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 303, in _send
[1]<stderr>:    return self._send_one(addr, req, stream)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[1]<stderr>:    resp = self._wire.read(rfile)
[1]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[1]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[1]<stderr>:struct.error: unpack requires a buffer of 4 bytes
[0]<stderr>:Traceback (most recent call last):
[0]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[0]<stderr>:    "__main__", mod_spec)
[0]<stderr>:  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
[0]<stderr>:    exec(code, run_globals)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 30, in <module>
[0]<stderr>:    main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/gloo_exec_fn.py", line 23, in main
[0]<stderr>:    task_exec(driver_addresses, settings, 'HOROVOD_RANK', 'HOROVOD_LOCAL_RANK')
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/task/__init__.py", line 60, in task_exec
[0]<stderr>:    fn, args, kwargs = driver_client.code()
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/spark/driver/driver_service.py", line 245, in code
[0]<stderr>:    resp = self._send(CodeRequest())
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 303, in _send
[0]<stderr>:    return self._send_one(addr, req, stream)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 279, in _send_one
[0]<stderr>:    resp = self._wire.read(rfile)
[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

@chongxiaoc (Collaborator Author)

@EnricoMi @tgaddair
Any thoughts about the unpack error in horovod/runner/common/util/network.py above?

I can reproduce the error in the CI Docker container, but outside the container, using PySpark 2.4.8, the tests didn't fail.

The error is unfamiliar to me and seems related to encoding/decoding between Horovod and Spark 2.

I suggest landing this PR to unblock TF 2.6+ support, since Spark 2 is a minor configuration in our CI. What do you think?

@EnricoMi (Collaborator) commented Feb 24, 2022

[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

That only tells you that the driver failed to read from the workers. The workers all fail with:

  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/dist-packages/dill/_dill.py", line 1226, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty
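
For context on that last error: an "empty" cell is a closure cell whose variable was never assigned, and dill's save_cell reads obj.cell_contents, which is exactly what raises this ValueError. A minimal standalone sketch (illustrative only, not the Horovod code path):

    def outer():
        if False:
            value = None      # never executed, so the cell for `value` stays empty
        def inner():
            return value      # free variable `value` creates a closure cell on inner
        return inner

    cell = outer().__closure__[0]   # the (empty) cell object
    cell.cell_contents              # ValueError: Cell is empty, as in dill's save_cell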

@chongxiaoc (Collaborator Author)

[0]<stderr>:  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 95, in read
[0]<stderr>:    message_len = struct.unpack('i', rfile.read(4))[0]
[0]<stderr>:struct.error: unpack requires a buffer of 4 bytes

That only tells you that the driver failed to read from the workers. The workers all fail with:

  File "/usr/lib/python3.7/pickle.py", line 504, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/lib/python3.7/dist-packages/dill/_dill.py", line 1226, in save_cell
    f = obj.cell_contents
ValueError: Cell is empty

So the worker failed while pickling and wiring objects? I'm not sure why it only happens in the Spark 2 config.

  File "/usr/local/lib/python3.7/dist-packages/horovod/runner/common/util/network.py", line 131, in handle
    server._wire.write(resp, self.wfile)

@chongxiaoc (Collaborator Author) commented Feb 24, 2022

Okay, I got it fixed in the local Docker CI image (at least for TF 2.6.2).
cloudpickle conflicts with dill, and I have to pin cloudpickle to 1.3.0.
A similar issue was reported here: cloudpipe/cloudpickle#393

(hvd-py3.7) chongxiaoc@chongxiaoc-C02DL7WPMD6R horovod % git diff .
diff --git a/Dockerfile.test.cpu b/Dockerfile.test.cpu
index ffa8854f..bbe616ed 100644
--- a/Dockerfile.test.cpu
+++ b/Dockerfile.test.cpu
@@ -84,6 +84,13 @@ RUN if [[ ${SPARK_PACKAGE} != *"-preview"* ]]; then \
         (cd /spark/python && python setup.py sdist && pip install --no-cache-dir dist/pyspark-*.tar.gz && rm dist/pyspark-*); \
     fi

+# Pin cloudpickle to 1.3.0
+# Dill breaks cloudpickle > 1.3.0 when using Spark 2
+# https://github.com/cloudpipe/cloudpickle/issues/393
+RUN if [[ ${PYSPARK_PACKAGE} == "pyspark==2."* ]]; then \
+        pip install --no-cache-dir cloudpickle==1.3.0; \
+    fi
+
 # Install Ray.
 # Updating to 1.7.0 to pass ray tests
 RUN pip install --no-cache-dir ray==1.7.0
diff --git a/docker-compose.test.yml b/docker-compose.test.yml
index 83121b8a..36e50204 100644
--- a/docker-compose.test.yml
+++ b/docker-compose.test.yml
@@ -104,6 +104,8 @@ services:
       args:
         # Tensorflow 1.15.5 is only available for Python 3.7
         # Python 3.7 is only available on Ubuntu 18.04
+        TENSORFLOW_PACKAGE: tensorflow-cpu==2.6.2
+        KERAS_PACKAGE: keras==2.6.0
         UBUNTU_VERSION: 18.04
         PYTHON_VERSION: 3.7
         PYSPARK_PACKAGE: pyspark==2.4.8
diff --git a/test/integration/test_spark_keras.py b/test/integration/test_spark_keras.py
index 2bfad1c8..8be04c1a 100644
--- a/test/integration/test_spark_keras.py
+++ b/test/integration/test_spark_keras.py
@@ -23,7 +23,6 @@ from distutils.version import LooseVersion

 import mock
 import numpy as np
-import pyspark
 import pytest
 import tensorflow as tf

@@ -79,9 +78,6 @@ class SparkKerasTests(tf.test.TestCase):
         logging.getLogger('py4j.java_gateway').setLevel(logging.INFO)
         warnings.simplefilter('module')

-    @pytest.mark.skipif(LooseVersion(tf.__version__) >= LooseVersion('2.6.0') and
-                        LooseVersion(pyspark.__version__) < LooseVersion('3.0.0'),
-                        reason="https://github.com/horovod/horovod/issues/3417")
     def test_fit_model(self):
         model = create_xor_model()
         optimizer = tf.keras.optimizers.SGD(lr=0.1)
@@ -110,9 +106,6 @@ class SparkKerasTests(tf.test.TestCase):
                 assert len(pred) == 1
                 assert pred.dtype == np.float32

-    @pytest.mark.skipif(LooseVersion(tf.__version__) >= LooseVersion('2.6.0') and
-                        LooseVersion(pyspark.__version__) < LooseVersion('3.0.0'),
-                        reason="https://github.com/horovod/horovod/issues/3417")
     def test_fit_model_multiclass(self):
         model = create_mnist_model()
         optimizer = tf.keras.optimizers.Adadelta(1.0)

@chongxiaoc (Collaborator Author)

It looks like the TF head and Lightning head are breaking the CI, but I think that's unrelated to this PR.

@chongxiaoc (Collaborator Author)

@tgaddair Added pytorch_lightning==1.3.8 to setup.py.

@chongxiaoc (Collaborator Author) commented Feb 25, 2022

TF nightly failures are tracked here: #3422

@EnricoMi (Collaborator) left a review comment

Minor comment, but good!

test/parallel/test_tensorflow.py (review thread resolved)
@JoostvDoorn

@chongxiaoc was pinning pytorch_lightning to 1.3.8 really necessary here? It would be great if the setup were a bit more flexible so that we can use a more recent version of Lightning.

@EnricoMi (Collaborator)

Let's see what works in our CI: #3480

@chongxiaoc (Collaborator Author) commented Mar 21, 2022

@JoostvDoorn When testing this PR, I remember the incompatibility only showed up for the Lightning head, around 1.5.10 and later.
Since the Horovod plugin in Lightning hasn't been updated for a long time, we decided to pin a long-stable version here.
I think Lightning 1.4 should work as well.

@EnricoMi (Collaborator) commented Mar 21, 2022

@chongxiaoc any chance we can get this working for 1.5.10? We have to move beyond 1.5.10 at some point anyway.

@chongxiaoc (Collaborator Author)

@EnricoMi 1.4.0 or 1.5.0 should be good.

@EnricoMi (Collaborator)

@JoostvDoorn Horovod works fine with pytorch_lightning up to 1.5.9, see #3480. In what way does the 1.3.8 pin affect you? Are you referring to the released images on Docker Hub?

@JoostvDoorn

@EnricoMi I was installing Horovod through PyPI, and this downgrades pytorch_lightning because of the line pytorch_require_list = ['torch', 'pytorch_lightning==1.3.8']. I can install Horovod in develop mode to avoid this restriction, but it seems unusual to take such strict measures. It may be better to update the requirements list to ['torch', 'pytorch_lightning>=1.3.8,!=1.5.10']. On the PyTorch Lightning side, I suspect the issues will be fixed in the next version; there are some open PRs fixing them now.

@EnricoMi (Collaborator)

@JoostvDoorn I see, what about this: 58bc4bb

I suspect pytorch_lightning>1.5.10 won't work either.
