Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make checkpoint name optional so that user can save to h5 format. #3411

Merged
merged 2 commits into from Mar 1, 2022

Conversation

irasit
Copy link
Collaborator

@irasit irasit commented Feb 18, 2022

Signed-off-by: Peng Zhang pengz@uber.com

making the checkpoint file name optional to enable user to save model checkpoint in .h5 format.

@chongxiaoc
Copy link
Collaborator

Hmmm, it looks like unit test failed due to this change:

[test-cpu-gloo-py3_8-tf2_5_2-keras2_5_0rc0-torch1_8_1-mxnet1_7_0_p2-pyspark3_2_0)]

test_spark_keras.py:192: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/estimator.py:35: in fit
    return super(HorovodEstimator, self).fit(df, params)
/usr/local/lib/python3.8/dist-packages/pyspark/ml/base.py:161: in fit
    return self._fit(dataset)
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/estimator.py:80: in _fit
    return self._fit_on_prepared_data(
/usr/local/lib/python3.8/dist-packages/horovod/spark/keras/estimator.py:274: in _fit_on_prepared_data
    serialized_model = self._load_model_from_checkpoint(run_id)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1100: in __call__
    return _mock_self._mock_call(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1104: in _mock_call
    return _mock_self._execute_mock_call(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1[167](https://github.com/horovod/horovod/runs/5241268254?check_suite_focus=true#step:160:167): in _execute_mock_call
    result = effect(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/horovod/spark/keras/estimator.py:294: in _load_model_from_checkpoint
    return store.read_serialized_keras_model(
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/store.py:208: in read_serialized_keras_model
    model = keras.models.load_model(ckpt_path)
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/save.py:[206](https://github.com/horovod/horovod/runs/5241268254?check_suite_focus=true#step:160:206): in load_model
    return saved_model_load.load(filepath, compile, options)
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/saved_model/load.py:121: in load
    meta_graph_def = loader_impl.parse_saved_model(path).meta_graphs[0]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

export_dir = 'file:///tmp/tmpv38_6txp/runs/run01/checkpoint.h5'

    def parse_saved_model(export_dir):
      """Reads the savedmodel.pb or savedmodel.pbtxt file containing `SavedModel`.
    
      Args:
        export_dir: String or Pathlike, path to the directory containing the
        SavedModel file.
    
      Returns:
        A `SavedModel` protocol buffer.
    
      Raises:
        IOError: If the file does not exist, or cannot be successfully parsed.
      """
      # Build the path to the SavedModel in pbtxt format.
      path_to_pbtxt = os.path.join(
          compat.as_bytes(compat.path_to_str(export_dir)),
          compat.as_bytes(constants.SAVED_MODEL_FILENAME_PBTXT))
      # Build the path to the SavedModel in pb format.
      path_to_pb = os.path.join(
          compat.as_bytes(compat.path_to_str(export_dir)),
          compat.as_bytes(constants.SAVED_MODEL_FILENAME_PB))
    
      # Parse the SavedModel protocol buffer.
      saved_model = saved_model_pb2.SavedModel()
      if file_io.file_exists(path_to_pb):
        with file_io.FileIO(path_to_pb, "rb") as f:
          file_content = f.read()
        try:
          saved_model.ParseFromString(file_content)
          return saved_model
        except message.DecodeError as e:
          raise IOError("Cannot parse file %s: %s." % (path_to_pb, str(e)))
      elif file_io.file_exists(path_to_pbtxt):
        with file_io.FileIO(path_to_pbtxt, "rb") as f:
          file_content = f.read()
        try:
          text_format.Merge(file_content.decode("utf-8"), saved_model)
          return saved_model
        except text_format.ParseError as e:
          raise IOError("Cannot parse file %s: %s." % (path_to_pbtxt, str(e)))
      else:
>       raise IOError(
            "SavedModel file does not exist at: %s%s{%s|%s}" %
            (export_dir, os.path.sep, constants.SAVED_MODEL_FILENAME_PBTXT,
             constants.SAVED_MODEL_FILENAME_PB))
E       OSError: SavedModel file does not exist at: file:///tmp/tmpv38_6txp/runs/run01/checkpoint.h5/{saved_model.pbtxt|saved_model.pb}

@github-actions
Copy link

github-actions bot commented Feb 18, 2022

Unit Test Results

     830 files  +  35       830 suites  +35   9h 13m 45s ⏱️ - 12m 21s
     722 tests ±    0       679 ✔️ +    1       43 💤 ±    0  0  - 1 
18 037 runs  +758  12 787 ✔️ +469  5 250 💤 +290  0  - 1 

Results for commit b0cc0fa. ± Comparison against base commit 71e10b4.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Feb 18, 2022

Unit Test Results (with flaky tests)

     920 files   -   55       920 suites   - 55   9h 34m 44s ⏱️ - 30m 19s
     722 tests ±    0       678 ✔️ ±    0       43 💤 ±    0  1 ±0 
19 990 runs   - 526  13 994 ✔️  - 217  5 995 💤  - 309  1 ±0 

For more details on these failures, see this check.

Results for commit b0cc0fa. ± Comparison against base commit 71e10b4.

♻️ This comment has been updated with latest results.

@irasit
Copy link
Collaborator Author

irasit commented Feb 18, 2022

Hmmm, it looks like unit test failed due to this change:

[test-cpu-gloo-py3_8-tf2_5_2-keras2_5_0rc0-torch1_8_1-mxnet1_7_0_p2-pyspark3_2_0)]

test_spark_keras.py:192: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/estimator.py:35: in fit
    return super(HorovodEstimator, self).fit(df, params)
/usr/local/lib/python3.8/dist-packages/pyspark/ml/base.py:161: in fit
    return self._fit(dataset)
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/estimator.py:80: in _fit
    return self._fit_on_prepared_data(
/usr/local/lib/python3.8/dist-packages/horovod/spark/keras/estimator.py:274: in _fit_on_prepared_data
    serialized_model = self._load_model_from_checkpoint(run_id)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1100: in __call__
    return _mock_self._mock_call(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1104: in _mock_call
    return _mock_self._execute_mock_call(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/mock/mock.py:1[167](https://github.com/horovod/horovod/runs/5241268254?check_suite_focus=true#step:160:167): in _execute_mock_call
    result = effect(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/horovod/spark/keras/estimator.py:294: in _load_model_from_checkpoint
    return store.read_serialized_keras_model(
/usr/local/lib/python3.8/dist-packages/horovod/spark/common/store.py:208: in read_serialized_keras_model
    model = keras.models.load_model(ckpt_path)
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/save.py:[206](https://github.com/horovod/horovod/runs/5241268254?check_suite_focus=true#step:160:206): in load_model
    return saved_model_load.load(filepath, compile, options)
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/saving/saved_model/load.py:121: in load
    meta_graph_def = loader_impl.parse_saved_model(path).meta_graphs[0]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

export_dir = 'file:///tmp/tmpv38_6txp/runs/run01/checkpoint.h5'

    def parse_saved_model(export_dir):
      """Reads the savedmodel.pb or savedmodel.pbtxt file containing `SavedModel`.
    
      Args:
        export_dir: String or Pathlike, path to the directory containing the
        SavedModel file.
    
      Returns:
        A `SavedModel` protocol buffer.
    
      Raises:
        IOError: If the file does not exist, or cannot be successfully parsed.
      """
      # Build the path to the SavedModel in pbtxt format.
      path_to_pbtxt = os.path.join(
          compat.as_bytes(compat.path_to_str(export_dir)),
          compat.as_bytes(constants.SAVED_MODEL_FILENAME_PBTXT))
      # Build the path to the SavedModel in pb format.
      path_to_pb = os.path.join(
          compat.as_bytes(compat.path_to_str(export_dir)),
          compat.as_bytes(constants.SAVED_MODEL_FILENAME_PB))
    
      # Parse the SavedModel protocol buffer.
      saved_model = saved_model_pb2.SavedModel()
      if file_io.file_exists(path_to_pb):
        with file_io.FileIO(path_to_pb, "rb") as f:
          file_content = f.read()
        try:
          saved_model.ParseFromString(file_content)
          return saved_model
        except message.DecodeError as e:
          raise IOError("Cannot parse file %s: %s." % (path_to_pb, str(e)))
      elif file_io.file_exists(path_to_pbtxt):
        with file_io.FileIO(path_to_pbtxt, "rb") as f:
          file_content = f.read()
        try:
          text_format.Merge(file_content.decode("utf-8"), saved_model)
          return saved_model
        except text_format.ParseError as e:
          raise IOError("Cannot parse file %s: %s." % (path_to_pbtxt, str(e)))
      else:
>       raise IOError(
            "SavedModel file does not exist at: %s%s{%s|%s}" %
            (export_dir, os.path.sep, constants.SAVED_MODEL_FILENAME_PBTXT,
             constants.SAVED_MODEL_FILENAME_PB))
E       OSError: SavedModel file does not exist at: file:///tmp/tmpv38_6txp/runs/run01/checkpoint.h5/{saved_model.pbtxt|saved_model.pb}

Seems the docker do not have h5py installed, which makes the keras.load_model think it is a path.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/saving/save.py#L206

@irasit
Copy link
Collaborator Author

irasit commented Feb 18, 2022

Seems h5py is installed by default, base on the if condition
if (h5py is not None and (isinstance(filepath, h5py.File) or h5py.is_hdf5(filepath))):
the root cause may be that the file is not saved correctly into h5 format.

Signed-off-by: Peng Zhang <pengz@uber.com>
Signed-off-by: Peng Zhang <pengz@uber.com>
@irasit irasit changed the title change checkpoint name to h5 Make checkpoint name optional so that user can shave to h5 format. Feb 28, 2022
@irasit irasit changed the title Make checkpoint name optional so that user can shave to h5 format. Make checkpoint name optional so that user can save to h5 format. Feb 28, 2022
@irasit irasit requested a review from tgaddair March 1, 2022 21:28
@chongxiaoc chongxiaoc merged commit 7bf9b04 into master Mar 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants