Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes the checkpoint directory structure for pytorch and pytorch lightning #3362

Merged
merged 4 commits into from Jan 21, 2022

Conversation

kamalsharma2
Copy link
Contributor

@kamalsharma2 kamalsharma2 commented Jan 13, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3268. Creating a copy() method to replace put() while syncing from local temporary directory to run path. put() is inconsistent with cp and currently has no mechanism to paste the contents from source directory to target without pasting the source directory itself if the target directory is already present, this case is unwanted for horovod and replacing this creates consistency for checkpoints path for both pytorch and pytorch lightning.

Signed-off-by: Kamal Sharma <kamalbhardwaj020@gmail.com>
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor formatting, can't say much about the logic

horovod/spark/common/store.py Outdated Show resolved Hide resolved
horovod/spark/common/store.py Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Jan 13, 2022

Unit Test Results

     802 files  ±0       802 suites  ±0   9h 0m 31s ⏱️ + 16m 5s
     717 tests ±0       674 ✔️ +  2       43 💤  -   2  0 ±0 
17 324 runs  ±0  12 256 ✔️ +18  5 068 💤  - 18  0 ±0 

Results for commit 06f1c4a. ± Comparison against base commit b96ecae.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jan 13, 2022

Unit Test Results (with flaky tests)

     901 files  +  11       901 suites  +11   9h 46m 46s ⏱️ + 30m 59s
     717 tests ±    0       671 ✔️ ±    0       43 💤  - 2  2 +1  1 🔥 +1 
19 601 runs  +171  13 741 ✔️ +177  5 857 💤  - 8  2 +1  1 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit 06f1c4a. ± Comparison against base commit b96ecae.

♻️ This comment has been updated with latest results.

Signed-off-by: Kamal Sharma <kamalbhardwaj020@gmail.com>
Signed-off-by: Kamal Sharma <kamalbhardwaj020@gmail.com>
@kamalsharma2
Copy link
Contributor Author

@EnricoMi resolved the comments, please take a look

@EnricoMi
Copy link
Collaborator

@tgaddair @irasit what do you think?

Copy link
Collaborator

@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a question.

Signed-off-by: Kamal Sharma <kamalbhardwaj020@gmail.com>
Copy link
Collaborator

@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chongxiaoc chongxiaoc merged commit a89efa9 into horovod:master Jan 21, 2022
@zyluo
Copy link

zyluo commented Mar 5, 2022

fsspec version requirement introduced from this commit

#3450

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Unable to load most recent checkpoint for Pytorch and Pytorch lightning Estimator
5 participants