Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.12.0] Fix non-reentrant hooks based checkpointing #79490

Merged
merged 3 commits into from Jun 17, 2022

Conversation

rohan-varma
Copy link
Member

Link to landed master PR: #78752

Original commit description:

Fixes the non-reentrant hooks based checkpointing to actually save memory. The issue was that storage was a list of autograd saved tensors and we weren't clearing this list out as tensors were accessed, so all activations would remain in memory. Now at the end of the layer's backwards pass, activations will be discarded as expected.

Adding unittests to ensure:

  • Memory savings for a basic model compared to no checkpointing
  • Same or better memory savings when compared with the reentrant autograd based hooks checkpointing

Also, this means we can enable non-reentrant based checkpointing in CheckpointWrapper, will also add unittests for that.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jun 14, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit ad2e8ff (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 14, 2022
@rohan-varma rohan-varma added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Jun 14, 2022
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR targeting release branch does not seem to match one for master.
@rohan-varma can you please explain the differences?

@rohan-varma
Copy link
Member Author

Discussing offline with @malfet

@rohan-varma
Copy link
Member Author

Only difference is change in set of tests in test_checkpoint_wrapper, as that file contained tests for features merged after the branch cut. @malfet

@rohan-varma rohan-varma requested a review from malfet June 16, 2022 03:09
@malfet malfet merged commit 681a6e3 into release/1.12 Jun 17, 2022
@malfet malfet deleted the cherry_pick_checkpoint branch June 20, 2022 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants