[v1.12.0] Fix non-reentrant hooks based checkpointing #79490

rohan-varma · 2022-06-14T01:20:26Z

Link to landed master PR: #78752

Original commit description:

Fixes non-reentrant, hooks-based checkpointing so that it actually saves memory. The issue was that storage was a list of autograd saved tensors, and this list was never cleared as tensors were accessed, so all activations remained in memory. Now, at the end of a layer's backward pass, its activations are discarded as expected.
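The retention problem described above can be illustrated with a toy model in plain Python. This is only a sketch, not the PyTorch implementation: `Activation`, `LeakyStorage`, `ReleasingStorage`, and `count_alive_after_reads` are hypothetical names, and weak references stand in for "is this activation still in memory?".

```python
import gc
import weakref


class Activation:
    """Hypothetical stand-in for a saved activation tensor."""

    def __init__(self, name):
        self.name = name


class LeakyStorage:
    """Mimics the bug: entries stay in the list after they are read."""

    def __init__(self):
        self._items = []

    def put(self, act):
        self._items.append(act)
        return len(self._items) - 1  # handle used to look the entry up later

    def get(self, handle):
        return self._items[handle]  # the list still holds a reference


class ReleasingStorage:
    """Mimics the fix: an entry is dropped as soon as it is read."""

    def __init__(self):
        self._items = {}
        self._next = 0

    def put(self, act):
        handle = self._next
        self._items[handle] = act
        self._next += 1
        return handle

    def get(self, handle):
        return self._items.pop(handle)  # storage gives up its reference


def count_alive_after_reads(storage, n=4):
    """Store n activations, read each once, and count how many are still alive."""
    refs, handles = [], []
    for i in range(n):
        act = Activation(f"act{i}")
        refs.append(weakref.ref(act))
        handles.append(storage.put(act))
    del act  # drop the loop variable's reference to the last activation
    for handle in reversed(handles):  # backward visits activations in reverse
        storage.get(handle)  # result is used and then discarded
    gc.collect()
    return sum(ref() is not None for ref in refs)


leaky_alive = count_alive_after_reads(LeakyStorage())
releasing_alive = count_alive_after_reads(ReleasingStorage())
```

With the list-based storage every activation is still reachable after all reads, while the storage that pops on read keeps none alive.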

Adding unit tests to ensure:

- Memory savings for a basic model compared to no checkpointing
- Same or better memory savings compared with the reentrant autograd-based hooks checkpointing
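A comparison along these lines might be sketched as follows. This is not the PR's test code: the `run` helper, the mode names, and the small model are assumptions made for illustration. It relies on `torch.utils.checkpoint.checkpoint` with its `use_reentrant` flag, and peak memory is only recorded when CUDA is available, so on CPU the sketch checks gradient agreement only.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


def run(model, x, mode):
    """Run forward and backward once; return (input grad, peak CUDA bytes or None)."""
    x = x.clone().requires_grad_(True)
    model.zero_grad(set_to_none=True)
    on_cuda = x.is_cuda
    if on_cuda:
        torch.cuda.reset_peak_memory_stats()
    if mode == "none":
        out = model(x)  # no checkpointing: all activations are kept
    else:
        # use_reentrant=False selects the hooks-based implementation
        out = checkpoint(model, x, use_reentrant=(mode == "reentrant"))
    out.sum().backward()
    peak = torch.cuda.max_memory_allocated() if on_cuda else None
    return x.grad.detach().clone(), peak


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)).to(device)
x = torch.randn(8, 64, device=device)

results = {m: run(model, x, m) for m in ("none", "reentrant", "non_reentrant")}
```

On a CUDA device, the second element of each result can be compared across modes to check the two properties listed above; a model this small may not show a meaningful difference, so a deeper stack of layers is a better fit for an actual memory test.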

This also means we can enable non-reentrant checkpointing in CheckpointWrapper; unit tests for that will be added as well.

facebook-github-bot · 2022-06-14T01:20:32Z

✅ No Failures (0 Pending)

As of commit ad2e8ff (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

malfet

The PR targeting the release branch does not seem to match the one for master.
@rohan-varma, can you please explain the differences?

rohan-varma · 2022-06-14T20:41:44Z

Discussing offline with @malfet

rohan-varma · 2022-06-15T16:18:50Z

The only difference is the set of tests in test_checkpoint_wrapper, since that file contained tests for features merged after the branch cut. @malfet

merge fix

b5a28ae

rohan-varma requested review from mrshenli, pritamdamania87, zhaojuanmao, H-Huang, awgu and mingzhe09088 as code owners June 14, 2022 01:20

facebook-github-bot added the cla signed label Jun 14, 2022

facebook-github-bot added the oncall: distributed label Jun 14, 2022

rohan-varma mentioned this pull request Jun 14, 2022

[v.1.12.0] Release Tracker #78005

Closed

rohan-varma added the ciflow/trunk and ciflow/periodic labels Jun 14, 2022

Test fix

8246e4f

malfet requested changes Jun 14, 2022

Lint

ad2e8ff

rohan-varma requested a review from malfet June 16, 2022 03:09

malfet approved these changes Jun 17, 2022

malfet merged commit 681a6e3 into release/1.12 Jun 17, 2022

malfet deleted the cherry_pick_checkpoint branch June 20, 2022 21:00
