[WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour #19589

Open

Anner-deJong wants to merge 13 commits into base: master
Conversation

@Anner-deJong (Contributor) commented Mar 7, 2024

PR UNFINISHED, BUT THE MAJOR SETUP IS READY FOR REVIEW TO DISCUSS HOW TO CONTINUE (i.e. review the global logic and ignore the 'DO NOT SUBMIT' blocks)

(before updating docs and tests in vain, in case the community/maintainers settle on a different approach or default)

What does this PR do?

Background

Prior to this PR, in a setup using multiple optimizers (and thus a manual optimization loop), each optimizer's step() call contributes to the global step counter.

For many people this is unexpected (see #17958), as (my conjecture) the default expectation [1] is that each call to training_step counts as one increment to trainer.global_step.

This unexpected behaviour can easily go unnoticed, and leads to issues:

  • global_step-based logic (e.g. logging, manual lr schedulers) behaves unexpectedly/incorrectly (in the eyes of a user who does not expect this behaviour)
  • Personal case: I used two optimizers for two parts of my model, calling step on both in each training_step, which somehow indirectly caused my gradients to become 0: my model stopped training without me initially realizing why. The quick fix mentioned in #17958 (incorrect global_step with multiple optimizers and automatic_optimization=False) immediately resolved this.
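
For reference, a minimal sketch of the kind of setup that triggers this; the module, shapes, optimizer choices, and the lightning.pytorch import path are placeholders/assumptions, not taken from this PR:

```python
import torch
from torch import nn
from lightning.pytorch import LightningModule


class TwoOptimizerModule(LightningModule):
    """Two optimizers, both stepped in every training_step."""

    def __init__(self):
        super().__init__()
        # Multiple optimizers require manual optimization.
        self.automatic_optimization = False
        self.backbone = nn.Linear(32, 16)
        self.head = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt_backbone, opt_head = self.optimizers()
        x, y = batch
        loss = nn.functional.mse_loss(self.head(self.backbone(x)), y)

        opt_backbone.zero_grad()
        opt_head.zero_grad()
        self.manual_backward(loss)

        # Prior to this PR, each of these two calls increments trainer.global_step,
        # so global_step advances by 2 per training_step instead of 1.
        opt_backbone.step()
        opt_head.step()

    def configure_optimizers(self):
        return [
            torch.optim.Adam(self.backbone.parameters(), lr=1e-4),
            torch.optim.Adam(self.head.parameters(), lr=1e-3),
        ]
```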

Proposed behaviour

  • In line with expectation [1] above, the default mode could be (without the user having to do anything, in the spirit of PyTorch Lightning's "one less thing to remember") that only the last optimizer (in the order provided by the user) increments the counter. This is perfect for less complicated use cases, e.g. a low-lr optimizer for a pre-trained backbone and a separate optimizer + scheduler for a newly trained head, with each optimizer's step called in every training_step.
  • For more complicated settings, e.g. GANs, users can specify themselves how they want global_step to behave (with the documentation updated accordingly); see the sketch after this list.
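
As an illustration of such a more involved case, a rough GAN-style sketch (the loss computations and module internals are placeholders, not code from this PR) where the two optimizers are stepped at different frequencies, so a user may well want to decide which one drives global_step:

```python
class GANModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.generator = nn.Linear(8, 8)
        self.discriminator = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        fake = self.generator(torch.randn_like(batch))

        # Discriminator update on every batch (placeholder loss).
        d_loss = self.discriminator(fake.detach()).mean() - self.discriminator(batch).mean()
        opt_d.zero_grad()
        self.manual_backward(d_loss)
        opt_d.step()

        # Generator update only on every other batch (placeholder loss).
        if batch_idx % 2 == 0:
            g_loss = -self.discriminator(self.generator(torch.randn_like(batch))).mean()
            opt_g.zero_grad()
            self.manual_backward(g_loss)
            opt_g.step()

    def configure_optimizers(self):
        # Under the proposed default, only the last optimizer listed here
        # (the discriminator's) would increment trainer.global_step.
        return [
            torch.optim.Adam(self.generator.parameters(), lr=2e-4),
            torch.optim.Adam(self.discriminator.parameters(), lr=2e-4),
        ]
```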

Actual PR implications

The current PR does NOT implement expectation [1] per se. Instead, it is a simpler update in that direction that should still offer sufficient versatility (even with a single incrementing optimizer, the count can still be misaligned with the number of training_step calls if that optimizer is not stepped on every step, or is stepped multiple times per step).

  • it implements the logic for a user to specify which optimizers should count towards the global_step counter (see the sketch after this list)
  • it takes the default (when the user does not specify anything) to be that only the last optimizer increments (in line with the Proposed behaviour above)
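
The exact user-facing interface lives in the PR diff rather than in this description; the commit messages refer to a should_increment flag, so purely as a hypothetical illustration (the key name and dict layout below are not confirmed as the actual API), marking which optimizers count could look roughly like this, reusing the TwoOptimizerModule sketch above:

```python
    def configure_optimizers(self):
        # Hypothetical interface: the "should_increment" key is illustrative only;
        # consult the PR diff for the real mechanism. Marked optimizers count
        # towards trainer.global_step, unmarked ones do not.
        return [
            {"optimizer": torch.optim.Adam(self.backbone.parameters(), lr=1e-5),
             "should_increment": False},
            {"optimizer": torch.optim.Adam(self.head.parameters(), lr=1e-3),
             "should_increment": True},
        ]
```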

Alternative solutions:

  • trainer.global_step could be decoupled from the optimizers and simply count training_step function calls. Clean on the surface, this sounds more invasive under the hood (presumably there was a good reason to base it on optimizer.step() calls rather than training_step calls), and it would no longer allow custom counting behaviour unless we link the two again, which sounds like it would lead to a messy code setup.
  • The same PR but with a different 'default', e.g. preserving the current behaviour going forward (see breaking changes below) and instead offering the proposed default above as an opt-in option.

Fixes #17958

Indirectly: this changes the global_step counter and thus changes behaviour. Anybody who does not expect the former behaviour will have their counter fixed so that it makes sense to them; anybody who does expect the former behaviour will have to update their configure_optimizers to preserve it.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [wip] Did you make sure to update the documentation with your changes? (if necessary)
  • [wip] Did you write any new necessary tests? (not for typos and docs)
  • [wip] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19589.org.readthedocs.build/en/19589/

@github-actions bot added labels docs (Documentation related) and pl (Generic label for PyTorch Lightning package) on Mar 7, 2024
@Anner-deJong (Contributor, Author) commented:

@carmocca Tagging you as you seem most relevant regarding the manual optimization loop, sorry.. 😬

I'm eager to continue this / clean it up / adjust according to discussions - any opinion?
If it looks good to you, I'll go ahead and clean up the implementation 🙏

Anner-deJong and others added 3 commits March 11, 2024 18:52
…gy stateful so it has access each time when re-assigning. Other bug left: multi dict does not return all should_increment, ending with mismatch of optimizer and should_increment size
…to bugfix/17958_multi_optimizer_step_count_behaviour
@Anner-deJong Anner-deJong changed the title [WIP BUT BLOCKED ON MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour [WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour Mar 13, 2024
@Anner-deJong (Contributor, Author) commented:

@awaelchli - didn't hear back from carmocca yet, trying you instead :)
Note: I know this PR needs more work; I just need some high level feedback at this stage to know how to finish up the PR, and am thus currently blocked.

Any feedback is much appreciated 🙏

Labels: docs (Documentation related), pl (Generic label for PyTorch Lightning package)
Development

Successfully merging this pull request may close these issues.

incorrect global_step with multiple optimizers and automatic_optimization=False