[WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour #19589

Open

Anner-deJong wants to merge 13 commits into base: master
Conversation

@Anner-deJong (Contributor) commented Mar 7, 2024

PR UNFINISHED, BUT THE MAJOR SETUP IS READY FOR REVIEW TO DISCUSS HOW TO CONTINUE (i.e. review the global logic and ignore the 'DO NOT SUBMIT' blocks)

(before updating docs and tests in vain, in case the community/maintainers settle on a different approach or default)

What does this PR do?

Background

Prior to this PR, in a setup using multiple optimizers (and thus a manual optimization loop), each optimizer's step() call contributes to the global step counter.

For many people this is unexpected (see #17958), as (my conjecture) the default expectation [1] is that each call to training_step counts as one increment to trainer.global_step.

This unexpected behaviour can easily go unnoticed, and leads to issues:

  • global_step-based logic (e.g. logging, manual lr schedulers) behaves unexpectedly/incorrectly (in the eyes of a user who does not expect this behaviour)
  • Personal case: I used two optimizers for two parts of my model, calling step on both in each training_step, which somehow indirectly caused my gradients to become 0: my model stopped training without me initially realizing why. The quick fix mentioned in #17958 (incorrect global_step with multiple optimizers and automatic_optimization=False) immediately resolved this.
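
For reference, a minimal sketch of the kind of setup that triggers this; the module, shapes, optimizer choices, and the lightning.pytorch import path are placeholders/assumptions, not taken from this PR:

```python
import torch
from torch import nn
from lightning.pytorch import LightningModule


class TwoOptimizerModule(LightningModule):
    """Two optimizers, both stepped in every training_step."""

    def __init__(self):
        super().__init__()
        # Multiple optimizers require manual optimization.
        self.automatic_optimization = False
        self.backbone = nn.Linear(32, 16)
        self.head = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt_backbone, opt_head = self.optimizers()
        x, y = batch
        loss = nn.functional.mse_loss(self.head(self.backbone(x)), y)

        opt_backbone.zero_grad()
        opt_head.zero_grad()
        self.manual_backward(loss)

        # Prior to this PR, each of these two calls increments trainer.global_step,
        # so global_step advances by 2 per training_step instead of 1.
        opt_backbone.step()
        opt_head.step()

    def configure_optimizers(self):
        return [
            torch.optim.Adam(self.backbone.parameters(), lr=1e-4),
            torch.optim.Adam(self.head.parameters(), lr=1e-3),
        ]
```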

Proposed behaviour

  • In line with expectation [1] above, the default mode could be (without the user having to do anything, in the spirit of PyTorch Lightning's "one less thing to remember") that only the last optimizer (in the order provided by the user) increments the counter. This is perfect for less complicated use cases, e.g. a low-lr optimizer for a pre-trained backbone and a separate optimizer + scheduler for a newly trained head, with each optimizer's step called in every training_step.
  • For more complicated settings, e.g. GANs, users can specify themselves how they want global_step to behave (with the documentation updated accordingly); see the sketch after this list.
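
As an illustration of such a more involved case, a rough GAN-style sketch (the loss computations and module internals are placeholders, not code from this PR) where the two optimizers are stepped at different frequencies, so a user may well want to decide which one drives global_step:

```python
class GANModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.generator = nn.Linear(8, 8)
        self.discriminator = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        opt_g, opt_d = self.optimizers()
        fake = self.generator(torch.randn_like(batch))

        # Discriminator update on every batch (placeholder loss).
        d_loss = self.discriminator(fake.detach()).mean() - self.discriminator(batch).mean()
        opt_d.zero_grad()
        self.manual_backward(d_loss)
        opt_d.step()

        # Generator update only on every other batch (placeholder loss).
        if batch_idx % 2 == 0:
            g_loss = -self.discriminator(self.generator(torch.randn_like(batch))).mean()
            opt_g.zero_grad()
            self.manual_backward(g_loss)
            opt_g.step()

    def configure_optimizers(self):
        # Under the proposed default, only the last optimizer listed here
        # (the discriminator's) would increment trainer.global_step.
        return [
            torch.optim.Adam(self.generator.parameters(), lr=2e-4),
            torch.optim.Adam(self.discriminator.parameters(), lr=2e-4),
        ]
```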

Actual PR implications

The current PR does NOT implement expectation [1] per se. Instead, it is a simpler update in that direction that should still offer sufficient versatility (even with a single incrementing optimizer, the count can still be misaligned with the number of training_step calls if that optimizer is not stepped on every step, or is stepped multiple times per step).

  • it implements the logic for a user to specify which optimizers should count towards the global_step counter (see the sketch after this list)
  • it takes the default (when the user does not specify anything) to be that only the last optimizer increments (in line with the Proposed behaviour above)
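
The exact user-facing interface lives in the PR diff rather than in this description; the commit messages refer to a should_increment flag, so purely as a hypothetical illustration (the key name and dict layout below are not confirmed as the actual API), marking which optimizers count could look roughly like this, reusing the TwoOptimizerModule sketch above:

```python
    def configure_optimizers(self):
        # Hypothetical interface: the "should_increment" key is illustrative only;
        # consult the PR diff for the real mechanism. Marked optimizers count
        # towards trainer.global_step, unmarked ones do not.
        return [
            {"optimizer": torch.optim.Adam(self.backbone.parameters(), lr=1e-5),
             "should_increment": False},
            {"optimizer": torch.optim.Adam(self.head.parameters(), lr=1e-3),
             "should_increment": True},
        ]
```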

Alternative solutions:

  • trainer.global_step could be decoupled from the optimizers and simply count training_step function calls. Clean on the surface, this sounds more invasive under the hood (presumably there was a good reason to base it on optimizer.step() calls rather than training_step calls), and it would no longer allow custom counting behaviour unless we link the two again, which sounds like it would lead to a messy code setup.
  • The same PR but with a different 'default', e.g. preserving the current behaviour going forward (see breaking changes below) and instead offering the proposed default above as an opt-in option.

Fixes #17958

Indirectly: this changes the global_step counter and thus changes behaviour. Anybody who does not expect the former behaviour will have their counter fixed so that it makes sense to them; anybody who does expect the former behaviour will have to update their configure_optimizers to preserve it.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [wip] Did you make sure to update the documentation with your changes? (if necessary)
  • [wip] Did you write any new necessary tests? (not for typos and docs)
  • [wip] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--19589.org.readthedocs.build/en/19589/

@github-actions bot added labels docs (Documentation related) and pl (Generic label for PyTorch Lightning package) on Mar 7, 2024
@Anner-deJong (Contributor, Author) commented:

@carmocca Tagging you as you seem most relevant regarding the manual optimization loop, sorry.. 😬

I'm eager to continue this / clean it up / adjust according to discussions - any opinion?
If it looks good to you, I'll go ahead and clean up the implementation 🙏

Anner-deJong and others added 3 commits March 11, 2024 18:52
…gy stateful so it has access each time when re-assigning. Other bug left: multi dict does not return all should_increment, ending with mismatch of optimizer and should_increment size
…to bugfix/17958_multi_optimizer_step_count_behaviour
@Anner-deJong Anner-deJong changed the title [WIP BUT BLOCKED ON MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour [WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour Mar 13, 2024
@Anner-deJong (Contributor, Author) commented:

@awaelchli - didn't hear back from carmocca yet, trying you instead :)
Note: I know this PR needs more work; I just need some high level feedback at this stage to know how to finish up the PR, and am thus currently blocked.

Any feedback is much appreciated 🙏

Labels: docs (Documentation related), pl (Generic label for PyTorch Lightning package)
Development

Successfully merging this pull request may close these issues.

incorrect global_step with multiple optimizers and automatic_optimization=False