Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

on_train_epoch_end and on_epoch_end are out of order #4001

Closed
ghost opened this issue Oct 8, 2020 · 11 comments 路 Fixed by #4010
Closed

on_train_epoch_end and on_epoch_end are out of order #4001

ghost opened this issue Oct 8, 2020 · 11 comments 路 Fixed by #4010
Labels
bug Something isn't working help wanted Open to be worked on
Milestone

Comments

@ghost
Copy link

ghost commented Oct 8, 2020

馃悰 Bug

Consider the following order in which the LightningModule hooks are called from #2816 (I have confirmed that in PytorchLightning version 0.10 this is still an issue):

on_epoch_start
on_train_epoch_start
on_validation_start
on_validation_epoch_start
on_validation_epoch_end
on_validation_end
on_epoch_end
on_train_epoch_end

Naturally one would expect the opening and closing scope hooks to match. However, on_train_epoch_end is called after on_epoch_end, which seems incorrect. It is natural to open the epoch scope before the train epoch scope (as is being done currently), in which case the epoch scope should be closed after closing the train epoch scope (which is not currently being done)

  • PyTorch Version (e.g., 1.0): 1.6.0
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.8.5
  • CUDA/cuDNN version: NA
  • GPU models and configuration: NA
  • Any other relevant information: NA
@ghost ghost added bug Something isn't working help wanted Open to be worked on labels Oct 8, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Oct 8, 2020

Hi! thanks for your contribution!, great first issue!

@ghost ghost changed the title on_train_epoch_end and on_epoch_end are out of order on_train_epoch_end and on_epoch_end are out of order Oct 8, 2020
@Lightning-AI Lightning-AI deleted a comment from edenlightning Oct 8, 2020
@Borda
Copy link
Member

Borda commented Oct 8, 2020

@wyessen it seems that the flow is incorrect the on_train_epoch_end shall be before on_validation_start
cc: @williamFalcon

@ghost
Copy link
Author

ghost commented Oct 8, 2020

@Borda

True, but depends how we should look at it: should validation be considered part of the training epoch scope? If so, then the current flow is fine; otherwise, you're right, it's incorrect.

So, the original bug report complains about incorrect closing of the scope given the order in which the scopes were opened. You raise a valid issue, and in the broader scheme of things the current flow should be reconsidered.

@williamFalcon
Copy link
Contributor

williamFalcon commented Oct 8, 2020

yes. validation is part of the flow. as mentioned many times, big research requires checking val multiple times within an epoch

Train: --------------------------- (1 epoch = 2 days)
Val -- -- --

Then we have:

e1 = on_epoch_start
e2 = on_epoch_end
t1 = on_train_epoch_start
t2 = on_train_epoch_end
v1 = on_val_epoch_start
v2 = on_val_epoch_end

Train: e1 t1 ------------------------------------------------------------- t2 e2
Val : ________(e1 v1 --val--- v2 e2) ___ (e1 v1 --val---v2 e2)

@Borda
Copy link
Member

Borda commented Oct 8, 2020

here is an added test to check the actual flow #4010
I think the confusion comes from using smaller models or just one validation per epoch, then you would expect to have called the validation after training...

@ghost
Copy link
Author

ghost commented Oct 8, 2020

@williamFalcon @Borda So as to not hijack the original bug report, I want to clarify:

The flow executed by PytorchLightning is incorrect in the sense that opening of a scope (with _start) does not match closing of the scope with (with _end). In particular on_epoch_end is called before on_train_epoch_end, which is not correct.

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
@ghost
Copy link
Author

ghost commented Oct 20, 2020

@SeanNaren Why did you close this issue? Your PR does not fix this.

@SeanNaren SeanNaren reopened this Oct 20, 2020
@SeanNaren
Copy link
Contributor

SeanNaren commented Oct 20, 2020

Apologies, I think the PR associated with this issue was incorrect!

EDIT: after looking at the associated PR and the discussion here, I do think this PR addresses the issue of ensuring the order is correct. Was there anything in particular that wasn't addressed @wyessen?

@Borda
Copy link
Member

Borda commented Oct 20, 2020

@wyessen this was closed with @williamFalcon explanation that the behavior is as expected 馃惏

@SeanNaren
Copy link
Contributor

Will close again for now...

@ghost
Copy link
Author

ghost commented Oct 20, 2020

@Borda the behavior is not expected, please read my explanation (@williamFalcon was responding to @Borda鈥檚 message, which was different from my original issue).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Open to be worked on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants