autograd.detect_anomaly detects inf/nan in the backward pass.
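For reference, the existing backward-pass detection is used roughly like this (tiny self-contained example, model/x are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
x = torch.randn(2, 4)

# Existing tool: only the backward pass is covered.
with torch.autograd.detect_anomaly():
    out = model(x)    # non-finite values produced here are not flagged
    loss = out.sum()
    loss.backward()   # anomalies in the backward pass raise with a traceback
```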
I want the same for the forward pass, with the possibility to whitelist a few special operations, modules, or code blocks, e.g. masking attention energies to -inf.
Maybe via a post forward hook for every module?
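A rough sketch of what such a hook-based check could look like (install_nan_hooks and the name-based whitelist are just illustrative, not an existing API):

```python
import torch
import torch.nn as nn

def install_nan_hooks(model: nn.Module, whitelist=()):
    """Register a post-forward hook on every submodule that raises when an
    output contains inf/nan. Modules whose names are in `whitelist` (e.g. an
    attention-masking module that legitimately produces -inf) are skipped."""
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and not torch.isfinite(o).all():
                    raise RuntimeError(f"non-finite values in output of {name!r}")
        return hook

    handles = []
    for name, module in model.named_modules():
        if name in whitelist:
            continue
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles  # call h.remove() on each handle to uninstall
```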
Or using sys.settrace to install some function which would inspect the locals? Also see https://discuss.pytorch.org/t/detect-inf-nan-in-forward-pass/190514.
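A very crude sketch of the sys.settrace variant, purely to illustrate the idea (extremely slow, and nothing here is an existing PyTorch API):

```python
import sys
import torch

def install_tensor_tracer():
    """After each executed Python line, scan the frame's locals for tensors
    with non-finite values and report the source location. Only affects code
    executed after installation, and re-reports tensors that stay non-finite."""
    def trace(frame, event, arg):
        if event == "line":
            for name, value in frame.f_locals.items():
                if torch.is_tensor(value) and not torch.isfinite(value).all():
                    print(f"non-finite tensor {name!r} at "
                          f"{frame.f_code.co_filename}:{frame.f_lineno}")
        return trace

    sys.settrace(trace)
```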
E.g. I have a model which gets a NaN loss (i.e. already in the forward pass), directly at the first step, and I want to know where it happens. (This is with AMP float16, so maybe pytorch/pytorch#40497 is related. But that is only about AMP, not so much about the issue here of adding such detection.)
The same mechanism can then also be used to collect statistics on activations (mean, min, max, std, var, median, L2, whatever).
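For instance, the same hook sketch from above could record statistics instead of raising (again just illustrative; install_stats_hooks and the stats dict are made up):

```python
import torch
import torch.nn as nn

def install_stats_hooks(model: nn.Module, stats: dict):
    """Reuse the forward-hook mechanism to record per-module activation
    statistics instead of (or in addition to) checking for inf/nan."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                o = output.detach().float()
                stats[name] = {
                    "mean": o.mean().item(),
                    "std": o.std().item(),
                    "min": o.min().item(),
                    "max": o.max().item(),
                    "l2": o.norm().item(),
                }
        return hook

    return [m.register_forward_hook(make_hook(n))
            for n, m in model.named_modules()]
```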
Note that the same for parameters is much easier, as we can simply iterate over them.
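E.g. a simple check over parameters could just be:

```python
import torch

def check_parameters(model):
    # Parameters are directly accessible, so no hooks are needed.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            print(f"non-finite values in parameter {name!r}")
```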