
crashing dump for hparams containing fsspec #4156

Closed
Borda opened this issue Oct 14, 2020 · 6 comments · Fixed by #4158
Labels
bug (Something isn't working), help wanted (Open to be worked on)

Comments

@Borda
Member

Borda commented Oct 14, 2020

🐛 Bug

If you pass a model as a PL model argument and then dump this model's hparams, it crashes...

Please reproduce using the BoringModel and post here

Several models are failing in Bolts:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/runs/1254957648
Sample of a failing test:
https://github.com/PyTorchLightning/pytorch-lightning-bolts/blob/ccdc9f9952153abf9b12f00e05294fa332d5f424/tests/models/test_vision.py#L8-L31

Additional context

        cls = type(data)
        if cls in copyreg.dispatch_table:
            reduce = copyreg.dispatch_table[cls](data)
        elif hasattr(data, '__reduce_ex__'):
>           reduce = data.__reduce_ex__(2)
E           TypeError: __reduce_ex__() takes exactly one argument (0 given)
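For context, dumping hparams walks every object reachable from them, and serialization calls `__reduce_ex__` on each one, so a single unserializable object aborts the whole dump. A minimal, hypothetical picklability probe (illustrative only, not Lightning code) that a dump routine could use to guard against this:

```python
import pickle

def is_picklable(obj) -> bool:
    """Probe whether obj survives pickling.

    Serialization of hparams fails as soon as any object reachable
    from them cannot be reduced, so probing each value up front lets
    a dump routine skip offenders instead of crashing.
    """
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        # Broken __reduce_ex__ implementations can raise TypeError,
        # PicklingError, AttributeError, etc. -- treat any failure
        # as "not serializable".
        return False
```

With such a probe, a dump routine can skip (and warn about) offending entries rather than crash at runtime.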
Borda added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Oct 14, 2020
Borda changed the title from "crashing dump for hparams containing Model" to "crashing dump for hparams containing DataModule" on Oct 14, 2020
@Borda
Member Author

Borda commented Oct 14, 2020

Seems that there is something special about Bolts' MNISTDataModule; any idea, @nateraw?

@nateraw
Contributor

nateraw commented Oct 14, 2020

This is the exact error I ran into in my PR as well. I don't know what's causing it 🙁

@Borda
Member Author

Borda commented Oct 14, 2020

OK, so tracing further: the DataModule has a pointer to the Trainer, and somewhere in the Trainer there is <fsspec.implementations.local.LocalFileSystem object at 0x13ba296a0>, which causes this problem...

EDIT: the reference chain is MNISTDataModule -> Trainer -> ModelCheckpoint -> fsspec
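This kind of manual tracing can be automated: walk the attribute graph from the hparams and record every path whose value fails to pickle. A hedged sketch, where `find_unpicklable` is a hypothetical debugging helper (not part of Lightning or Bolts):

```python
import pickle

def find_unpicklable(obj, path="root", seen=None):
    """Walk an object graph and report attribute paths that fail to pickle.

    Illustrative helper for locating a chain such as
    datamodule -> trainer -> checkpoint_callback -> fs.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:        # guard against cycles (e.g. DM -> Trainer -> DM)
        return []
    seen.add(id(obj))
    try:
        pickle.dumps(obj)
        return []              # everything reachable from here serializes fine
    except Exception:
        bad = [path]           # this node (or something below it) is the culprit
    # descend into instance attributes to narrow the failure down
    if hasattr(obj, "__dict__"):
        for name, value in vars(obj).items():
            bad.extend(find_unpicklable(value, f"{path}.{name}", seen))
    return bad
```

Running such a helper on the failing hparams would surface the path ending at the fsspec filesystem object.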

Borda changed the title from "crashing dump for hparams containing DataModule" to "crashing dump for hparams containing fsspec" on Oct 14, 2020
@Borda
Member Author

Borda commented Oct 14, 2020

{'datamodule': <pl_bolts.datamodules.mnist_datamodule.MNISTDataModule object at 0x13b991dd8>, 'embed_dim': 16, 'heads': 2, 'layers': 2, 'pixels': 28, 'vocab_size': 16, 'num_classes': 10, 'classify': False, 'batch_size': 64, 'learning_rate': 0.01, 'steps': 25000, 'data_dir': '.', 'num_workers': 8}
vvv
<pl_bolts.datamodules.mnist_datamodule.MNISTDataModule object at 0x13b991dd8>
vvv
<pytorch_lightning.trainer.trainer.Trainer object at 0x12ec29780>
vvv
[<pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x13ba29630>, <pytorch_lightning.callbacks.progress.ProgressBar object at 0x13ba29668>]
vvv
<pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x13ba29630>
vvv
<fsspec.implementations.local.LocalFileSystem object at 0x13ba296a0>

@ananthsub
Contributor

Should hparams be a more formal concept in the LightningModule? Given how complex the objects passed to the module or the trainer can be, should users explicitly set the hyperparameters their module uses? And should Lightning skip saving them if they're not set? I wonder how the frame-capture logic for arguments is going to hold up over time.

@Borda
Member Author

Borda commented Oct 15, 2020

> Should hparams be a more formal concept in the LightningModule? Given how complex the objects passed to the module or the trainer can be, should users explicitly set the hyperparameters their module uses? And should Lightning skip saving them if they're not set? I wonder how the frame-capture logic for arguments is going to hold up over time.

Good questions. We do not want to limit users too much; I made the following two fixes:

  1. Save only the parameters that can be saved; if one is invalid, skip it and warn the user. This just prevents the crash at runtime: Bugfix/4156 filter hparams for yaml - fsspec #4158
  2. The problem comes from the runtime Trainer linking to the instances it uses, so it is impossible to do any static or other checks before training starts. These runtime changes are also not needed to reconstruct the init state, since we save just the initial hparams: save initial arguments #4163

Note that each of these PRs solves just a part of the issue.
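The strategy in fix 1 ("save what you can, warn about the rest") can be sketched as follows. PR #4158 probes YAML serialization; this illustration uses pickle as a dependency-free stand-in for the same idea, and `filter_picklable_hparams` is a hypothetical name, not the actual Lightning function:

```python
import pickle
import warnings

def filter_picklable_hparams(hparams: dict) -> dict:
    """Keep only hparams whose values can be serialized.

    Sketch of the 'skip invalid entries and warn' strategy; the real
    fix probes YAML dumping, while pickle is used here as a stand-in.
    """
    kept = {}
    for key, value in hparams.items():
        try:
            pickle.dumps(value)  # probe: does this value serialize?
        except Exception:
            warnings.warn(
                f"Skipping hparam '{key}': value is not serializable"
            )
            continue
        kept[key] = value
    return kept
```

Applied to the hparams above, this would drop the `datamodule` entry (whose Trainer reference drags in the fsspec filesystem) while keeping plain values like `batch_size` and `learning_rate`.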
