[RFC] Simpler mechanism to publish data from LightningModule and Callback to Trainer #11715
The design of "logging" has been iteratively developed since before 1.0. A small recap of the current design:
The debugging experience is hard by design. A minimalistic API will always be a tradeoff between flexibility and visibility. This is great for the general user base but tricky when things go wrong. It's not really a problem, but a tradeoff.
It might seem that way from the viewpoint of a user unaware of the internal design, but different components take responsibility for the different actions:
The trainer loops drive this, using the
This is not really part of logging but part of loggers. I understand that
I'm not sure what you mean by this point. Everything is reset at the moment. There's an issue suggesting customization: #11262 (comment)
Yes, but why would that be a problem? One of the biggest features of Lightning is to be able to easily use multiple dataloaders.
This is on purpose as we try to be as efficient as possible when tensors are logged. We could simplify a lot if we always forced users to use Metrics but then we are adding boilerplate for simple use cases.
Honestly, there's nothing wrong with your pitch. It basically has the same underlying ideas as However, if you were to take this proposal and try supporting all the features One thing to note is that logging is technically optional in Lightning. You could choose to not
I think @ananthsub's pitch mitigates a bunch of the issues I have run into while using metrics in Lightning. Sharing my experience as a researcher: I tried using torchmetrics inside Lightning and noticed a few peculiar behaviors. After spending a few hours making sure there were no bugs, I wanted to note my observations so that other users can make note of them, and to ask what the recommended way to log meters is.
I think this is a fine call to make, but if it is made, then we should be clear in saying Lightning chooses a "minimalistic API" over debuggability and user control; there is no free lunch. For certain things I see no way in the docs to take control back from Lightning (for example, how do I compute a metric without the syncing?).
It seems the
The default I would like to reiterate that one could circumvent the logging internals entirely by using Doing so is very different from "magically" If you (or any future readers) have issues getting a self-managed logging solution like this to work, I will be happy to help with bugs or refactors to make the internals more flexible. Thank you!
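For readers who want to try the self-managed route described above, here is a minimal, dependency-free sketch of the pattern. Note that `MeanMetric` below is a hand-rolled stand-in for a torchmetrics-style metric (with `update`/`compute`/`reset` semantics), not the real class, so no syncing or Lightning logging internals are involved:

```python
# Sketch of self-managed metric accumulation that bypasses self.log entirely.
# `MeanMetric` is a stand-in mimicking torchmetrics' update/compute/reset
# contract; it is NOT the torchmetrics class.
class MeanMetric:
    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> None:
        # Accumulate per-step values (e.g. inside training_step).
        self.total += value
        self.count += 1

    def compute(self) -> float:
        # Read the aggregate once, e.g. at epoch end.
        return self.total / max(self.count, 1)

    def reset(self) -> None:
        # With self-managed metrics, resetting each epoch is your job.
        self.total, self.count = 0.0, 0

train_loss = MeanMetric()
for loss in [0.9, 0.7, 0.5]:       # per-step updates
    train_loss.update(loss)
epoch_loss = train_loss.compute()  # read at epoch end
train_loss.reset()                 # reset manually for the next epoch
print(round(epoch_loss, 4))        # 0.7
```

Because you own the metric object, you decide when (or whether) any cross-rank synchronization happens, and you can hand the computed value directly to your logger of choice.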
🚀 Feature
This RFC expresses a desire for a simpler mechanism to send metric data from the LightningModule and Callback to the Trainer.
Motivation
The current approach with `LightningModule.log` has many downsides:

- Poor debugging experience: we started seeing this sort of failure ([RFC] Deprecate the move_metrics_to_cpu Trainer argument. #10595) after Save the loop progress state by default #10784. This is an example stacktrace: https://gist.github.com/ananthsub/45c154145d0f852503c6a547f59e91f0. It is very hard to tell where in logging I went wrong. We see this error even after updating our torchmetrics dependency.
- `LightningModule.log` conflates too many things: `log_every_n_steps` and the global step. Many of these assumptions came from the original `Result` object, a class that preceded the whole torchmetrics project.
- The `log` API is not straightforward to use, given the large number of options available and the differing implementation details when logging floats/tensors vs torchmetric `Metric` objects.

Pitch
Provide a new API like this:
Example calling code:
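The RFC's original code blocks did not survive here, so the following is only a hedged sketch of what a `put`-style API might look like. The names `PublishMixin`, `put`, and `_drain` are assumptions for illustration and are not part of Lightning's actual API:

```python
# Hypothetical sketch of the proposed API; all names here are invented.
from typing import Any, Dict

class PublishMixin:
    """Hooks publish plain key/value data; the trainer drains it afterwards."""

    def __init__(self) -> None:
        self._published: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # No sync_dist / on_step / on_epoch / rank_zero_only options:
        # the value is stored as-is; any reduction is the caller's concern.
        self._published[key] = value

    def _drain(self) -> Dict[str, Any]:
        # Called by the trainer after each hook: read the data and reset.
        data, self._published = self._published, {}
        return data

# Example calling code (a toy stand-in for a LightningModule):
class MyModule(PublishMixin):
    def training_step(self, loss: float) -> None:
        self.put("train_loss", loss)

module = MyModule()
module.training_step(0.25)
print(module._drain())  # {'train_loss': 0.25}
print(module._drain())  # {} -- the buffer resets after each drain
```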
The trainer already calls all of the hooks offered by the LightningModule & Callback APIs. We have logic in the trainer that can inspect the data and reset it after every hook is called: https://github.com/PyTorchLightning/pytorch-lightning/blob/9ebd7df22acc6e0de4569edacd0ec8319ab4be21/pytorch_lightning/trainer/trainer.py#L1522-L1587. This means data can be taken from there, tagged with the global_step or other metadata the trainer is aware of, and routed to the relevant destinations (callbacks/loggers/metrics).
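As a toy illustration of that trainer-side routing (all names invented; this is not Lightning code), a loop could drain published data after each hook, attach metadata it knows about, and fan the record out:

```python
# Sketch only: `TinyTrainer`, `_call_hook`, and `logged` are hypothetical
# names illustrating the drain-tag-route idea from the paragraph above.
from typing import Any, Callable, Dict, List

class TinyTrainer:
    def __init__(self) -> None:
        self.global_step = 0
        self.logged: List[Dict[str, Any]] = []

    def _call_hook(self, hook: Callable[[], Dict[str, Any]]) -> None:
        data = hook()  # stand-in for: call the hook, then drain its data
        if data:
            # Attach metadata only the trainer knows (here, the step)
            # before routing to callbacks / loggers / the progress bar.
            self.logged.append({"step": self.global_step, **data})

trainer = TinyTrainer()
trainer._call_hook(lambda: {"train_loss": 0.5})
print(trainer.logged)  # [{'step': 0, 'train_loss': 0.5}]
```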
Pros:

- No need for options like `sync_dist`, `on_step`/`on_epoch`, `rank_zero_only`, or `metric_attribute`, amongst the others `log` supports today.
- Naming this `put` instead of `log` makes clear that it's separate from "logging" like Python logging and Lightning's own Loggers. This is generally a means through which the user passes data to the Trainer for usage in other places like callbacks, the progress bar, or loggers.

Alternatives
Additional context
cc @Borda @tchaton @justusschock @awaelchli @carmocca @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy