HDFDataset (or generic dataset) post processing #1505
Comments
I would be very interested in this functionality, but I have not put too much thought yet into how this would look best. I am definitely a fan of using MetaDataset for everything, so I would not mind if it were part of that.
Commenting on your 3 suggestions in order:

1. Well, you would factor this out, to have the logic in some common class. But yes, I kind of agree.
2. Well, we do use combinations/transformations of datasets already, e.g. see … So, such post-processing logic would not really add anything new there; it fits very naturally in how …
3. There is also another aspect which becomes ambiguous: the … Note, we also have the …
Regarding 1: @JackTemaki argued that he (and many others) use … anyway.

Regarding …: So, how would that variant look? Just a function in the config, like the following?

    def dataset_post_process(data: TensorDict) -> TensorDict: ...
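As a rough illustration of such a config function, here is a sketch. A plain dict of numpy arrays stands in for RETURNN's `TensorDict`, and the key name `"data"` and the normalization step are illustrative assumptions, not part of the actual proposal:

```python
import numpy as np


def dataset_post_process(data: dict) -> dict:
    """Illustrative post-processing: mean/variance-normalize audio features.

    `data` is a plain dict mapping data keys to numpy arrays, standing in
    for a TensorDict holding one sequence (a simplification for this sketch).
    """
    feats = data["data"]  # assumed shape [time, feature_dim]
    data["data"] = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    return data


seq = {"data": np.random.RandomState(42).randn(100, 40) * 3.0 + 5.0}
out = dataset_post_process(seq)
print(out["data"].shape)  # (100, 40)
```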
We definitely need to distinguish train and dev datasets. If we do data augmentation for training, we don't necessarily want to do it for cross-validation. If we are doing some sort of format conversion, then it would be needed for both. So in the end this should be a user choice. We could also pass the type of dataset (train/dev/...) as one argument to the post-processing function.
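For example, with the dataset type passed as an argument, a user function could branch on it. A hedged sketch (the time-masking augmentation and the plain-dict stand-in for a `TensorDict` are illustrative assumptions):

```python
import numpy as np


def dataset_post_process(data: dict, *, train: bool = False) -> dict:
    """Apply augmentation only for training; leave dev/eval data untouched.

    The SpecAugment-like time masking here is just an example choice.
    """
    if train:
        feats = data["data"].copy()
        num_frames = feats.shape[0]
        start = np.random.randint(0, max(1, num_frames - 10))
        feats[start:start + 10] = 0.0  # zero out a short time span
        data["data"] = feats
    return data


feats = np.ones((50, 8))
dev_out = dataset_post_process({"data": feats.copy()}, train=False)
train_out = dataset_post_process({"data": feats.copy()}, train=True)
print(dev_out["data"].min(), train_out["data"].min())  # 1.0 0.0
```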
How would the user specify such a post-processing function per dataset? It could be another argument for the dataset itself, so the user specifies it like:

    train = {
        ...,
        "post_process": my_train_dataset_post_proc,
    }
    dev = {
        ...,
        "post_process": my_dev_dataset_post_proc,
    }

It's a bit ugly, because … Alternatively:

    dataset_post_process_funcs = {
        "train": my_train_dataset_post_proc,
        "dev": my_dev_dataset_post_proc,
    }

This is maybe fine for the training task, but for search or forward it's ambiguous, and it also doesn't really work if RETURNN is used for scripting and not as a standalone tool.
Could we add this to (every) engine class? The engine knows what kind of task it performs and what data loader it uses for that task, and could pick the correct post-processing function for that task.
The post-processing function is not per task but per dataset, at least that is what I wrote above. Or do you want to have it per task? But I guess you don't really want it per task, but rather depending on whether you train or eval? Or maybe a generic:

    def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
Sorry, I was not precise enough. What I meant was that in the engine class, you know what the dataset is used for and which name in the config it comes from (I hope). Then one can select the correct post-processing function to go with it (by picking it from the dict shown above with the correct key).
No, you don't. E.g. we have this API for forward:

    def forward_with_callback(self, *, dataset: Dataset, callback: ForwardCallbackIface): ...

Or this API for the init (including training):

    def init_train_from_config(
        self,
        config: Optional[Config] = None,
        train_data: Optional[Dataset] = None,
        dev_data: Optional[Dataset] = None,
        eval_data: Optional[Dataset] = None,
    ): ...

It is handled by our … But yes, so I guess we can simply use this API:

    def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
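To make the call site concrete, wiring such an API up could look roughly like the following sketch, under the simplifying assumption that a dataset can be treated as an iterable of per-sequence dicts (the name `apply_post_process` and the toy data are hypothetical, not RETURNN API):

```python
from typing import Callable, Iterable, Iterator


def apply_post_process(
    dataset: Iterable[dict],
    post_process: Callable[..., dict],
    *,
    train: bool = False,
) -> Iterator[dict]:
    """Yield each sequence after running it through the post-process function.

    Mirrors the proposed `dataset_post_process(data, *, train=...)` API; the
    iterable-of-dicts dataset model is a simplification for this sketch.
    """
    for seq in dataset:
        yield post_process(seq, train=train)


toy_dataset = [{"classes": [1, 2]}, {"classes": [3]}]


def double_labels(data: dict, *, train: bool = False) -> dict:
    return {"classes": [c * 2 for c in data["classes"]]}


result = list(apply_post_process(toy_dataset, double_labels))
print(result)  # [{'classes': [2, 4]}, {'classes': [6]}]
```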
OK, let's use …
One aspect I realized now: where exactly would this be executed? As this is now outside the dataset, …
I would also be interested in this feature. The discussed post-processing solution seems fine to me. However, I would definitely like to have the post-processing parallelizable across multiple procs. At least for now, I have a setup with an …
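One way such parallelization could look, sketched with the standard library's `multiprocessing.Pool` (the squaring transform stands in for a CPU-heavy post-process; this is not a proposal for the actual mechanism):

```python
import multiprocessing


def post_process(seq: dict) -> dict:
    """Stand-in for a CPU-bound per-sequence transform (illustrative only)."""
    return {"data": [x * x for x in seq["data"]]}


def post_process_parallel(seqs: list, num_procs: int = 2) -> list:
    """Run the post-process over all sequences in a pool of worker processes."""
    with multiprocessing.Pool(num_procs) as pool:
        return pool.map(post_process, seqs)


if __name__ == "__main__":
    seqs = [{"data": list(range(i, i + 3))} for i in range(8)]
    out = post_process_parallel(seqs)
    print(out[0])  # {'data': [0, 1, 4]}
```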
Another aspect came up (@Judyxujj): we were interested in implementing mixup in this post-processing function. But this is not really possible with the current design. This additionally needs: …

(Note, I have a mixup implementation, but I did it inside the model, via a …)
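To illustrate why mixup does not fit a purely per-sequence hook: it needs to combine a sequence with other sequences, i.e. it operates on a whole batch (or a buffer of past sequences). A batch-level sketch on numpy arrays (the shapes and the feature-only mixing are assumptions; label mixing is omitted):

```python
import numpy as np


def mixup_batch(feats: np.ndarray, rng: np.random.RandomState, alpha: float = 0.2) -> np.ndarray:
    """Mix each sequence in the batch with a randomly chosen partner sequence.

    `feats` has shape [batch, time, dim]. A per-sequence post-process only
    sees one sequence at a time, so it cannot do this - hence the remark
    that mixup is not possible with the current design.
    """
    lam = rng.beta(alpha, alpha, size=(feats.shape[0], 1, 1))
    partners = feats[rng.permutation(feats.shape[0])]
    return lam * feats + (1.0 - lam) * partners


rng = np.random.RandomState(0)
batch = rng.randn(4, 20, 8).astype("float32")
mixed = mixup_batch(batch, rng)
print(mixed.shape)  # (4, 20, 8)
```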
Examples of post-processing: applying a `Vocabulary` (…) on-the-fly.

Some datasets already have partial support for post-processing. Examples:

- `OggZipDataset` `targets` can be any type of `Vocabulary` (e.g. `BytePairEncoding`, but also `SamplingBytePairEncoding`, or `SentencePieces`). Similarly `ExternSprintDataset` `orth_vocab`.
- `ExtractAudioFeatures` is used in a couple of places, e.g. by `OggZipDataset` `audio`. It also supports `pre_process` (on raw audio) and `post_process` (on audio features).

There was the idea about storing generic raw audio (or maybe even Ogg) inside the HDFDataset. And similarly, there was also the idea about storing the text (UTF-8 bytes) inside the HDFDataset. In both cases, you would then maybe want to transform those into audio features or BPE labels on-the-fly as part of the dataset.
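The text-in-HDF idea can be illustrated with a toy vocabulary that maps stored UTF-8 text to label indices on the fly (a simplified stand-in, not the real `Vocabulary` / `BytePairEncoding` API):

```python
import numpy as np


class ToyVocab:
    """Toy stand-in for a vocabulary that converts raw text to label indices."""

    def __init__(self, labels: list):
        self.label_to_idx = {label: i for i, label in enumerate(labels)}

    def get_seq(self, text: str) -> np.ndarray:
        """Whitespace-tokenize and map each token to its label index."""
        return np.array([self.label_to_idx[tok] for tok in text.split()], dtype=np.int32)


vocab = ToyVocab(["hello", "world"])
raw_bytes = b"hello world hello"  # as it might be stored inside an HDF file
print(vocab.get_seq(raw_bytes.decode("utf8")).tolist())  # [0, 1, 0]
```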
There are multiple options for how to implement this:

1. Just like `OggZipDataset`, extend some other dataset (e.g. `HDFDataset`) by such functionality. But how to do it in a somewhat generic and flexible way? One aspect to keep in mind is that this might also change the dimension or shape of the data. E.g. raw audio to audio features will add one dimension.
2. A wrapper dataset similar to `MetaDataset`. Or maybe make it part of `MetaDataset`?
3. A generic post-processing function, operating on the `TensorDict` (before batching), or on individual data streams. But the distinction of when something should be done as part of the dataset and when it would be done as such post-processing would be kind of arbitrary.
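The two variants of the third option can be sketched side by side; plain dicts of numpy arrays stand in for a `TensorDict`, and all names here are hypothetical:

```python
import numpy as np


def post_process_whole(data: dict) -> dict:
    """Variant A: one function sees the whole (stand-in) TensorDict at once."""
    out = dict(data)
    out["data"] = out["data"] * 2.0
    return out


def post_process_streams(data: dict, stream_funcs: dict) -> dict:
    """Variant B: per-data-stream functions, applied independently per key."""
    return {key: stream_funcs.get(key, lambda x: x)(val) for key, val in data.items()}


seq = {"data": np.ones((4, 2)), "classes": np.array([1, 2, 3])}
a = post_process_whole(seq)
b = post_process_streams(seq, {"data": lambda x: x * 2.0})
print(float(a["data"][0, 0]), b["classes"].tolist())  # 2.0 [1, 2, 3]
```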