
HDFDataset (or generic dataset) post processing #1505

Open
albertz opened this issue Jan 31, 2024 · 15 comments
@albertz
Member

albertz commented Jan 31, 2024

Examples of post-processing:

  • Raw audio is stored in the HDFDataset, do feature extraction on-the-fly (not in the network, but as part of the dataset).
  • Ogg is stored in the HDFDataset, do decoding on-the-fly into raw audio.
  • Text (UTF8 bytes) is stored in the HDFDataset, apply BPE (or any Vocabulary) on-the-fly.

Some datasets already have partial support for post processing. Examples:

  • OggZipDataset targets can be any type of Vocabulary (e.g. BytePairEncoding, but also SamplingBytePairEncoding, or SentencePieces). Similarly ExternSprintDataset orth_vocab.
  • ExtractAudioFeatures is used in a couple of places, e.g. by OggZipDataset audio. It also supports pre_process (on raw audio) and post_process (on audio features).

There was the idea about storing generic raw audio (or maybe even Ogg) inside the HDFDataset. And similarly, there was also the idea about storing the text (UTF8 bytes) inside the HDFDataset. In both cases, you would then maybe want to transform those into audio features or BPE labels on-the-fly as part of the dataset.

There are multiple options for how to implement this:

  • As in OggZipDataset, extend some other dataset (e.g. HDFDataset) with such functionality. But how to do it in a somewhat generic and flexible way? One aspect to keep in mind is that this might also change the dimension or shape of the data. E.g. raw audio to audio features will add one dimension.
  • Make another dedicated dataset just for such transformations, similar to MetaDataset. Or maybe make it part of MetaDataset?
  • Make this a separate mechanism in RETURNN, which we can apply after the dataset logic, e.g. operating on TensorDict (before batching), or on individual data streams. But the distinction between what should be done as part of the dataset and what would be done as such post-processing would be kind of arbitrary.
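To make the third option concrete, here is a minimal sketch of such a post-processing function. It uses a plain dict of numpy arrays as a stand-in for RETURNN's TensorDict, and the framing function is purely illustrative (names and defaults are assumptions, not existing API). It shows the shape-changing case from the first option: raw audio gains a feature dimension.

```python
import numpy as np


def frame_raw_audio(audio: np.ndarray, frame_size: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1D raw-audio array into overlapping frames, adding a new axis."""
    num_frames = 1 + (len(audio) - frame_size) // hop
    return np.stack([audio[i * hop : i * hop + frame_size] for i in range(num_frames)])


def dataset_post_process(data: dict) -> dict:
    """Map the raw 'data' stream to framed features; other streams pass through."""
    out = dict(data)
    out["data"] = frame_raw_audio(data["data"])
    return out


seq = {"data": np.zeros(16000, dtype=np.float32), "classes": np.array([1, 2, 3])}
out = dataset_post_process(seq)
print(out["data"].shape)  # (16000,) raw samples become (98, 400) frames
```

Note that any such mechanism would need to update the declared dim/shape of the data stream accordingly, which is exactly the "one aspect to keep in mind" from the first option above.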
@JackTemaki
Collaborator

I would be very interested in this functionality, but I have not yet put much thought into how this would look best. I am definitely a fan of using MetaDataset for everything, so it would not bother me if it were part of that.

@curufinwe
Collaborator

Commenting on your 3 suggestions in order:

  1. Not very scalable; having to add this to every dataset makes it unnecessarily complicated.
  2. The dataset classes contain more logic than just returning a given sequence. They also contain logic for sequence ordering etc. This is not needed for a postprocessing pipeline. One could make a dataset for post-processing which takes an arbitrary function as input, but then you still have to nest your datasets correctly and the dataset dict will have a lot of nesting. That's not nice, but maybe of secondary concern.
  3. I like the idea of a separate postprocessing function, applied to the dataset output independently of the dataset, the most. It's orthogonal to the data-loading logic and does not make the dataset dict more complicated than it is right now. The only disadvantage I can see is that it adds a new global key to the config that users need to be aware of. They won't find out about it by just browsing the list of available datasets or the documentation of a particular dataset.

@albertz
Member Author

albertz commented Jan 31, 2024

  1. Not very scalable, having to add this to every dataset makes it unnecessarily complicated.

Well, you would factor this out into some common class PostProcessing, and it would be very easy to integrate this into an existing dataset, similar to how Vocabulary or ExtractAudioFeatures are currently easy to integrate. And as a starting point, maybe only HDFDataset, OggZipDataset and/or MetaDataset would have this.

But yes, I kind of agree.

  2. The dataset classes contain more logic than just returning a given sequence. They also contain logic for sequence ordering etc.

Well, we do use combinations/transformations of datasets already, e.g. see MetaDataset, CombinedDataset, ConcatSeqsDataset, etc. Only one of the datasets performs the sequence ordering then, and the other datasets will be given a predefined sequence order list. The API was specifically extended to allow that (init_seq_order(self, epoch=None, seq_list=None, seq_order=None)) and all datasets we use support this well.

So, such postprocessing logic would not really add anything new there - it fits very naturally into how MetaDataset and co work currently.

  3. separate postprocessing function ... Only disadvantage I can see is that it adds a new global key

There is also another aspect which becomes ambiguous: the extern_data. Does this describe the data before or after the post-processing? I.e. does it describe the data which comes out of the dataset, or the data which goes into the network/model? I think there are multiple places in our code which assume these to be the same. E.g. we have some checks that the dataset output is compatible with extern_data. And then, we create such extern_data with the actual batched data to feed it to the model (in TF via feeding the extern_data placeholders, in PT just directly).

Note, we also have the model_outputs, which describes what comes out of the model. So maybe, if there is a custom post-process function, it requires that model_inputs is also defined, and extern_data describes the dataset output? I don't know... Or if the user does not specify model_inputs, it would assume the same format as extern_data (for many post-processing functions, e.g. speed perturbation etc, it would not change the format of the data).

@curufinwe
Collaborator

Regarding 1:
OK, I guess we agree not to do it this way.
Regarding 2:
My main worry is that readability will suffer if we have too many nested datasets, i.e. a PostProcessingDataset wrapping a MetaDataset wrapping multiple HDFDatasets.
Regarding 3:
extern_data should describe the data after post-processing, I think. It's the only type/shape that the model sees. Regarding the checks: I guess they need to be reworked, but there shouldn't be many places where that happens, right?
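The nesting worry from point 2 can be illustrated with a hypothetical config fragment. The class name "PostprocessingDataset" and the "post_process" key are assumptions for illustration only, not existing RETURNN API; the MetaDataset/HDFDataset keys follow the usual dataset-dict style.

```python
# Hypothetical: a post-processing wrapper around a MetaDataset
# which in turn wraps two HDFDatasets - three levels of nesting.
train = {
    "class": "PostprocessingDataset",  # assumed name, does not exist (yet)
    "post_process": "my_post_process_func",
    "dataset": {
        "class": "MetaDataset",
        "datasets": {
            "audio": {"class": "HDFDataset", "files": ["audio.hdf"]},
            "text": {"class": "HDFDataset", "files": ["text.hdf"]},
        },
        "data_map": {"data": ("audio", "data"), "classes": ("text", "data")},
    },
}
```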

@albertz
Member Author

albertz commented Feb 1, 2024

@JackTemaki argued that he (and many others) use MetaDataset anyway, so if this is part of MetaDataset, not much would change in terms of config complexity for them.

Regarding extern_data, yes, you are probably right: mostly extern_data describes the input to the model, and there are not many places where it is checked that the dataset output matches it.

So, how would that variant look like? Just a function in the config, like:

def dataset_post_process(data: TensorDict) -> TensorDict: ...

?
But we have the train dataset, dev/eval datasets, and potentially also other datasets (forwarding, search, some custom user script, ...). Should the same post processing be used in all cases? Or should this be a property per dataset?

@curufinwe
Collaborator

We definitely need to distinguish train and dev datasets. If we do data augmentation for training, we don't necessarily want to do it for cross-validation. If we are doing some sort of format conversion, then it would be needed for both. So in the end this should be a user choice. We can also have the type of dataset (train/dev/...) as one argument to the post-processing function.

@albertz
Member Author

albertz commented Feb 1, 2024

How would the user specify such post-processing function per dataset?

It could be another argument for the dataset itself, so the user specifies it like:

train = {
   ...,
   "post_process": my_train_dataset_post_proc,
}
dev = {
   ...,
   "post_process": my_dev_dataset_post_proc,
}

It's a bit ugly, because Dataset itself would not use this attribute at all; only outside code would use it. But on the other hand, if the user specified it somehow outside of the dataset, as a separate config option, this could easily lead to errors, as you don't necessarily have a well-defined name for the dataset. E.g. like:

dataset_post_process_funcs = {
    "train": my_train_dataset_post_proc,
    "dev": my_dev_dataset_post_proc,
}

This is maybe fine for the training task, but for search or forward, it's ambiguous, and also it doesn't really work if RETURNN is used for scripting and not as a standalone tool.

@curufinwe
Collaborator

Could we add this to (every) engine class? The engine knows what kind of task it performs and what dataloader it uses for that task and could pick the correct post-processing function for that task.

@albertz
Member Author

albertz commented Feb 2, 2024

The post-processing function is not per task but per dataset. At least that is what I wrote above.

Or do you want to have it per task? But I guess you don't really want it per task, but rather depending on whether you train or eval? Or maybe a generic dataset_post_process_func, but with an additional extra argument train, which specifies the train flag (similar to how dropout etc. work, depending on this flag)? So like this:

def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...

@curufinwe
Collaborator

Sorry, I was not precise enough. What I meant was that in the engine class you know what the dataset is used for and which name in the config it comes from (I hope). Then one can select the correct postprocessing function to go with it (by picking it from the dict you showed above with the correct key).
But that might indeed be too complicated, and having the train flag could already be enough.

@albertz
Member Author

albertz commented Feb 2, 2024

in the engine class you know what the dataset is used for and which name in the config it comes from (I hope)

No, you don't. E.g. we have this API for forward:

def forward_with_callback(self, *, dataset: Dataset, callback: ForwardCallbackIface): ...

Or this API for the init (including training):

    def init_train_from_config(
        self,
        config: Optional[Config] = None,
        train_data: Optional[Dataset] = None,
        dev_data: Optional[Dataset] = None,
        eval_data: Optional[Dataset] = None,
    ): ...

It is handled by our __main__ to init the datasets here (train_data etc).
(Although, it's a bit inconsistent. The engine anyway has a reference to the config, and in init_train_from_config, it also checks eval_datasets directly from the config, to potentially init further eval datasets.)

But yes, so I guess we can simply use this API:

def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...

@curufinwe
Collaborator

OK, let's use
def dataset_post_process(data: TensorDict, *, train: bool = False) -> TensorDict: ...
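To illustrate the agreed-upon signature, here is a minimal sketch using a plain dict of numpy arrays as a stand-in for RETURNN's TensorDict. The augmentation itself (additive Gaussian noise) is just a placeholder example; eval data passes through unchanged.

```python
import numpy as np


def dataset_post_process(data: dict, *, train: bool = False) -> dict:
    """Apply a simple augmentation only when train=True; otherwise identity."""
    out = dict(data)
    if train:
        rng = np.random.default_rng()
        noise = 0.01 * rng.standard_normal(data["data"].shape)
        out["data"] = (data["data"] + noise).astype(data["data"].dtype)
    return out


seq = {"data": np.ones(8, dtype=np.float32)}
# identity in eval mode, perturbed in train mode:
assert np.array_equal(dataset_post_process(seq, train=False)["data"], seq["data"])
assert not np.array_equal(dataset_post_process(seq, train=True)["data"], seq["data"])
```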

@albertz
Member Author

albertz commented Feb 6, 2024

One aspect I realized now: where exactly would this be executed? As this is now outside the dataset, MultiProcDataset cannot really make use of it, so it cannot be parallelized over the workers of MultiProcDataset. That's a bit suboptimal, at least for my setup. But it can still run somewhere inside the PyTorch pipeline, i.e. be parallelized over the PyTorch DataLoader worker instances, although most people don't parallelize there (num_workers 0 or 1).

@vieting
Contributor

vieting commented Feb 21, 2024

I would also be interested in this feature. The discussed post-processing solution seems fine to me. However, I would definitely like the post-processing to be parallelizable across multiple processes. At least now, I have a setup with an OggZipDataset using pre_process which is wrapped by a MultiProcDataset. The MultiProcDataset gave a clear improvement in computing time and therefore a general speedup of training, so this would be very helpful. Maybe we could additionally allow specifying dataset_post_process_opts, which could contain similar args as MultiProcDataset takes and automatically apply similar multi-processing logic.
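The general shape of that parallelization idea can be sketched as below. This is not RETURNN API; the function name and pool choice are assumptions. A real implementation would more likely use subprocess workers as in MultiProcDataset, since a thread pool only helps when the post-process function releases the GIL (e.g. numpy/FFT-heavy feature extraction).

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_post_process(seqs, post_process_func, num_workers=4):
    """Apply a per-sequence post-processing function with a worker pool.

    map() preserves input order, which matters because the sequence order
    was already decided by the dataset's seq-ordering logic.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        yield from pool.map(post_process_func, seqs)


seqs = [{"data": [i]} for i in range(10)]
out = list(parallel_post_process(seqs, lambda s: {"data": [s["data"][0] * 2]}))
print([s["data"][0] for s in out])  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```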

@albertz
Member Author

albertz commented Apr 11, 2024

Another aspect came up (@Judyxujj): we were interested in implementing mixup in this post-processing function. But this is not really possible with the current design. It additionally needs:

  • Some way to store state (the buffer of past sequences)
  • Extra code to synchronize states (in distributed training settings, but also when multi processing is done)
  • Store the state on disk? Or just reset it at the beginning of each epoch?

(Note, I have a mixup implementation, but I did it inside the model, via a nn.Module (or rf.Module). The state is handled in the same way as e.g. nn.BatchNorm.)
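A rough sketch of why mixup breaks the stateless design: the callable needs to carry a buffer of past sequences, which the plain dataset_post_process(data, train=...) function signature cannot hold. All names here are illustrative, and the hard parts listed above (state sync across distributed/multi-proc workers, persistence/reset) are deliberately not handled.

```python
import numpy as np


class MixupPostProcess:
    """Stateful post-process: mix the current sequence with a buffered past one."""

    def __init__(self, buffer_size: int = 100, alpha: float = 0.2, seed: int = 0):
        self.buffer = []  # past feature sequences - the state in question
        self.buffer_size = buffer_size
        self.alpha = alpha
        self.rng = np.random.default_rng(seed)

    def __call__(self, data: dict, *, train: bool = False) -> dict:
        out = dict(data)
        feats = data["data"]
        if train and self.buffer:
            other = self.buffer[self.rng.integers(len(self.buffer))]
            n = min(len(feats), len(other))  # sequences can have different lengths
            lam = self.rng.beta(self.alpha, self.alpha)
            out["data"] = feats.copy()
            out["data"][:n] = lam * feats[:n] + (1.0 - lam) * other[:n]
        self.buffer.append(feats)
        del self.buffer[: -self.buffer_size]  # keep at most buffer_size entries
        return out


pp = MixupPostProcess(seed=1)
a = pp({"data": np.ones((5, 3), dtype=np.float32)}, train=True)   # buffer empty: unchanged
b = pp({"data": np.zeros((5, 3), dtype=np.float32)}, train=True)  # mixed with buffered seq
```

Even this toy version shows the issue: reproducibility and correctness now depend on where the object lives, since each MultiProcDataset or DataLoader worker would otherwise accumulate its own independent buffer.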
