
Dataset batching like ESPnet support #1506

Open
albertz opened this issue Jan 31, 2024 · 0 comments

albertz commented Jan 31, 2024

ESPnet basically does it like this:

  • Sort the whole dataset. (The dataset could maybe be stored directly in a sorted way, which would speed up the random access later.)
  • Create batches for the whole dataset (or just some meta information on which sequences would go together in a batch). Since neighboring sequences have similar lengths after sorting, there is only minimal padding, and we also know the number of batches in advance.
  • Shuffle the order of the batches, i.e. randomly sample from the batches.
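
A minimal sketch of this scheme in plain Python (the function name, the `max_frames_per_batch` parameter and the greedy packing rule are just for illustration, not the actual ESPnet implementation):

```python
import random


def make_sorted_batches(seq_lens, max_frames_per_batch, rng=None):
    """Sort -> batch -> shuffle, as described above.

    seq_lens: list mapping seq index -> length (e.g. number of frames).
    Returns a list of batches, each a list of seq indices.
    """
    rng = rng or random.Random(42)
    # 1. Sort all sequences by length.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    # 2. Greedily pack the sorted sequences into batches.
    #    Neighbors in the sorted order have similar lengths, so padding is
    #    minimal, and the total number of batches is known in advance.
    batches, cur = [], []
    for i in order:
        # Padded batch size if we add seq i
        # (seqs are sorted, so seq i is the longest one in the batch).
        if cur and seq_lens[i] * (len(cur) + 1) > max_frames_per_batch:
            batches.append(cur)
            cur = []
        cur.append(i)
    if cur:
        batches.append(cur)
    # 3. Shuffle the order of the batches; seqs within a batch stay together.
    rng.shuffle(batches)
    return batches
```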

It's not clear whether this is really better than what we do, but in any case, it would be good if we could support this scheme as well, just for comparison.

For reference, the ESPnet code:

The question is how to implement this in RETURNN. Our shuffling logic currently happens before the batching, not after.

Some options:

  • Some offline processing of the dataset, which builds such a table for the list of batches (see the sketch after this list).
    • The dataset sequence order would use this table, and shuffle only across batches, i.e. keep seqs of the same batch together.
    • The batching logic would also use this table.
  • The dataset sequence ordering has another mode specifically for this. But then this must know exactly about the batching logic/parameters (batch size etc.). In the batching afterwards, we should probably also have some sanity checks that the batching comes out as the sequence ordering expected.
  • The dataset sequence ordering has another mode for this, and we also introduce a new way for the dataset sequence ordering to already prepare the batches and store them in some new list, e.g. _seq_order_prepared_batches.
  • Maybe some new way for the user to provide custom sequence ordering + batching in a combined way, via some custom user function, where it is all up to the user.
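
To illustrate the first option, here is a hypothetical sketch of how the dataset sequence order could be derived from such a precomputed batch table (`batch_table`, `seq_order_from_batch_table` and the epoch-seeded RNG are made-up names, not existing RETURNN API):

```python
import random


def seq_order_from_batch_table(batch_table, epoch):
    """batch_table: precomputed list of batches, each a list of seq indices.

    Returns a flat sequence order that shuffles across batches but keeps
    seqs of the same batch adjacent, so that a later batching step which
    reads the same table can reassemble exactly the same batches.
    """
    rng = random.Random(epoch)  # deterministic per-epoch shuffling
    batches = list(batch_table)
    rng.shuffle(batches)
    return [seq_idx for batch in batches for seq_idx in batch]
```

The batching step afterwards would read the same table and could assert that the batch boundaries it produces match the table, which would cover the sanity checks mentioned above.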