Description & Motivation
It's not uncommon to have to update the data one is training on, computing embeddings for, etc. We could support appending data to an optimized (chunked) dataset.
Appending alone is sufficient: removal is more specific and can be performed by rerunning through the samples with a map and creating a new dataset from it.
Pitch
When we map or optimize specifying a target location, add the ability to append to existing chunks (through a `mode="append"` argument or something along these lines).
Alternatives
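Before weighing the alternatives, a minimal sketch of what the proposed `mode="append"` could mean at the chunk level: fill the trailing partial chunk before opening new ones. All names here (the `optimize` signature, the list-of-lists chunk layout) are illustrative assumptions, not litdata's actual API.

```python
# Hypothetical sketch: an optimized dataset is a list of fixed-size chunks,
# and mode="append" completes the last partial chunk before adding new ones.
CHUNK_SIZE = 4

def optimize(samples, chunks=None, mode="overwrite"):
    """Write `samples` into chunks of at most CHUNK_SIZE items."""
    if chunks is None or mode == "overwrite":
        chunks = []
    elif mode != "append":
        raise ValueError(f"unknown mode: {mode!r}")
    for sample in samples:
        if chunks and len(chunks[-1]) < CHUNK_SIZE:
            chunks[-1].append(sample)  # fill the trailing partial chunk
        else:
            chunks.append([sample])    # start a fresh chunk
    return chunks

chunks = optimize(range(6))                           # [[0, 1, 2, 3], [4, 5]]
chunks = optimize(range(6, 10), chunks, mode="append")
# The partial chunk [4, 5] is completed before a new chunk is opened:
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```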
The alternative is creating a new dataset and composing it with the previous one during iteration. However, composition is not trivial: we need to make sure we draw each sample exactly once from each dataset while avoiding bumping into StopIteration. We would need to add a specific mode to the composed dataset.
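For illustration, one way such a composed iteration could work: round-robin over the datasets until each is exhausted exactly once, swallowing the per-iterator StopIteration instead of letting it escape. This is a hypothetical sketch, not an existing litdata mode.

```python
# Hypothetical sketch of composed iteration over several datasets.
_SENTINEL = object()

def compose(*datasets):
    """Yield every sample from every dataset exactly once (round-robin)."""
    iterators = [iter(d) for d in datasets]
    while iterators:
        for it in list(iterators):
            sample = next(it, _SENTINEL)  # never raises StopIteration
            if sample is _SENTINEL:
                iterators.remove(it)      # this dataset is exhausted
            else:
                yield sample

print(list(compose([1, 2, 3], ["a", "b"])))  # [1, 'a', 2, 'b', 3]
```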
If the added data is misaligned with the chunk size and appending happens often, it would create a suboptimal dataset after a while, which would need to be compacted into a single one by iterating sequentially. This could be a further alternative: a utility that takes a list of datasets and creates a single dataset by iterating through all of them.
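The compaction utility floated above could be sketched as follows: stream the fragmented datasets sequentially and re-chunk them into uniformly packed chunks. The `compact` name and chunk representation are assumptions for the sake of the sketch.

```python
# Hypothetical sketch: merge several datasets into one with packed chunks.
from itertools import chain, islice

def compact(datasets, chunk_size=4):
    """Iterate `datasets` sequentially and repack into uniform chunks."""
    stream = chain.from_iterable(datasets)
    chunks = []
    while True:
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# Two fragmented datasets become one with fully packed chunks:
print(compact([[0, 1, 2], [3, 4, 5, 6]]))  # [[0, 1, 2, 3], [4, 5, 6]]
```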
Additional context
No response
cc @Borda @tchaton
Moved from Lightning-AI/pytorch-lightning#19519, submitted by @lantiga