Description & Motivation
It's not uncommon to have to update the data one is training on, computing embeddings for, etc. We could support appending data to an optimized (chunked) dataset.
Appending alone is sufficient: removal is more specific and can be performed by rerunning through the samples with a map and creating a new dataset from it.
Pitch
When we map or optimize specifying a target location, add the ability to append to existing chunks (through a `mode="append"` argument or something along these lines).
Alternatives
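Before weighing the alternatives, a minimal sketch of what the proposed `mode="append"` could mean at the chunk level: fill the trailing partial chunk before opening new ones. All names here (the `optimize` signature, the list-of-lists chunk layout) are illustrative assumptions, not litdata's actual API.

```python
# Hypothetical sketch: an optimized dataset is a list of fixed-size chunks,
# and mode="append" completes the last partial chunk before adding new ones.
CHUNK_SIZE = 4

def optimize(samples, chunks=None, mode="overwrite"):
    """Write `samples` into chunks of at most CHUNK_SIZE items."""
    if chunks is None or mode == "overwrite":
        chunks = []
    elif mode != "append":
        raise ValueError(f"unknown mode: {mode!r}")
    for sample in samples:
        if chunks and len(chunks[-1]) < CHUNK_SIZE:
            chunks[-1].append(sample)  # fill the trailing partial chunk
        else:
            chunks.append([sample])    # start a fresh chunk
    return chunks

chunks = optimize(range(6))                           # [[0, 1, 2, 3], [4, 5]]
chunks = optimize(range(6, 10), chunks, mode="append")
# The partial chunk [4, 5] is completed before a new chunk is opened:
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```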
The alternative is creating a new dataset and composing it with the previous one during iteration. However, composition is not trivial: we need to make sure we draw each sample exactly once from each dataset while avoiding bumping into StopIteration. We would need to add a specific mode to the composed dataset.
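For illustration, one way such a composed iteration could work: round-robin over the datasets until each is exhausted exactly once, swallowing the per-iterator StopIteration instead of letting it escape. This is a hypothetical sketch, not an existing litdata mode.

```python
# Hypothetical sketch of composed iteration over several datasets.
_SENTINEL = object()

def compose(*datasets):
    """Yield every sample from every dataset exactly once (round-robin)."""
    iterators = [iter(d) for d in datasets]
    while iterators:
        for it in list(iterators):
            sample = next(it, _SENTINEL)  # never raises StopIteration
            if sample is _SENTINEL:
                iterators.remove(it)      # this dataset is exhausted
            else:
                yield sample

print(list(compose([1, 2, 3], ["a", "b"])))  # [1, 'a', 2, 'b', 3]
```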
If the added data is misaligned with the chunk size and appending happens often, it would create a suboptimal dataset after a while, which would need to be compacted into a single one by iterating sequentially. This could be a further alternative: a utility that takes a list of datasets and creates a single dataset by iterating through all of them.
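The compaction utility floated above could be sketched as follows: stream the fragmented datasets sequentially and re-chunk them into uniformly packed chunks. The `compact` name and chunk representation are assumptions for the sake of the sketch.

```python
# Hypothetical sketch: merge several datasets into one with packed chunks.
from itertools import chain, islice

def compact(datasets, chunk_size=4):
    """Iterate `datasets` sequentially and repack into uniform chunks."""
    stream = chain.from_iterable(datasets)
    chunks = []
    while True:
        chunk = list(islice(stream, chunk_size))
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# Two fragmented datasets become one with fully packed chunks:
print(compact([[0, 1, 2], [3, 4, 5, 6]]))  # [[0, 1, 2, 3], [4, 5, 6]]
```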
Additional context
No response
cc @Borda @tchaton
Moved from Lightning-AI/pytorch-lightning#19519, submitted by @lantiga