
Assert when deserializing no_header_numpy or no_header_tensor. #92

Open
ouj opened this issue Apr 4, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments


ouj commented Apr 4, 2024

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Create/serialize a dataset with integer tensor or numpy.
  2. Read/deserialize the created dataset.

Code sample

from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)

Expected behavior

Read and print the batch data.

Environment

  • PyTorch Version (e.g., 1.0): 2.1.2
  • OS (e.g., Linux): MacOS and Linux
  • How you installed PyTorch (conda, pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information:

Additional context

Assert stack

Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError
ouj added the bug and help wanted labels Apr 4, 2024

github-actions bot commented Apr 4, 2024

Hi! Thanks for your contribution, great first issue!


ouj commented Apr 4, 2024

Looks like the setup() method on NoHeaderTensorSerializer and NoHeaderNumpySerializer wasn't called before deserialize was called.
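For context, these no-header serializers store only the raw buffer and rely on a dtype that setup() records on the instance, which is why deserialize() trips the assert self._dtype check when that step is skipped. Below is a minimal sketch of that pattern (illustrative only, not litdata's actual implementation; the class name and method bodies are assumptions):

import numpy as np


class NoHeaderNumpySerializerSketch:
    # Illustrative stand-in for a serializer that omits the per-item dtype header.

    def __init__(self):
        self._dtype = None  # set by setup(), required by deserialize()

    def setup(self, dtype):
        # Record the dtype once, out of band, instead of inside each serialized item.
        self._dtype = np.dtype(dtype)

    def serialize(self, array):
        # Write only the raw bytes; no shape or dtype header.
        return array.tobytes()

    def deserialize(self, data):
        # Fails exactly like the traceback above when setup() was never called.
        assert self._dtype
        return np.frombuffer(data, dtype=self._dtype)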


ouj commented Apr 4, 2024

Okay... found a workaround. The problem is that the numpy array is a 1D array.

The fix is to reshape it to a 2D array to create a "header"? 🤯

np.arange(10).reshape(1, -1)
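Applied to the reproduction script above, only random_images needs to change (a sketch of the workaround; the rest of the script stays as posted):

def random_images(index):
    data = {
        "index": index,  # int data type
        # Reshape the 1D array to 2D so the 1D special case is avoided (per the workaround above).
        "class": np.arange(1, 100).reshape(1, -1),
    }
    return data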


tchaton commented Apr 5, 2024

Hey @ouj. Yes, 1D data is handled differently to support tokens for training LLMs. This isn't nice behaviour; I meant to provide a better mechanism but never got to it.
