
Assert when deserializing no_header_numpy or no_header_tensor. #92

Open
ouj opened this issue Apr 4, 2024 · 4 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments


ouj commented Apr 4, 2024

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Create/serialize a dataset with integer tensor or numpy.
  2. Read/deserialize the created dataset.

Code sample

from litdata import optimize
import numpy as np
from litdata.streaming import StreamingDataLoader, StreamingDataset


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)

Expected behavior

Read and print the batch data.

Environment

  • PyTorch Version (e.g., 1.0): 2.1.2
  • OS (e.g., Linux): MacOS and Linux
  • How you installed PyTorch (conda, pip, source): pip install
  • Build command you used (if compiling from source):
  • Python version: 3.11
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information:

Additional context

Assert stack

Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError
ouj added the bug and help wanted labels Apr 4, 2024

github-actions bot commented Apr 4, 2024

Hi! Thanks for your contribution, great first issue!


ouj commented Apr 4, 2024

Looks like the setup() method on NoHeaderTensorSerializer and NoHeaderNumpySerializer wasn't called before deserialize was called.
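For context, these no-header serializers store only the raw buffer and rely on a dtype that setup() records on the instance, which is why deserialize() trips the assert self._dtype check when that step is skipped. Below is a minimal sketch of that pattern (illustrative only, not litdata's actual implementation; the class name and method bodies are assumptions):

import numpy as np


class NoHeaderNumpySerializerSketch:
    # Illustrative stand-in for a serializer that omits the per-item dtype header.

    def __init__(self):
        self._dtype = None  # set by setup(), required by deserialize()

    def setup(self, dtype):
        # Record the dtype once, out of band, instead of inside each serialized item.
        self._dtype = np.dtype(dtype)

    def serialize(self, array):
        # Write only the raw bytes; no shape or dtype header.
        return array.tobytes()

    def deserialize(self, data):
        # Fails exactly like the traceback above when setup() was never called.
        assert self._dtype
        return np.frombuffer(data, dtype=self._dtype)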


ouj commented Apr 4, 2024

Okay... found a workaround. The problem is that the numpy array is a 1D array.

The fix is to reshape it to a 2D array to create a "header"? 🤯

np.arange(10).reshape(1, -1)
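Applied to the reproduction script above, only random_images needs to change (a sketch of the workaround; the rest of the script stays as posted):

def random_images(index):
    data = {
        "index": index,  # int data type
        # Reshape the 1D array to 2D so the 1D special case is avoided (per the workaround above).
        "class": np.arange(1, 100).reshape(1, -1),
    }
    return data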


tchaton commented Apr 5, 2024

Hey @ouj. Yes, 1D data is handled differently to support tokens for training LLMs. This isn't nice behaviour; I meant to provide a better mechanism but never got to it.
