Create/serialize a dataset that contains an integer and a numpy array.
Read/deserialize the created dataset.
Code sample
```python
from litdata import optimize
from litdata.streaming import StreamingDataLoader, StreamingDataset
import numpy as np


def random_images(index):
    data = {
        "index": index,  # int data type
        "class": np.arange(1, 100),  # numpy array data type
    }
    # The data is serialized into bytes and stored into data chunks by the optimize operator.
    return data


if __name__ == "__main__":
    optimize(
        fn=random_images,  # The function applied over each input.
        inputs=list(range(10)),  # Provide any inputs. The fn is applied on each item.
        output_dir="my_optimized_dataset",  # The directory where the optimized data are stored.
        num_workers=0,  # The number of workers. The inputs are distributed among them.
        chunk_bytes="64MB",  # The maximum number of bytes to write into a data chunk.
    )

    dataset = StreamingDataset("my_optimized_dataset", shuffle=False, drop_last=False)
    dataloader = StreamingDataLoader(
        dataset,
        num_workers=0,
        batch_size=1,
        drop_last=False,
        shuffle=False,
    )

    for data in dataloader:
        print(data)
```
Expected behavior
Read and print the batch data.
Environment
PyTorch Version (e.g., 1.0): 2.1.2
OS (e.g., Linux): MacOS and Linux
How you installed PyTorch (conda, pip, source): pip install
Build command you used (if compiling from source):
Python version: 3.11
CUDA/cuDNN version: N/A
GPU models and configuration: N/A
Any other relevant information:
Additional context
Assert stack
```
Traceback (most recent call last):
  File "/Users/jou2/work/./test_optimize.py", line 33, in <module>
    for data in dataloader:
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataloader.py", line 598, in __iter__
    for batch in super().__iter__():
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
    data.append(next(self.dataset_iter))
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 298, in __next__
    data = self.__getitem__(
           ^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 268, in __getitem__
    return self.cache[index]
           ~~~~~~~~~~^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/cache.py", line 135, in __getitem__
    return self._reader.read(index)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin, chunk_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 110, in load_item_from_chunk
    return self.deserialize(data)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 129, in deserialize
    data.append(serializer.deserialize(data_bytes))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/litdata/streaming/serializers.py", line 261, in deserialize
    assert self._dtype
AssertionError
```
Okay... found a workaround. The problem is that the numpy array is a 1D array.
The fix is to reshape it to a 2D array so that a "header" is created? 🤯

```python
np.arange(10).reshape(1, -1)
```
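Applied to the repro above, the workaround looks like this (a hedged sketch: the reshape reportedly routes the array through litdata's regular numpy serializer, which stores a header, instead of the header-less 1D path that triggers the assert):

```python
import numpy as np

# The 1D array from the repro hits litdata's header-less numpy path on
# write, and the reader then asserts because no dtype header was stored.
arr_1d = np.arange(1, 100)       # shape (99,)  -> fails on read
arr_2d = arr_1d.reshape(1, -1)   # shape (1, 99) -> round-trips fine

print(arr_1d.shape, arr_2d.shape)
```

The data is unchanged; only the shape metadata differs, so a `.reshape(-1)` on read recovers the original 1D array.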
Borda changed the title to "Assert when deserializing no_header_numpy or no_header_tensor." on Apr 5, 2024
Hey @ouj. Yes, 1D data is handled differently in order to support token sequences for LLM training. This isn't nice behaviour; I meant to provide a better mechanism but never got to it.
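To illustrate the trade-off being described, here is a hypothetical sketch of the two layouts (this is NOT litdata's actual code; function names and the header format are invented for illustration). The regular path stores dtype and shape in a small per-item header, while a header-less token path stores raw bytes only and relies on a dtype configured elsewhere, which is exactly what the failing `assert self._dtype` checks:

```python
import numpy as np

def serialize_with_header(arr: np.ndarray) -> bytes:
    # dtype name, ndim, and shape, followed by the raw buffer
    dtype = str(arr.dtype).encode()
    header = len(dtype).to_bytes(4, "little") + dtype
    header += arr.ndim.to_bytes(4, "little")
    for dim in arr.shape:
        header += dim.to_bytes(8, "little")
    return header + arr.tobytes()

def serialize_no_header(arr: np.ndarray) -> bytes:
    # raw bytes only: compact for token streams, but the reader
    # cannot recover the dtype from the payload itself
    return arr.tobytes()

def deserialize_no_header(data: bytes, dtype=None) -> np.ndarray:
    assert dtype is not None  # mirrors the AssertionError in serializers.py
    return np.frombuffer(data, dtype=dtype)
```

The header-less form saves a few bytes per item, which matters when a dataset is billions of short token arrays, but it only works if the reader is told the dtype out of band.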