
Compression using the optimize function from litdata #97

Open
rakro101 opened this issue Apr 11, 2024 · 5 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@rakro101

rakro101 commented Apr 11, 2024

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Use the Studio at https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming.

Modify it to use litdata instead of lightning.data.

Pass any compression method, for example "zstd", to the optimize function in convert.py.

Code sample

import os

from litdata import optimize

optimize(
    convert_parquet_to_lightning_data,
    parquet_files[:10],
    output_dir,
    num_workers=os.cpu_count(),
    chunk_bytes="64MB",
    compression="gzip",
)
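The crash below comes from the compression backend not being importable, so it can help to probe for it before calling optimize. A minimal, stdlib-only sketch; the mapping of codec names to PyPI packages is an assumption, not litdata's actual registry:

```python
import importlib.util


def compression_available(name: str) -> bool:
    """Return True if the Python package assumed to back a codec is importable."""
    # Assumption: "zstd" is provided by the zstd PyPI package; gzip ships with
    # the standard library, so it is always present.
    packages = {"zstd": "zstd", "gzip": "gzip"}
    return importlib.util.find_spec(packages.get(name, name)) is not None


print(compression_available("gzip"))  # True on any standard CPython install
```

Checking availability up front turns a worker-process traceback into a clear error message at the call site.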

Expected behavior

Creation of the compressed shards.

Actual error:

Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 426, in run
    self._setup()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 436, in _setup
    self._create_cache()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 511, in _create_cache
    self.cache = Cache(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/cache.py", line 65, in __init__
    self._writer = BinaryWriter(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/writer.py", line 85, in __init__
    raise ValueError("No compresion algorithms are installed.")
ValueError: No compresion algorithms are installed.
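The ValueError is raised in BinaryWriter's constructor when no compression library can be imported. A simplified sketch of that guard; the codec registry and probing logic here are assumptions for illustration, not litdata's actual code:

```python
import importlib.util
from typing import Optional

# Hypothetical registry of codec name -> backing PyPI package, mirroring the
# kind of check that produces "No compresion algorithms are installed."
_CODECS = {"zstd": "zstd"}


def create_writer(compression: Optional[str]) -> dict:
    """Refuse a compression codec whose backing package is not installed."""
    installed = {c for c, pkg in _CODECS.items() if importlib.util.find_spec(pkg)}
    if compression is not None and compression not in installed:
        # Installing the backing package (e.g. pip install zstd) clears this.
        raise ValueError("No compression algorithms are installed.")
    return {"compression": compression}
```

This explains why installing mosaicml-streaming (which pulls in a zstd package as a dependency) makes the error disappear: the probe suddenly finds an importable backend.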

Environment

Lightning AI Studio, with litdata installed via pip install litdata

Additional context

Check the size of the dataset (compressed and uncompressed). In my first implementation on AWS, I got the same size for both.

@rakro101 rakro101 added bug Something isn't working help wanted Extra attention is needed labels Apr 11, 2024

Hi! Thanks for your contribution, great first issue!

@rakro101
Author

Using pip install mosaicml-streaming resolves the error above; perhaps the corresponding dependencies should be added to litdata.

@rakro101
Author

rakro101 commented Apr 11, 2024

Then, using zstd and executing stream.py, data processing completes ("Finished data processing!"), but reading an item fails:
⚡ ~ /home/zeus/miniconda3/envs/cloudspace/bin/python /teamspace/studios/this_studio/stream.py
8200

Traceback (most recent call last):
  File "/teamspace/studios/this_studio/stream.py", line 19, in <module>
    print(f'{dataset[0]}')
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/dataset.py", line 244, in __getitem__
    return self.cache[index]
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/cache.py", line 132, in __getitem__
    return self._reader.read(index)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 104, in load_item_from_chunk
    return self.deserialize(data)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 116, in deserialize
    return tree_unflatten(data, self._config["data_spec"])
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/_pytree.py", line 261, in tree_unflatten
    raise ValueError(
ValueError: tree_unflatten(values, spec): `values` has length 0 but the spec refers to a pytree that holds 4 items (TreeSpec(tuple, None, [*,
  *,
  *,
  *]))
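The tree_unflatten failure means the reader handed zero deserialized values to a spec that expects four, i.e. the compressed chunk bytes were apparently never decompressed before deserialization. A stdlib-only analogue of the length check (torch.utils._pytree does essentially this, though the real implementation differs):

```python
def unflatten(values, num_expected):
    """Minimal stand-in for tree_unflatten's values-vs-spec length check."""
    if len(values) != num_expected:
        raise ValueError(
            f"tree_unflatten(values, spec): `values` has length {len(values)} "
            f"but the spec refers to a pytree that holds {num_expected} items"
        )
    return tuple(values)


# Deserializing a still-compressed chunk yields no items, reproducing the error:
try:
    unflatten([], 4)
except ValueError as e:
    print(e)
```

So the symptom is downstream of the real bug: the item loader received raw compressed bytes and extracted nothing from them.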

@tchaton
Collaborator

tchaton commented Apr 11, 2024

Hey @rakro101, I published a new version. Can you try again?

@rakro101
Author

@tchaton it works now, but the file extension should be .zstd instead of .bin.
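The remaining nit is cosmetic: shard filenames do not reveal the codec. A hypothetical naming helper sketching the convention the reporter asks for (this is not litdata's actual filename logic; the chunk-{rank}-{index} pattern is assumed for illustration):

```python
from typing import Optional


def chunk_filename(rank: int, index: int, compression: Optional[str]) -> str:
    """End compressed shards in the codec name (e.g. .zstd), plain shards in .bin."""
    ext = compression if compression else "bin"
    return f"chunk-{rank}-{index}.{ext}"


print(chunk_filename(0, 0, "zstd"))  # chunk-0-0.zstd
print(chunk_filename(0, 0, None))    # chunk-0-0.bin
```

Encoding the codec in the extension lets readers (and humans inspecting the output directory) tell compressed shards apart without opening them.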
