
Compression using the optimize function from litdata #97

Open
rakro101 opened this issue Apr 11, 2024 · 5 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@rakro101

rakro101 commented Apr 11, 2024

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Use the Studio at https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming.

Modify it to use litdata instead of lightning.data.

Pass any compression method, for example "zstd", to the optimize function in convert.py.

Code sample

import os

from litdata import optimize

optimize(
    convert_parquet_to_lightning_data,
    parquet_files[:10],
    output_dir,
    num_workers=os.cpu_count(),
    chunk_bytes="64MB",
    compression="gzip",
)
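The crash below comes from the compression backend not being importable, so it can help to probe for it before calling optimize. A minimal, stdlib-only sketch; the mapping of codec names to PyPI packages is an assumption, not litdata's actual registry:

```python
import importlib.util


def compression_available(name: str) -> bool:
    """Return True if the Python package assumed to back a codec is importable."""
    # Assumption: "zstd" is provided by the zstd PyPI package; gzip ships with
    # the standard library, so it is always present.
    packages = {"zstd": "zstd", "gzip": "gzip"}
    return importlib.util.find_spec(packages.get(name, name)) is not None


print(compression_available("gzip"))  # True on any standard CPython install
```

Checking availability up front turns a worker-process traceback into a clear error message at the call site.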

Expected behavior

Creation of the compressed shards.

Actual error:

Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 426, in run
    self._setup()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 436, in _setup
    self._create_cache()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/processing/data_processor.py", line 511, in _create_cache
    self.cache = Cache(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/cache.py", line 65, in __init__
    self._writer = BinaryWriter(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litdata/streaming/writer.py", line 85, in __init__
    raise ValueError("No compresion algorithms are installed.")
ValueError: No compresion algorithms are installed.
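The ValueError is raised in BinaryWriter's constructor when no compression library can be imported. A simplified sketch of that guard; the codec registry and probing logic here are assumptions for illustration, not litdata's actual code:

```python
import importlib.util
from typing import Optional

# Hypothetical registry of codec name -> backing PyPI package, mirroring the
# kind of check that produces "No compresion algorithms are installed."
_CODECS = {"zstd": "zstd"}


def create_writer(compression: Optional[str]) -> dict:
    """Refuse a compression codec whose backing package is not installed."""
    installed = {c for c, pkg in _CODECS.items() if importlib.util.find_spec(pkg)}
    if compression is not None and compression not in installed:
        # Installing the backing package (e.g. pip install zstd) clears this.
        raise ValueError("No compression algorithms are installed.")
    return {"compression": compression}
```

This explains why installing mosaicml-streaming (which pulls in a zstd package as a dependency) makes the error disappear: the probe suddenly finds an importable backend.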

Environment

Lightning AI Studio, with litdata installed via pip install litdata

Additional context

Check the size of the dataset (compressed and uncompressed). In my first implementation on AWS, I got the same size for both.

@rakro101 rakro101 added bug Something isn't working help wanted Extra attention is needed labels Apr 11, 2024

Hi! Thanks for your contribution, great first issue!

@rakro101
Author

Using pip install mosaicml-streaming resolves the error above; perhaps the corresponding dependencies should be added to litdata.

@rakro101
Author

rakro101 commented Apr 11, 2024

Then, using zstd and executing stream.py, data processing completes ("Finished data processing!"), but reading an item fails:
⚡ ~ /home/zeus/miniconda3/envs/cloudspace/bin/python /teamspace/studios/this_studio/stream.py
8200

Traceback (most recent call last):
  File "/teamspace/studios/this_studio/stream.py", line 19, in <module>
    print(f'{dataset[0]}')
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/dataset.py", line 244, in __getitem__
    return self.cache[index]
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/cache.py", line 132, in __getitem__
    return self._reader.read(index)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/reader.py", line 252, in read
    item = self._item_loader.load_item_from_chunk(index.index, index.chunk_index, chunk_filepath, begin)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 104, in load_item_from_chunk
    return self.deserialize(data)
  File "/teamspace/studios/this_studio/pytorch-lightning/src/lightning/data/streaming/item_loader.py", line 116, in deserialize
    return tree_unflatten(data, self._config["data_spec"])
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/_pytree.py", line 261, in tree_unflatten
    raise ValueError(
ValueError: tree_unflatten(values, spec): `values` has length 0 but the spec refers to a pytree that holds 4 items (TreeSpec(tuple, None, [*,
  *,
  *,
  *]))
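The tree_unflatten failure means the reader handed zero deserialized values to a spec that expects four, i.e. the compressed chunk bytes were apparently never decompressed before deserialization. A stdlib-only analogue of the length check (torch.utils._pytree does essentially this, though the real implementation differs):

```python
def unflatten(values, num_expected):
    """Minimal stand-in for tree_unflatten's values-vs-spec length check."""
    if len(values) != num_expected:
        raise ValueError(
            f"tree_unflatten(values, spec): `values` has length {len(values)} "
            f"but the spec refers to a pytree that holds {num_expected} items"
        )
    return tuple(values)


# Deserializing a still-compressed chunk yields no items, reproducing the error:
try:
    unflatten([], 4)
except ValueError as e:
    print(e)
```

So the symptom is downstream of the real bug: the item loader received raw compressed bytes and extracted nothing from them.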

@tchaton
Collaborator

tchaton commented Apr 11, 2024

Hey @rakro101, I published a new version. Can you try again?

@rakro101
Author

@tchaton it works now, but the file extension should be .zstd instead of .bin.
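The remaining nit is cosmetic: shard filenames do not reveal the codec. A hypothetical naming helper sketching the convention the reporter asks for (this is not litdata's actual filename logic; the chunk-{rank}-{index} pattern is assumed for illustration):

```python
from typing import Optional


def chunk_filename(rank: int, index: int, compression: Optional[str]) -> str:
    """End compressed shards in the codec name (e.g. .zstd), plain shards in .bin."""
    ext = compression if compression else "bin"
    return f"chunk-{rank}-{index}.{ext}"


print(chunk_filename(0, 0, "zstd"))  # chunk-0-0.zstd
print(chunk_filename(0, 0, None))    # chunk-0-0.bin
```

Encoding the codec in the extension lets readers (and humans inspecting the output directory) tell compressed shards apart without opening them.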
