Cache directory resolution issues in Google Colab #126

Open
awaelchli opened this issue May 8, 2024 · 1 comment
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@awaelchli
Member

awaelchli commented May 8, 2024

🐛 Bug

In Google Colab, cache directory resolution causes the optimize function to fail with an IsADirectoryError.

To Reproduce

pip install litdata==0.2.6

Minimal repro in Colab

Code sample

import os

# save some files
os.makedirs("my_data", exist_ok=True)
with open("my_data/file.txt", "w") as file:
    file.write("Test")
import numpy as np
from litdata import optimize


def process(filename):
    with open(filename, "r"):
        pass  # do some processing
    return np.array([1, 2, 3])

if __name__ == "__main__":
    optimize(
        fn=process,
        inputs=["my_data/file.txt"],
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB"
    )

raises the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/litdata/processing/data_processor.py", line 626, in _handle_data_chunk_recipe
    item_data_or_generator = self.data_recipe.prepare_item(current_item)
  File "/usr/local/lib/python3.10/dist-packages/litdata/processing/functions.py", line 148, in _prepare_item
    return self._fn(item_metadata)
  File "<ipython-input-3-f77cf781dbef>", line 6, in process
    with open(filename, "r"):
IsADirectoryError: [Errno 21] Is a directory: '/tmp/data'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/litdata/processing/data_processor.py", line 423, in run
    self._loop()
  File "/usr/local/lib/python3.10/dist-packages/litdata/processing/data_processor.py", line 472, in _loop
    self._handle_data_chunk_recipe(index)
  File "/usr/local/lib/python3.10/dist-packages/litdata/processing/data_processor.py", line 638, in _handle_data_chunk_recipe
    raise RuntimeError(f"Failed processing {self.items[index]}") from e
RuntimeError: Failed processing /tmp/data
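
Note that the path in the traceback is '/tmp/data' rather than the 'my_data/file.txt' passed in inputs, so the function receives a directory instead of the file. The following is a minimal diagnostic sketch, not part of litdata: describe_path and process_checked are hypothetical names, and the wrapper only makes the failure message more explicit when narrowing this down.

```python
import os


def describe_path(path):
    """Return 'directory', 'file', or 'missing' for the given path."""
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


def process_checked(filename):
    """Hypothetical variant of `process` that reports what it was handed."""
    kind = describe_path(filename)
    if kind != "file":
        # Fail early with the exact path the worker received.
        raise ValueError(f"expected a file, got {kind}: {filename!r}")
    with open(filename, "r") as f:
        return f.read()
```

Passing process_checked as fn in place of process should surface the rewritten path directly in the error message.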

Expected behavior

This works locally and in Studios, so we would also expect it to work in Google Colab.

Environment

  • PyTorch Version (e.g., 1.0): N/A
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): N/A
  • Build command you used (if compiling from source): N/A
  • Python version: 3.11
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: N/A
  • Any other relevant information: N/A

Additional context

The issue was raised originally in LitGPT:
Lightning-AI/litgpt#1402

@awaelchli awaelchli added bug Something isn't working help wanted Extra attention is needed labels May 8, 2024
@tchaton
Collaborator

tchaton commented May 9, 2024

@awaelchli Mind contributing a fix? It should be easy: use the env variables from Colab to detect the environment and change the default cache dir.
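
A rough sketch of what that suggestion could look like, under stated assumptions. None of this is litdata's actual code: running_in_colab and default_cache_dir are hypothetical helpers, the environment variable names are assumed to be set by Colab runtimes and should be verified in a live session, and both cache locations are placeholders chosen for illustration.

```python
import os
import tempfile

# Assumption: Colab runtimes export at least one of these variables.
_COLAB_ENV_HINTS = ("COLAB_RELEASE_TAG", "COLAB_GPU")


def running_in_colab(env=None):
    """Guess whether we are inside Colab from environment variables."""
    env = os.environ if env is None else env
    return any(name in env for name in _COLAB_ENV_HINTS)


def default_cache_dir(env=None):
    """Hypothetical helper: choose a cache root depending on the environment."""
    env = os.environ if env is None else env
    if running_in_colab(env):
        # Placeholder location under Colab's working directory.
        return "/content/cache"
    # Placeholder fallback for every other environment.
    return os.path.join(tempfile.gettempdir(), "cache")
```

Taking the environment as an optional argument keeps the detection testable without modifying os.environ.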
