litdata.optimize accidentally deletes files from the local filesystem #93

Open
hubertsiuzdak opened this issue Apr 5, 2024 · 2 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@hubertsiuzdak

🐛 Bug

When filepaths are passed as inputs to litdata.optimize, it attempts to resolve an input_dir. This input_dir is later used by DataWorker to cache these files and manage cleanup.

But _get_input_dir is very error-prone, as it only looks at the first element of inputs:

indexed_paths = _get_indexed_paths(inputs[0])

and assumes that input_dir is always three directories deep from the root:

return "/" + os.path.join(*str(absolute_path).split("/")[:4])

However, if our input files don't follow these assumptions, e.g. they come from different top-level directories, things can go badly wrong. That's because when clearing the cache, filepaths are determined simply by replacing input_dir with cache_dir:

for path in paths:
    if input_dir:
        if not path.startswith(cache_dir) and input_dir.path is not None:
            path = path.replace(input_dir.path, cache_dir)
        if os.path.exists(path):
            os.remove(path)

But if input_dir.path is not contained in path, replace does nothing, and the code then proceeds to delete the original file! Removing these paths should be done with much more caution.
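A safer cleanup would only ever delete files it can prove live inside the cache directory. A minimal sketch of such a guard (my own code, not litdata's), assuming the same input_dir/cache_dir replacement as above:

import os

def remove_cached_file(path, input_dir_path, cache_dir):
    # Rewrite the path into the cache only if input_dir_path is really a prefix.
    if input_dir_path and path.startswith(input_dir_path):
        path = path.replace(input_dir_path, cache_dir, 1)
    # Refuse to delete anything that does not resolve to a location inside cache_dir.
    path_abs = os.path.abspath(path)
    cache_abs = os.path.abspath(cache_dir)
    if os.path.commonpath([path_abs, cache_abs]) != cache_abs:
        return  # not a cache file: never touch it
    if os.path.exists(path_abs):
        os.remove(path_abs)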

To Reproduce

Create a directory and ensure python can save to it:

sudo mkdir /mnt/litdata-example
sudo chmod 777 /mnt/litdata-example/

Then run a simple Python script:

import os
import uuid
from glob import glob

import litdata
import torch

base_dir = "/mnt/litdata-example"

for _ in range(4):  # create 4 random directories with pytorch tensors
    dir_name = os.path.join(base_dir, str(uuid.uuid4())[:8])
    os.makedirs(dir_name, exist_ok=True)
    torch.save(torch.randn(4, 4), os.path.join(dir_name, "tensor.pt"))  # Save a random pytorch tensor

files_before = glob(os.path.join(base_dir, "*/*.pt"))
print(files_before)  # print the paths of the saved tensors to confirm creation

litdata.optimize(fn=lambda x: x, inputs=files_before, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")

files_after = glob(os.path.join(base_dir, "*/*.pt"))
print(files_after)  # some files are gone! 👋
assert len(files_before) == len(files_after)

And yes... this actually happened to me. I was quite astonished to see some of my files just deleted 🤯

Environment

  • litdata==0.2.3

Additional context

Is caching input files in litdata.optimize actually necessary? The most common use case is to retrieve a file only once during dataset preparation. If we simply set an empty input directory input_dir = Dir() in DataProcessor, we can avoid all of this.
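In the meantime, one user-side workaround (just a sketch; the staging directory and the copy step are mine, not part of litdata) is to copy all inputs into a single directory that satisfies the depth assumption, so the inferred input_dir is a prefix of every input and only cache copies ever get removed:

import os
import shutil
from glob import glob

import litdata

sources = glob("/mnt/litdata-example/*/*.pt")

# Hypothetical staging directory, three levels deep so the "[:4]" heuristic
# resolves input_dir to exactly this directory.
staging_dir = "/mnt/litdata-example/staged"
os.makedirs(staging_dir, exist_ok=True)

staged = []
for i, src in enumerate(sources):
    dst = os.path.join(staging_dir, f"{i:04d}_{os.path.basename(src)}")
    shutil.copy(src, dst)  # originals stay untouched
    staged.append(dst)

litdata.optimize(fn=lambda x: x, inputs=staged, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")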

@hubertsiuzdak added the bug and help wanted labels on Apr 5, 2024

github-actions bot commented Apr 5, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Apr 5, 2024

Omg @hubertsiuzdak, my apologies for this. This should only happen on the Lightning AI platform. I will disable this behaviour when running outside of it.
