`litdata.optimize` accidentally deletes files from the local filesystem #93
🐛 Bug
When file paths are passed as inputs to `litdata.optimize`, it attempts to resolve `input_dir`. This `input_dir` is later used in `DataWorker` to cache these files and manage cleanup.

But `_get_input_dir` is very error-prone, as it only looks at the first element of `inputs`:

litdata/src/litdata/processing/functions.py
Line 53 in ee69581

and assumes that `input_dir` is always three directories deep from the root:

litdata/src/litdata/processing/functions.py
Line 71 in ee69581
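To make the two assumptions concrete, here is a minimal sketch. `guess_input_dir` is a hypothetical stand-in written for illustration, not the actual `_get_input_dir` code, and the file paths are made up. It only mirrors the behavior described above: look at the first input and keep its first three directories.

```python
from pathlib import PurePosixPath


def guess_input_dir(inputs):
    """Hypothetical stand-in: derive the input dir from the FIRST path only,
    keeping the root plus its first three directories."""
    first = PurePosixPath(inputs[0])
    return str(PurePosixPath(*first.parts[:4]))


# Made-up inputs that come from two different top-level directories.
inputs = [
    "/data/project/images/train/a.jpg",
    "/home/user/other/b.jpg",
]

input_dir = guess_input_dir(inputs)

# Which inputs actually live under the guessed directory?
covered = [p.startswith(input_dir + "/") for p in inputs]

print(input_dir)  # /data/project/images
print(covered)    # [True, False]
```

The second path is not under the guessed directory, so any later step that assumes every input starts with `input_dir` is working from a false premise.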
However, if our input files don't follow these assumptions, e.g. they come from different top-level directories, it can really mess things up. That's because when clearing the cache, file paths are determined simply by replacing `input_dir` with `cache_dir`:

litdata/src/litdata/processing/data_processor.py
Lines 198 to 204 in ee69581

But if `input_dir.path` is not in `path`, `replace` does nothing, and then it just proceeds to delete a valid file! Removing these paths should be done with much more caution.

To Reproduce
Create a directory and ensure Python can save to it:
Then run a simple Python script:
And yes... this actually happened to me. I was quite astonished to see some of my files just deleted 🤯
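The underlying failure mode can be sketched without litdata at all. The helper names below (`cached_path_unsafe`, `cached_path_guarded`) are hypothetical and are not part of the library. The first mimics the plain string replacement described above. The second is one possible more cautious variant, assuming cleanup should only ever touch files that really live under `input_dir`.

```python
import os
import tempfile


def cached_path_unsafe(path, input_dir, cache_dir):
    """Hypothetical: map an input path to its cache path by string replacement.
    If input_dir is not a substring of path, the path comes back unchanged."""
    return path.replace(input_dir, cache_dir)


def cached_path_guarded(path, input_dir, cache_dir):
    """Hypothetical: return the cache path only if path is inside input_dir,
    otherwise None so the caller can skip deletion."""
    path = os.path.abspath(path)
    input_dir = os.path.abspath(input_dir)
    if os.path.commonpath([path, input_dir]) != input_dir:
        return None
    return os.path.join(cache_dir, os.path.relpath(path, input_dir))


with tempfile.TemporaryDirectory() as root:
    input_dir = os.path.join(root, "inputs")
    cache_dir = os.path.join(root, "cache")
    elsewhere = os.path.join(root, "elsewhere")
    for d in (input_dir, cache_dir, elsewhere):
        os.makedirs(d)

    # A valid file that lives outside input_dir.
    original = os.path.join(elsewhere, "precious.txt")
    with open(original, "w") as f:
        f.write("do not delete")

    # The unsafe mapping points straight back at the original file,
    # so a subsequent os.remove() would delete it.
    unsafe_hits_original = cached_path_unsafe(original, input_dir, cache_dir) == original

    # The guarded mapping refuses to produce a path for it.
    guarded_skips = cached_path_guarded(original, input_dir, cache_dir) is None

    # Files inside input_dir are still mapped into cache_dir.
    inside = os.path.join(input_dir, "a.txt")
    guarded_maps_inside = (
        cached_path_guarded(inside, input_dir, cache_dir)
        == os.path.join(cache_dir, "a.txt")
    )

print(unsafe_hits_original, guarded_skips, guarded_maps_inside)
```

The guarded variant compares path components rather than raw substrings, so a file outside `input_dir` is never handed to the delete step.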
Environment
Additional context
Is caching input files in `litdata.optimize` actually necessary? The most common use case is to retrieve a file only once during dataset preparation. If we simply set an empty input directory `input_dir = Dir()` in `DataProcessor`, we can avoid all of this.