Dataset on Hub re-downloads every time? #6773
Comments
The caching works as expected when I try to reproduce this locally or on Colab...
hi @mariosasko, thank you for checking. I also tried running this again just now, and it seems like the `load_dataset()` calls do use the cache, but I think the issue might be in the caching of the function output for `map()`. I wonder if the issue stems from using CSV output. Do you recommend changing to Parquet, and if so, is there an easy way to take the already uploaded data on the Hub and reformat it?
This issue seems similar to #6184. A workaround:

```python
split_Claimants_row = lambda row: {'Claimants': row['Claimants'].split(';')}
split_Claimants_row.__module__ = None
```
Thank you, I'll give this a try. Your fix makes sense to me, so this issue can be closed for now. Unrelated comment: for "Downloads last month" on the hub page, I'm assuming for this project that each downloaded CSV counts as 1 download? The dataset consists of 51 CSVs, so I'm trying to see why it's incrementing so quickly (1125 two days ago, 1246 right now).
This doc explains how we count "Downloads last month": https://huggingface.co/docs/hub/datasets-download-stats |
Describe the bug
Hi, I have a dataset on the hub here. It has 1k+ downloads, which I am sure is mostly just me and my colleagues working with it. It should have far fewer, since I'm using the same machine with a properly set up `HF_HOME` variable. However, whenever I run the function `load_borderlines_hf` below, it downloads the entire dataset from the hub and then does the other logic: https://github.com/manestay/borderlines/blob/4e161f444661e2ebfe643f3fe149d9258d63a57d/run_gpt/lib.py#L80

Let me know what I'm doing wrong here, or if it's a bug with the `datasets` library itself. On the hub I have my data stored in CSVs, but several columns are lists, so that's why I have the code to map splitting on `;`. I looked into dataset loading scripts, but it seemed difficult to set up. I have verified that other datasets and models on my system are using the cache properly (e.g. I have a 13B parameter model and large datasets, but those are cached and don't redownload).

**EDIT:** as pointed out in the discussion below, it may be the `map()` calls that aren't being cached properly. Supposing the `load_dataset()` calls retrieve from the cache, then it should be the case that the `map()` calls also retrieve from the cached output. But the `map()` commands re-execute sometimes.

Steps to reproduce the bug

`load_borderlines_hf(None)`
Expected behavior

Re-running the code, which calls `datasets.load_dataset('manestay/borderlines', 'territories')`, should use the cached version.

Environment info

- `datasets` version: 2.16.1
- `huggingface_hub` version: 0.20.3
- `fsspec` version: 2023.10.0