Dataset on Hub re-downloads every time? #6773

Closed
manestay opened this issue Apr 2, 2024 · 5 comments

Comments

@manestay

manestay commented Apr 2, 2024

Describe the bug

Hi, I have a dataset on the hub here. It has 1k+ downloads, which I'm sure is mostly just me and my colleagues working with it. It should have far fewer, since I'm using the same machine with a properly set up HF_HOME variable. However, whenever I run the function load_borderlines_hf below, it downloads the entire dataset from the Hub and then runs the rest of the logic:
https://github.com/manestay/borderlines/blob/4e161f444661e2ebfe643f3fe149d9258d63a57d/run_gpt/lib.py#L80

Let me know what I'm doing wrong here, or if it's a bug with the datasets library itself. On the Hub my data is stored in CSVs, but several columns are lists, which is why I have the map() code that splits them on ';'. I looked into dataset loading scripts, but they seemed difficult to set up. I have verified that other datasets and models on my system use the cache properly (e.g. I have a 13B parameter model and large datasets, but those are cached and don't re-download).

**EDIT:** as pointed out in the discussion below, it may be the map() calls that aren't being cached properly. Assuming load_dataset() retrieves from the cache, the map() calls should also retrieve their cached output. But the map() calls sometimes re-execute.
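For context, here is a minimal sketch of the rough shape of the code (the real function is at the link above; the parameter name, config, and column names here are assumptions):

```python
from datasets import load_dataset

def load_borderlines_hf(arg):
    # Rough approximation of run_gpt/lib.py lines 80-100; see the repo link for the real code.
    territories = load_dataset('manestay/borderlines', 'territories')
    # The Hub CSVs store list-valued columns joined on ';', so split them back into lists.
    territories = territories.map(
        lambda row: {'Claimants': row['Claimants'].split(';')}
    )
    return territories
```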

Steps to reproduce the bug

  1. Copy and paste the function from here (lines 80-100)
  2. Run it in Python: load_borderlines_hf(None)
  3. It completes successfully, downloading from the HF Hub, then doing the mapping logic, etc.
  4. If you run it again after some time, it will re-download, ignoring the cache

Expected behavior

Re-running the code, which calls datasets.load_dataset('manestay/borderlines', 'territories'), should use the cached version
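A minimal way to check this (a sketch; the exact cache location depends on HF_HOME):

```python
from datasets import load_dataset

# With the default download_mode ("reuse_dataset_if_exists"), a second call
# should read the Arrow files already cached under HF_HOME instead of
# re-downloading from the Hub.
territories = load_dataset('manestay/borderlines', 'territories')
print(territories.cache_files)  # paths of the local Arrow cache files per split
```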

Environment info

  • datasets version: 2.16.1
  • Platform: Linux-5.14.21-150500.55.7-default-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • huggingface_hub version: 0.20.3
  • PyArrow version: 15.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.10.0
@mariosasko
Collaborator

The caching works as expected when I try to reproduce this locally or on Colab...

@manestay
Author

manestay commented Apr 3, 2024

Hi @mariosasko, thank you for checking. I also tried running this again just now, and it seems like load_dataset() caches properly (though I'll double-check later).

I think the issue might be in the caching of the function output for territories.map(lambda row: {'Claimants': row['Claimants'].split(';')}). My current run re-executed this, even though I have run it many times before and, as the cached load_dataset() shows, the loaded dataset is the same.
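One way to sanity-check this (a sketch, not something verified in this thread) is to hash the transform the same way datasets does when deciding whether a cached map() result can be reused:

```python
from datasets.fingerprint import Hasher

# datasets reuses a cached map() result only if the hash of the transform
# (and of the input dataset) matches a previous run; if this hash changes
# across runs, the map() is recomputed.
split_claimants = lambda row: {'Claimants': row['Claimants'].split(';')}
print(Hasher.hash(split_claimants))
```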

I wonder if the issue stems from using CSV output. Do you recommend changing to Parquet, and if so, is there an easy way to take the already-uploaded data on the Hub and reformat it?
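One possible route I'm considering (a sketch, not verified; the target repo name is hypothetical):

```python
from datasets import load_dataset

# push_to_hub uploads the data as Parquet shards, so re-pushing the loaded
# dataset would give a Parquet copy on the Hub (target repo name is made up).
ds = load_dataset('manestay/borderlines', 'territories')
ds.push_to_hub('manestay/borderlines-parquet', config_name='territories')
```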

@mariosasko
Collaborator

This issue seems similar to #6184 (dill serializes objects defined outside the __main__ module by reference). You should be able to work around this limitation by defining the lambdas outside of load_borderlines_hf (as module variables) and then setting their __module__ attribute to None to force serializing them by value, e.g., like this:

split_Claimants_row = lambda row: {'Claimants': row['Claimants'].split(';')}
split_Claimants_row.__module__ = None
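Applied to the map call from the earlier comment, the usage would then look roughly like this (a sketch, assuming territories is the loaded dataset):

```python
territories = territories.map(split_Claimants_row)
```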

@manestay
Author

manestay commented Apr 4, 2024

Thank you, I'll give this a try. Your fix makes sense to me, so this issue can be closed for now.

Unrelated comment -- for "Downloads last month" on the Hub page, am I right that for this project each downloaded CSV counts as 1 download? The dataset consists of 51 CSVs, so I'm trying to see why it's incrementing so quickly (1125 two days ago, 1246 right now).

@mariosasko
Collaborator

This doc explains how we count "Downloads last month": https://huggingface.co/docs/hub/datasets-download-stats

@manestay manestay closed this as completed Apr 8, 2024