Dataset on Hub re-downloads every time? #6773

Closed
manestay opened this issue Apr 2, 2024 · 5 comments

Comments

@manestay

manestay commented Apr 2, 2024

Describe the bug

Hi, I have a dataset on the hub here. It has 1k+ downloads, which I'm sure is mostly just me and my colleagues working with it. It should have far fewer, since I'm using the same machine with a properly set up HF_HOME variable. However, whenever I run the function load_borderlines_hf below, it downloads the entire dataset from the Hub and then runs the rest of the logic:
https://github.com/manestay/borderlines/blob/4e161f444661e2ebfe643f3fe149d9258d63a57d/run_gpt/lib.py#L80

Let me know what I'm doing wrong here, or if it's a bug with the datasets library itself. On the Hub my data is stored in CSVs, but several columns are lists, which is why I have the map() code that splits them on ';'. I looked into dataset loading scripts, but they seemed difficult to set up. I have verified that other datasets and models on my system use the cache properly (e.g. I have a 13B parameter model and large datasets, but those are cached and don't re-download).

**EDIT:** as pointed out in the discussion below, it may be the map() calls that aren't being cached properly. Assuming load_dataset() retrieves from the cache, the map() calls should also retrieve their cached output. But the map() calls sometimes re-execute.
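For context, here is a minimal sketch of the rough shape of the code (the real function is at the link above; the parameter name, config, and column names here are assumptions):

```python
from datasets import load_dataset

def load_borderlines_hf(arg):
    # Rough approximation of run_gpt/lib.py lines 80-100; see the repo link for the real code.
    territories = load_dataset('manestay/borderlines', 'territories')
    # The Hub CSVs store list-valued columns joined on ';', so split them back into lists.
    territories = territories.map(
        lambda row: {'Claimants': row['Claimants'].split(';')}
    )
    return territories
```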

Steps to reproduce the bug

  1. Copy and paste the function from here (lines 80-100)
  2. Run it in Python: load_borderlines_hf(None)
  3. It completes successfully, downloading from the HF Hub, then doing the mapping logic, etc.
  4. If you run it again after some time, it will re-download, ignoring the cache

Expected behavior

Re-running the code, which calls datasets.load_dataset('manestay/borderlines', 'territories'), should use the cached version
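A minimal way to check this (a sketch; the exact cache location depends on HF_HOME):

```python
from datasets import load_dataset

# With the default download_mode ("reuse_dataset_if_exists"), a second call
# should read the Arrow files already cached under HF_HOME instead of
# re-downloading from the Hub.
territories = load_dataset('manestay/borderlines', 'territories')
print(territories.cache_files)  # paths of the local Arrow cache files per split
```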

Environment info

  • datasets version: 2.16.1
  • Platform: Linux-5.14.21-150500.55.7-default-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • huggingface_hub version: 0.20.3
  • PyArrow version: 15.0.0
  • Pandas version: 1.5.3
  • fsspec version: 2023.10.0
@mariosasko
Collaborator

The caching works as expected when I try to reproduce this locally or on Colab...

@manestay
Author

manestay commented Apr 3, 2024

Hi @mariosasko, thank you for checking. I also tried running this again just now, and it seems like load_dataset() caches properly (though I'll double-check later).

I think the issue might be in the caching of the function output for territories.map(lambda row: {'Claimants': row['Claimants'].split(';')}). My current run re-executed this, even though I have run it many times before and, as the cached load_dataset() shows, the loaded dataset is the same.
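One way to sanity-check this (a sketch, not something verified in this thread) is to hash the transform the same way datasets does when deciding whether a cached map() result can be reused:

```python
from datasets.fingerprint import Hasher

# datasets reuses a cached map() result only if the hash of the transform
# (and of the input dataset) matches a previous run; if this hash changes
# across runs, the map() is recomputed.
split_claimants = lambda row: {'Claimants': row['Claimants'].split(';')}
print(Hasher.hash(split_claimants))
```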

I wonder if the issue stems from using CSV output. Do you recommend changing to Parquet, and if so, is there an easy way to take the already-uploaded data on the Hub and reformat it?
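One possible route I'm considering (a sketch, not verified; the target repo name is hypothetical):

```python
from datasets import load_dataset

# push_to_hub uploads the data as Parquet shards, so re-pushing the loaded
# dataset would give a Parquet copy on the Hub (target repo name is made up).
ds = load_dataset('manestay/borderlines', 'territories')
ds.push_to_hub('manestay/borderlines-parquet', config_name='territories')
```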

@mariosasko
Collaborator

This issue seems similar to #6184 (dill serializes objects defined outside the __main__ module by reference). You should be able to work around this limitation by defining the lambdas outside of load_borderlines_hf (as module variables) and then setting their __module__ attribute to None to force serializing them by value, e.g., like this:

split_Claimants_row = lambda row: {'Claimants': row['Claimants'].split(';')}
split_Claimants_row.__module__ = None
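Applied to the map call from the earlier comment, the usage would then look roughly like this (a sketch, assuming territories is the loaded dataset):

```python
territories = territories.map(split_Claimants_row)
```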

@manestay
Author

manestay commented Apr 4, 2024

Thank you, I'll give this a try. Your fix makes sense to me, so this issue can be closed for now.

Unrelated comment -- for "Downloads last month" on the Hub page, am I right that for this project each downloaded CSV counts as 1 download? The dataset consists of 51 CSVs, so I'm trying to see why it's incrementing so quickly (1125 two days ago, 1246 right now).

@mariosasko
Collaborator

This doc explains how we count "Downloads last month": https://huggingface.co/docs/hub/datasets-download-stats

@manestay manestay closed this as completed Apr 8, 2024