Load GitHub datasets from Hub #4059

albertvillanova · 2022-03-30T09:21:56Z

We have repeatedly hit connection errors when requesting GitHub, because the site is sometimes unavailable.

This PR requests the Hub instead, once all GitHub datasets are mirrored on the Hub.

Fix #2048

Related to:

HuggingFaceDocBuilderDev · 2022-03-30T09:32:15Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq · 2022-03-31T14:01:02Z

Currently the github datasets versioning is synced with the datasets lib versioning: when you load a github dataset using datasets==x.y.z, then the version of the dataset will be the one at the git tag x.y.z. This is for reproducibility reasons.
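To make the pinning described above concrete, here is a minimal sketch of how a script URL could be derived from the library version. The helper name `github_script_url` and the URL layout are assumptions for illustration, not the library's actual code.

```python
from typing import Optional

# Assumed base URL and repository layout, for illustration only.
GITHUB_RAW = "https://raw.githubusercontent.com/huggingface/datasets"


def github_script_url(
    dataset_name: str, library_version: str, revision: Optional[str] = None
) -> str:
    """Build a dataset script URL, pinned to the library version by default."""
    # Without an explicit revision, the git tag equals the installed version.
    ref = revision if revision is not None else library_version
    return f"{GITHUB_RAW}/{ref}/datasets/{dataset_name}/{dataset_name}.py"


print(github_script_url("squad", "2.0.0"))
print(github_script_url("squad", "2.0.0", revision="main"))
```

With this scheme, two users on different library versions resolve different tags for the same dataset name, which is what gives reproducibility.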

We could stop having this behavior and always use the latest version of the dataset, but when we do a breaking change it will break github datasets for previous versions of the library. It could be nice to think about tools that will allow backward compatibility if we ever need to make a breaking change in some datasets. Maybe a way to specify which revision of the dataset to use based on the datasets major version.

If we keep this behavior, then maybe add a note in setup.py to push to PyPI only after the Update Hub repositories CI job is done. It can take a few minutes to add the version tag to all the dataset repositories on the Hub. If we push to PyPI before the tags are pushed, then users who install datasets and run load_dataset in that window might get a 404.
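The release ordering above could be guarded by a small pre-publish check. This is only a sketch: `repos_missing_tag`, `safe_to_publish`, and the injected `list_tags` callable are hypothetical names, and a real check would need to query the Hub for each repository's tags.

```python
from typing import Callable, Dict, Iterable, List


def repos_missing_tag(
    repo_ids: Iterable[str], tag: str, list_tags: Callable[[str], Iterable[str]]
) -> List[str]:
    """Return the repositories that do not carry the version tag yet."""
    return [repo_id for repo_id in repo_ids if tag not in set(list_tags(repo_id))]


def safe_to_publish(
    repo_ids: Iterable[str], tag: str, list_tags: Callable[[str], Iterable[str]]
) -> bool:
    """Publishing is safe only once every repository has the tag."""
    return not repos_missing_tag(repo_ids, tag, list_tags)


# Fake tag listing standing in for a real lookup (illustrative data only).
fake_tags: Dict[str, List[str]] = {"squad": ["2.0.0", "2.1.0"], "glue": ["2.0.0"]}
print(repos_missing_tag(fake_tags, "2.1.0", lambda repo_id: fake_tags[repo_id]))
print(safe_to_publish(fake_tags, "2.0.0", lambda repo_id: fake_tags[repo_id]))
```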

albertvillanova · 2022-03-31T14:31:32Z

@lhoestq I was going to increase the max_retries as done for metrics:

Increase max retries for GitHub metrics #4063

But then I realized that loading from the Hub would work as well. That is why I opened this PR.
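For comparison, the retry approach mentioned above can be sketched as a generic wrapper. The name `call_with_retries`, the exponential wait, and the choice of `ConnectionError` are assumptions for illustration; this is not the code from #4063.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T], max_retries: int = 3, base_wait: float = 0.5
) -> T:
    """Call `request`, retrying on ConnectionError up to `max_retries` times."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    attempt = 0
    while True:
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries:
                raise  # Retries exhausted: surface the last error.
            time.sleep(base_wait * 2**attempt)  # Wait longer after each failure.
            attempt += 1
```

Retries only help with short outages; requesting a different host, as this PR does, avoids depending on GitHub availability at all.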

Definitely, we should decide which behavior we want:

- We have been working toward eliminating the distinctions between canonical and community datasets.
- If we continue in that direction, then passing (or not passing) revision should behave the same for canonical and community datasets.
- If we want to keep tying the library version to the canonical datasets version, that is definitely a difference between canonical and community datasets.

Not sure what could be better in the long term...

albertvillanova · 2022-03-31T14:35:51Z

> We could stop having this behavior and always use the latest version of the dataset, but when we do a breaking change it will break github datasets for previous versions of the library.

I'm not sure I understand this. Previous versions of the datasets library will continue to download GitHub datasets from GitHub, syncing library/dataset versions... Where is the problem?

lhoestq · 2022-03-31T14:44:44Z

Yes you're right, previous versions of datasets will still continue to download from github, but future versions will not.
If we release datasets 2.1 with this behavior removed, and one day we release datasets 3.0 with a breaking change in the dataset scripts, then all versions >=2.1 will break.

lhoestq · 2022-03-31T14:48:08Z

Ideally we should drop the differences between github datasets and community datasets, and maybe provide a way to fallback on an older version of a dataset repository if the user's datasets version is too old and incompatible with it.
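One possible shape for such a fallback, as a sketch only: each dataset revision declares the minimum library version it needs, and the loader picks the newest revision the installed version satisfies. The names `pick_revision` and `min_version_by_revision`, and the revision names in the example, are hypothetical. The version parser handles plain `x.y.z` strings only (no `dev` or `rc` suffixes).

```python
from typing import Dict, Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.4.0' into (2, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_revision(
    installed: str, min_version_by_revision: Dict[str, str]
) -> Optional[str]:
    """Return the compatible revision with the highest minimum requirement."""
    installed_v = parse_version(installed)
    compatible = [
        (parse_version(min_v), revision)
        for revision, min_v in min_version_by_revision.items()
        if parse_version(min_v) <= installed_v
    ]
    # None means no revision supports the installed library version.
    return max(compatible)[1] if compatible else None


# Illustrative data: 'main' needs 3.0.0, an older branch needs 2.1.0.
revisions = {"main": "3.0.0", "v2-compat": "2.1.0"}
print(pick_revision("2.4.0", revisions))
```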

lhoestq · 2022-09-13T16:09:49Z

I just noticed I literally opened the same PR lol

I'm still convinced that we should do a better version compatibility check but we can see that later IMO

albertvillanova · 2022-09-15T05:29:00Z

Normally in open source projects, when there is a duplicate PR, the latter is tagged as "duplicate" and closed. 😜

Let me make things clear in my mind: so you say that the blocking point that was preventing this PR from merging is no longer a blocking point and could be addressed in a subsequent PR?

lhoestq · 2022-09-15T15:24:12Z

Let me close the duplicate one, sorry

> Let me make things clear in my mind: so you say that the blocking point that was preventing this PR from merging is no longer a blocking point and could be addressed in a subsequent PR?

Yes 🙈

lhoestq

Cool ! LGTM :)

Finally we'll remove the differences between Hub datasets and GitHub datasets ^^

(Note that after this PR, all the changes made to a dataset will affect all datasets versions from now on)

albertvillanova · 2022-09-16T12:39:30Z

> Note that after this PR, all the changes made to a dataset will affect all datasets versions from now on

Yes, we have aligned this behavior with Hub datasets, as this is already the case for Hub datasets.

albertvillanova added 2 commits March 30, 2022 10:47

Test loading GitHub dataset from Hub

fae366b

Load GitHub datasets from Hub

ac61fdd

albertvillanova requested a review from lhoestq March 30, 2022 09:21

albertvillanova added 2 commits March 30, 2022 11:45

Fix tests

7f8d988

Fix style

14e322f

albertvillanova mentioned this pull request Mar 30, 2022

Increase max retries for GitHub metrics #4063

Merged

albertvillanova mentioned this pull request Apr 1, 2022

Increase max retries for GitHub datasets #4079

Merged

This was referenced Sep 13, 2022

[GH->HF] Part 2: Remove all dataset scripts from github #4974

Merged

[GH->HF] Load datasets from the Hub #4973

Closed

albertvillanova added 2 commits September 16, 2022 10:52

Merge remote-tracking branch 'upstream/main' into load-github-datasets-from-hub

80f4fbc

Fix test after merge

dddb47c

lhoestq approved these changes Sep 16, 2022

albertvillanova merged commit 5b23f58 into main Sep 16, 2022

albertvillanova deleted the load-github-datasets-from-hub branch September 16, 2022 12:40

This was referenced Sep 26, 2022

Fix languages of X-CSQA configs in xcsr dataset #5022

Merged

Fix bug with labels of eurlex config of lex_glue dataset #5048

Merged
