
Dataset infos in yaml #4926

Merged
merged 41 commits on Oct 3, 2022

Conversation

@lhoestq (Member) commented Sep 2, 2022

To simplify the addition of new datasets, we'd like to have the dataset infos in the YAML and deprecate the dataset_infos.json file. YAML is readable and easy to edit, and the YAML metadata of the README already contains dataset metadata, so we would have everything in one place.

To be more specific, I moved these fields from DatasetInfo to the YAML:

  • config_name (if there are several configs)
  • download_size
  • dataset_size
  • features
  • splits

Here is what I ended up with for squad:

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  config_name: plain_text
  download_size: 35142551
  dataset_size: 89819400

And it can be a list if there are several configs.
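For illustration, a multi-config layout might look like this (hypothetical config names, features and sizes, not taken from a real dataset):

```yaml
dataset_info:
- config_name: default
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1024
    num_examples: 10
  download_size: 512
  dataset_size: 1024
- config_name: extended
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 2048
    num_examples: 10
  download_size: 1024
  dataset_size: 2048
```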

I already did the change for conll2000 and crime_and_punish as an example.

Implementation details

Load/Read

This is done via DatasetInfosDict.write_to_directory/from_directory

I had to implement custom YAML export logic for SplitDict, Version and Features.
The first two are trivial, but the logic for Features is more complicated, because I added a simplification step (otherwise the YAML would be too long and less readable): it's just a formatting step that removes unnecessary nesting of YAML data.
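As a rough illustration of that formatting step (a toy sketch, not the actual implementation), the idea is to walk the feature tree, drop keys that carry no information, and collapse plain Value nodes down to their dtype:

```python
def simplify(feature):
    """Toy sketch of the formatting step: recursively drop None-valued
    keys and collapse plain Value nodes down to their dtype."""
    if isinstance(feature, dict):
        out = {k: simplify(v) for k, v in feature.items() if v is not None}
        # e.g. {"dtype": "string", "_type": "Value"} -> {"dtype": "string"}
        if out.get("_type") == "Value":
            out.pop("_type")
        return out
    if isinstance(feature, list):
        return [simplify(f) for f in feature]
    return feature

raw = {"name": "answers", "sequence": [{"name": "text", "dtype": "string", "id": None, "_type": "Value"}]}
print(simplify(raw))
```

The simplified form is what ends up in the README YAML, as in the squad example above.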

Other changes

I had to update the DatasetModule factories to also download the README.md alongside the dataset scripts/data files, and not just the dataset_infos.json

YAML validation

I removed the old validation code that was in metadata.py; now we can just use the Hub's YAML validation.

Datasets-cli

The datasets-cli test --save_infos command now creates a README.md file with the dataset_info in it, instead of a dataset_infos.json file.
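A toy sketch of what writing the info into the card involves (a hypothetical function, assuming the README uses a standard `---`-delimited YAML front matter):

```python
def save_info_to_readme(readme_text: str, dataset_info_yaml: str) -> str:
    """Toy sketch: splice a dataset_info block into the README's YAML
    front matter (the block between the leading '---' markers)."""
    if readme_text.startswith("---\n"):
        header, rest = readme_text[4:].split("\n---\n", 1)
        header = header.rstrip("\n") + "\n" + dataset_info_yaml.rstrip("\n")
        return f"---\n{header}\n---\n{rest}"
    # No front matter yet: create one from scratch
    return f"---\n{dataset_info_yaml.rstrip()}\n---\n{readme_text}"

card = "---\npretty_name: SQuAD\n---\n# Dataset Card\n"
print(save_info_to_readme(card, "dataset_info:\n  config_name: plain_text"))
```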

Backward compatibility

dataset_infos.json files are still supported and loaded if they exist, for full backward compatibility.
Though I removed the unnecessary keys when the value is the default (like all the id: null entries from the Value feature types) to make them easier to read.

TODO

  • add comments
  • tests
  • document the new YAML fields
  • try to reload the new dataset_infos.json file content with an old version of datasets

EDITS

  • removed "config_name" when there's only one config
  • removed "version" for now (?), because it's not useful in general
  • renamed the YAML field to "dataset_info" instead of "dataset_infos", since it only has one entry by default (and because "infos" is not English)

Fix #4876

@HuggingFaceDocBuilderDev commented Sep 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -3,6 +3,98 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_infos:
- config_name: conll2000
Member
do we need the (non-default) config_name for backward compatibility?

Member Author

The dataset explicitly defines one configuration with this name, but since there's only one there's no ambiguity. Let me tweak the YAML loading a bit and we can remove this line :)

Member Author

I removed the config name for conll2000 in the YAML part, and for everything to connect smoothly I had to remove the definition of the config in the dataset script itself. I'll see if this is something I can do for all the datasets that have a single configuration.

@lhoestq (Member Author) left a comment

I modified the class_label YAML dump structure to show the label ids, which is more practical; see conll2000.

This is ready for review ! :)

cc @albertvillanova @polinaeterna @mariosasko WDYT ?

I can generate the YAML for all the other datasets in a subsequent PR
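Illustration of that class_label change (with made-up labels, not the real conll2000 tag set): instead of dumping a bare ordered list of names, the dump makes the id -> name mapping explicit:

```python
names = ["O", "B-NP", "I-NP"]

# Old-style dump: just the ordered list, ids are implicit
old_dump = {"names": names}

# New-style dump: label ids shown explicitly as an id -> name mapping
new_dump = {"names": {i: name for i, name in enumerate(names)}}
print(new_dump)
```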

Comment on lines +170 to +171
if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
result[f.name] = value
Member Author

To simplify the JSON and YAML dumps, I stopped dumping the dataclass attributes that are equal to their default value.

e.g. decode=True for Image, or length=-1 for Sequence
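The quoted condition can be exercised end-to-end with a small dataclass (a self-contained sketch mirroring the snippet above, not the library's actual Image feature):

```python
from dataclasses import dataclass, field, fields

def asdict_without_defaults(obj) -> dict:
    """Apply the condition from the snippet above to every dataclass field."""
    result = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
            result[f.name] = value
    return result

@dataclass
class Image:
    # decode=True is the default, so it is dropped from dumps...
    decode: bool = True
    # ...but _type is opted in via metadata, so it always survives
    _type: str = field(default="Image", metadata={"include_in_asdict_even_if_is_default": True})

print(asdict_without_defaults(Image()))              # {'_type': 'Image'}
print(asdict_without_defaults(Image(decode=False)))  # {'decode': False, '_type': 'Image'}
```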

@lhoestq (Member Author) commented Sep 23, 2022

Created #5018 where I added the YAML dataset_info of every single dataset in this repo

see other dataset cards: imagenet-1k, glue, flores, gem

@albertvillanova albertvillanova added the dataset contribution Contribution to a dataset script label Sep 24, 2022
@mariosasko (Contributor) left a comment

Great job!

One nit: the metadata generation in push_to_hub also needs to be updated, no?

PS: We should also probably use DatasetInfo from huggingface_hub instead of having our own implementation in info.py, but this can be addressed later.

@polinaeterna (Contributor) left a comment

Love this change! I added a few suggestions/fixes for the documentation :)

@@ -50,7 +50,9 @@ def register_subcommand(parser: ArgumentParser):
help="Can be used to specify a manual directory to get the files from.",
)
test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
test_parser.add_argument("--save_infos", action="store_true", help="Save the dataset infos file")
test_parser.add_argument(
"--save_infos", action="store_true", help="Save the dataset infos in the dataset card (README.md)"
Contributor

maybe change --save_infos to --save_info to be consistent with dataset_info instead of dataset_infos for users? should then be changed in documentation and docstrings and the code below too, if you agree with this change

Member Author

Good idea! I changed it to --save_info and kept --save_infos as an alias.
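One way to keep the old spelling working (a sketch of the alias approach, not necessarily the exact implementation): argparse accepts several option strings for one argument, and the destination attribute is derived from the first long option:

```python
from argparse import ArgumentParser

parser = ArgumentParser(prog="datasets-cli test")
parser.add_argument(
    "--save_info",
    "--save_infos",  # kept as a backward-compatible alias
    action="store_true",
    help="Save the dataset info in the dataset card (README.md)",
)

# Both spellings set the same `save_info` attribute
print(parser.parse_args(["--save_infos"]).save_info)
```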

@@ -33,8 +33,9 @@
import json
import os
import posixpath
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union
from dataclasses import dataclass
Contributor

Just curious: what's the reason for not importing field explicitly and using just field (instead of dataclasses.field) below, like it was before?

Member Author

field is used as a variable name in several places - this is just to avoid collisions
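A small illustration of the collision being avoided (an illustrative class, not the real info.py): a local variable named field would shadow a bare from dataclasses import field, so the qualified form is safer:

```python
import dataclasses

@dataclasses.dataclass
class ExampleInfo:
    description: str = ""
    citation: str = dataclasses.field(default="", repr=False)

# A loop variable named `field` is common in serialization code; with
# `import dataclasses` there is no imported name for it to shadow:
for field in dataclasses.fields(ExampleInfo):
    print(field.name)
```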

Resolved (outdated) review threads on: ADD_NEW_DATASET.md, docs/source/dataset_script.mdx, .github/PULL_REQUEST_TEMPLATE/add_dataset.md
@lhoestq (Member Author) commented Sep 30, 2022

Took your comments into account and updated push_to_hub to push the dataset_info to the README.md instead of the JSON file :) Let me know if it sounds good to you now!

@mariosasko (Contributor) left a comment

Looks all good now!

Successfully merging this pull request may close these issues.

Move DatasetInfo from datasets_infos.json to the YAML tags in README.md
6 participants