
Dataset infos in yaml #4926

Merged
merged 41 commits on Oct 3, 2022

Conversation

@lhoestq (Member) commented Sep 2, 2022

To simplify the addition of new datasets, we'd like to have the dataset infos in the YAML and deprecate the dataset_infos.json file. YAML is readable and easy to edit, and the YAML metadata of the README already contains dataset metadata, so we would have everything in one place.

To be more specific, I moved these fields from DatasetInfo to the YAML:

  • config_name (if there are several configs)
  • download_size
  • dataset_size
  • features
  • splits

Here is what I ended up with for squad:

dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: context
    dtype: string
  - name: question
    dtype: string
  - name: answers
    sequence:
    - name: text
      dtype: string
    - name: answer_start
      dtype: int32
  splits:
  - name: train
    num_bytes: 79346360
    num_examples: 87599
  - name: validation
    num_bytes: 10473040
    num_examples: 10570
  config_name: plain_text
  download_size: 35142551
  dataset_size: 89819400

And it can be a list if there are several configs.
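For illustration, a multi-config layout might look like this (hypothetical config names, features and sizes, not taken from a real dataset):

```yaml
dataset_info:
- config_name: default
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 1024
    num_examples: 10
  download_size: 512
  dataset_size: 1024
- config_name: extended
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int32
  splits:
  - name: train
    num_bytes: 2048
    num_examples: 10
  download_size: 1024
  dataset_size: 2048
```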

I already did the change for conll2000 and crime_and_punish as an example.

Implementation details

Load/Read

This is done via DatasetInfosDict.write_to_directory/from_directory

I had to implement custom YAML export logic for SplitDict, Version and Features.
The first two are trivial, but the logic for Features is more complicated, because I added a simplification step (otherwise the YAML would be too long and less readable): it's just a formatting step that removes unnecessary nesting of YAML data.
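As a rough illustration of that formatting step (a toy sketch, not the actual implementation), the idea is to walk the feature tree, drop keys that carry no information, and collapse plain Value nodes down to their dtype:

```python
def simplify(feature):
    """Toy sketch of the formatting step: recursively drop None-valued
    keys and collapse plain Value nodes down to their dtype."""
    if isinstance(feature, dict):
        out = {k: simplify(v) for k, v in feature.items() if v is not None}
        # e.g. {"dtype": "string", "_type": "Value"} -> {"dtype": "string"}
        if out.get("_type") == "Value":
            out.pop("_type")
        return out
    if isinstance(feature, list):
        return [simplify(f) for f in feature]
    return feature

raw = {"name": "answers", "sequence": [{"name": "text", "dtype": "string", "id": None, "_type": "Value"}]}
print(simplify(raw))
```

The simplified form is what ends up in the README YAML, as in the squad example above.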

Other changes

I had to update the DatasetModule factories to also download the README.md alongside the dataset scripts/data files, and not just the dataset_infos.json

YAML validation

I removed the old validation code that was in metadata.py; now we can just use the Hub's YAML validation.

Datasets-cli

The datasets-cli test --save_infos command now creates a README.md file with the dataset_info in it, instead of a dataset_infos.json file.
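A toy sketch of what writing the info into the card involves (a hypothetical function, assuming the README uses a standard `---`-delimited YAML front matter):

```python
def save_info_to_readme(readme_text: str, dataset_info_yaml: str) -> str:
    """Toy sketch: splice a dataset_info block into the README's YAML
    front matter (the block between the leading '---' markers)."""
    if readme_text.startswith("---\n"):
        header, rest = readme_text[4:].split("\n---\n", 1)
        header = header.rstrip("\n") + "\n" + dataset_info_yaml.rstrip("\n")
        return f"---\n{header}\n---\n{rest}"
    # No front matter yet: create one from scratch
    return f"---\n{dataset_info_yaml.rstrip()}\n---\n{readme_text}"

card = "---\npretty_name: SQuAD\n---\n# Dataset Card\n"
print(save_info_to_readme(card, "dataset_info:\n  config_name: plain_text"))
```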

Backward compatibility

dataset_infos.json files are still supported and loaded if they exist, for full backward compatibility.
Though I removed the unnecessary keys when the value is the default (like all the id: null entries from the Value feature types) to make them easier to read.

TODO

  • add comments
  • tests
  • document the new YAML fields
  • try to reload the new dataset_infos.json file content with an old version of datasets

EDITS

  • removed "config_name" when there's only one config
  • removed "version" for now (?), because it's not useful in general
  • renamed the YAML field to "dataset_info" instead of "dataset_infos", since it only has one entry by default (and because "infos" is not English)

Fix #4876

@HuggingFaceDocBuilderDev commented Sep 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -3,6 +3,98 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_infos:
- config_name: conll2000
Member
do we need the (non-default) config_name for backward compatibility?

Member Author

The dataset explicitly defines one configuration with this name, but since there's only one there's no ambiguity. Let me tweak the YAML loading a bit and we can remove this line :)

Member Author

I removed the config name for conll2000 in the YAML part, and for everything to connect smoothly I had to remove the definition of the config in the dataset script itself. I'll see if this is something I can do for all the datasets that have a single configuration.

@lhoestq (Member Author) left a comment

I modified the class_label YAML dump structure to show the label ids, which is more practical; see conll2000.

This is ready for review ! :)

cc @albertvillanova @polinaeterna @mariosasko WDYT ?

I can generate the YAML for all the other datasets in a subsequent PR
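Illustration of that class_label change (with made-up labels, not the real conll2000 tag set): instead of dumping a bare ordered list of names, the dump makes the id -> name mapping explicit:

```python
names = ["O", "B-NP", "I-NP"]

# Old-style dump: just the ordered list, ids are implicit
old_dump = {"names": names}

# New-style dump: label ids shown explicitly as an id -> name mapping
new_dump = {"names": {i: name for i, name in enumerate(names)}}
print(new_dump)
```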

Comment on lines +170 to +171
if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
result[f.name] = value
Member Author

To simplify the JSON and YAML dumps, I stopped dumping the dataclass attributes that are equal to their default value.

e.g. decode=True for Image, or length=-1 for Sequence
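The quoted condition can be exercised end-to-end with a small dataclass (a self-contained sketch mirroring the snippet above, not the library's actual Image feature):

```python
from dataclasses import dataclass, field, fields

def asdict_without_defaults(obj) -> dict:
    """Apply the condition from the snippet above to every dataclass field."""
    result = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
            result[f.name] = value
    return result

@dataclass
class Image:
    # decode=True is the default, so it is dropped from dumps...
    decode: bool = True
    # ...but _type is opted in via metadata, so it always survives
    _type: str = field(default="Image", metadata={"include_in_asdict_even_if_is_default": True})

print(asdict_without_defaults(Image()))              # {'_type': 'Image'}
print(asdict_without_defaults(Image(decode=False)))  # {'decode': False, '_type': 'Image'}
```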

@lhoestq (Member Author) commented Sep 23, 2022

Created #5018 where I added the YAML dataset_info of every single dataset in this repo

see other dataset cards: imagenet-1k, glue, flores, gem

@albertvillanova albertvillanova added the dataset contribution Contribution to a dataset script label Sep 24, 2022
@mariosasko (Contributor) left a comment

Great job!

One nit: the metadata generation in push_to_hub also needs to be updated, no?

PS: We should also probably use DatasetInfo from huggingface_hub instead of having our own implementation in info.py, but this can be addressed later.

@polinaeterna (Contributor) left a comment

Love this change! I added a few suggestions/fixes for the documentation :)

@@ -50,7 +50,9 @@ def register_subcommand(parser: ArgumentParser):
help="Can be used to specify a manual directory to get the files from.",
)
test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
test_parser.add_argument("--save_infos", action="store_true", help="Save the dataset infos file")
test_parser.add_argument(
"--save_infos", action="store_true", help="Save the dataset infos in the dataset card (README.md)"
Contributor

maybe change --save_infos to --save_info to be consistent with dataset_info instead of dataset_infos for users? should then be changed in documentation and docstrings and the code below too, if you agree with this change

Member Author

Good idea! I changed it to --save_info and kept --save_infos as an alias.
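One way to keep the old spelling working (a sketch of the alias approach, not necessarily the exact implementation): argparse accepts several option strings for one argument, and the destination attribute is derived from the first long option:

```python
from argparse import ArgumentParser

parser = ArgumentParser(prog="datasets-cli test")
parser.add_argument(
    "--save_info",
    "--save_infos",  # kept as a backward-compatible alias
    action="store_true",
    help="Save the dataset info in the dataset card (README.md)",
)

# Both spellings set the same `save_info` attribute
print(parser.parse_args(["--save_infos"]).save_info)
```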

@@ -33,8 +33,9 @@
import json
import os
import posixpath
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union
from dataclasses import dataclass
Contributor

Just curious: what's the reason for not importing field explicitly and using just field (instead of dataclasses.field) below, like it was before?

Member Author

field is used as a variable name in several places - this is just to avoid collisions
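A small illustration of the collision being avoided (an illustrative class, not the real info.py): a local variable named field would shadow a bare from dataclasses import field, so the qualified form is safer:

```python
import dataclasses

@dataclasses.dataclass
class ExampleInfo:
    description: str = ""
    citation: str = dataclasses.field(default="", repr=False)

# A loop variable named `field` is common in serialization code; with
# `import dataclasses` there is no imported name for it to shadow:
for field in dataclasses.fields(ExampleInfo):
    print(field.name)
```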

Resolved (outdated) review threads on: ADD_NEW_DATASET.md, docs/source/dataset_script.mdx, .github/PULL_REQUEST_TEMPLATE/add_dataset.md
@lhoestq (Member Author) commented Sep 30, 2022

Took your comments into account and updated push_to_hub to push the dataset_info to the README.md instead of the JSON file :) Let me know if it sounds good to you now!

@mariosasko (Contributor) left a comment

Looks all good now!

Successfully merging this pull request may close these issues.

Move DatasetInfo from datasets_infos.json to the YAML tags in README.md
6 participants