Dataset infos in yaml #4926

Merged
41 commits merged into main from dataset_infos-in-yaml on Oct 3, 2022
Changes from 34 commits
Commits
a62535f
wip
lhoestq Aug 31, 2022
2e47856
fix Features yaml
lhoestq Sep 2, 2022
cb2c650
splits to yaml
lhoestq Sep 2, 2022
192f23c
add _to_yaml_list
lhoestq Sep 2, 2022
e91eca6
style
lhoestq Sep 2, 2022
c250545
example: conll2000
lhoestq Sep 2, 2022
2d53abe
example: crime_and_punish
lhoestq Sep 2, 2022
e0bb069
add pyyaml dependency
lhoestq Sep 2, 2022
216bb7e
remove unused imports
lhoestq Sep 2, 2022
887d514
remove validation tests
lhoestq Sep 2, 2022
5e583ef
style
lhoestq Sep 2, 2022
247e3cf
allow dataset_infos to be struct or list in YAML
lhoestq Sep 2, 2022
0418808
fix test
lhoestq Sep 2, 2022
4e8912e
style
lhoestq Sep 2, 2022
52322cc
update "datasets-cli test" + remove "version"
lhoestq Sep 5, 2022
9ef750d
remove config definitions in conll2000 and crime_and_punish
lhoestq Sep 5, 2022
fce2cbb
remove versions for conll2000 and crime_and_punish
lhoestq Sep 5, 2022
9adee79
move conll2000 and cap dummy data
lhoestq Sep 5, 2022
8794104
Merge branch 'main' into dataset_infos-in-yaml
lhoestq Sep 5, 2022
c52f40f
fix test
lhoestq Sep 5, 2022
3d066ad
add tests
lhoestq Sep 7, 2022
92fed44
comments and tests
lhoestq Sep 7, 2022
3ca5026
more test
lhoestq Sep 7, 2022
a53cc05
don't mention the dataset_infos.json file in docs
lhoestq Sep 7, 2022
25a617c
Merge branch 'main' into dataset_infos-in-yaml
lhoestq Sep 9, 2022
08b64ae
Merge branch 'main' into dataset_infos-in-yaml
lhoestq Sep 12, 2022
3c940ca
nit in docs
lhoestq Sep 12, 2022
adfbc76
Merge branch 'main' into dataset_infos-in-yaml
lhoestq Sep 22, 2022
5d7249e
docs
lhoestq Sep 22, 2022
7740a65
dataset_infos -> dataset_info
lhoestq Sep 22, 2022
360c5ae
again
lhoestq Sep 22, 2022
1e5fc45
use id2label in class_label
lhoestq Sep 22, 2022
da414e6
update conll2000
lhoestq Sep 22, 2022
a1df085
fix utf-8 yaml dump
lhoestq Sep 23, 2022
02454c6
Merge branch 'main' into dataset_infos-in-yaml
lhoestq Sep 30, 2022
a304449
--save_infos -> --save_info
lhoestq Sep 30, 2022
f8e7d11
Apply suggestions from code review
lhoestq Sep 30, 2022
e75c0f6
style
lhoestq Sep 30, 2022
6e5cc09
fix reloading a single dataset_info
lhoestq Sep 30, 2022
251be69
push info to README.md in push_to_hub
lhoestq Sep 30, 2022
60bbac5
update test
lhoestq Sep 30, 2022
6 changes: 2 additions & 4 deletions .github/PULL_REQUEST_TEMPLATE/add_dataset.md
@@ -6,11 +6,9 @@

### Checkbox

- [ ] Create the dataset script `/datasets/my_dataset/my_dataset.py` using the template
- [ ] Create the dataset script `./my_dataset/my_dataset.py` using the template
- [ ] Fill the `_DESCRIPTION` and `_CITATION` variables
- [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()`
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is filled with the different configurations of the dataset and that the `BUILDER_CONFIG_CLASS` is specified if there is a custom config class.
- [ ] Generate the metadata file `dataset_infos.json` for all configurations
- [ ] Generate the dummy data `dummy_data.zip` files so the dataset script can be tested, and make sure they don't weigh too much (<50KB)
- [ ] Add the dataset card `README.md` using the template: fill in the tags and the various paragraphs
- [ ] Both tests for the real data and the dummy data pass.
- [ ] Optional - test the dataset using `datasets-cli test ./dataset_name --save_infos`
10 changes: 5 additions & 5 deletions ADD_NEW_DATASET.md
@@ -63,8 +63,6 @@ You are now ready to start the process of adding the dataset. We will create the

- a **dataset script** which contains the code to download and pre-process the dataset: e.g. `squad.py`,
- a **dataset card** with tags and information on the dataset in a `README.md`.
- a **metadata file** (automatically created) which contains checksums and information about the dataset to guarantee that the loading went fine: `dataset_infos.json`
- a **dummy-data file** (automatically created) which contains small examples from the original files to test and guarantee that the script is working well in the future: `dummy_data.zip`

2. Let's start by creating a new branch to hold your development changes with the name of your dataset:

@@ -166,7 +164,9 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
- if some of your dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature (see the sketch below).
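
For illustration only (a minimal sketch, not part of this guide's original text; the tag names are just examples), a `ClassLabel` maps a fixed set of class names to integer ids:

```python
from datasets import ClassLabel, Features, Sequence, Value

# A ClassLabel maps a fixed set of class names to integer ids
chunk_tag = ClassLabel(names=["O", "B-NP", "I-NP"])
print(chunk_tag.str2int("B-NP"))  # 1
print(chunk_tag.int2str(2))       # "I-NP"

# It can be used inside a Features schema, e.g. one tag per token
features = Features(
    {
        "tokens": Sequence(Value("string")),
        "chunk_tags": Sequence(chunk_tag),
    }
)
```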


**Last step:** To check that your dataset works correctly and to create its `dataset_infos.json` file run the command:
#### Tests (optional)

To check that your dataset works correctly and to create its `dataset_infos` metadata in the dataset card, run the command:

```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
@@ -229,13 +229,13 @@ Now that your dataset script runs and create a dataset with the format you expec
```
to enable the slow tests, instead of `RUN_SLOW=1`.

3. If all tests pass, your dataset works correctly. You can finally create the metadata JSON by running the command:
3. If all tests pass, your dataset works correctly. You can finally create the metadata by running the command:

```bash
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

This first command should create a `dataset_infos.json` file in your dataset folder.
This first command should create a `README.md` file in your dataset folder if it doesn't exist already, containing the metadata.


You have now finished the coding part, congratulations! 🎉 You are awesome! 😎
92 changes: 91 additions & 1 deletion datasets/conll2000/README.md
@@ -3,6 +3,96 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_info:
features:
- name: id
dtype: string
- name: tokens
sequence: string
- name: pos_tags
sequence:
class_label:
names:
0: ''''''
1: '#'
2: $
3: (
4: )
5: ','
6: .
7: ':'
8: '``'
9: CC
10: CD
11: DT
12: EX
13: FW
14: IN
15: JJ
16: JJR
17: JJS
18: MD
19: NN
20: NNP
21: NNPS
22: NNS
23: PDT
24: POS
25: PRP
26: PRP$
27: RB
28: RBR
29: RBS
30: RP
31: SYM
32: TO
33: UH
34: VB
35: VBD
36: VBG
37: VBN
38: VBP
39: VBZ
40: WDT
41: WP
42: WP$
43: WRB
- name: chunk_tags
sequence:
class_label:
names:
0: O
1: B-ADJP
2: I-ADJP
3: B-ADVP
4: I-ADVP
5: B-CONJP
6: I-CONJP
7: B-INTJ
8: I-INTJ
9: B-LST
10: I-LST
11: B-NP
12: I-NP
13: B-PP
14: I-PP
15: B-PRT
16: I-PRT
17: B-SBAR
18: I-SBAR
19: B-UCP
20: I-UCP
21: B-VP
22: I-VP
splits:
- name: test
num_bytes: 1201151
num_examples: 2013
- name: train
num_bytes: 5356965
num_examples: 8937
download_size: 3481560
dataset_size: 6558116
---

# Dataset Card for "conll2000"
@@ -173,4 +263,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@vblagoje](https://github.com/vblagoje), [@jplu](https://github.com/jplu) for adding this dataset.
16 changes: 0 additions & 16 deletions datasets/conll2000/conll2000.py
@@ -53,25 +53,9 @@
_TEST_FILE = "test.txt"


class Conll2000Config(datasets.BuilderConfig):
"""BuilderConfig for Conll2000"""

def __init__(self, **kwargs):
"""BuilderConfig forConll2000.

Args:
**kwargs: keyword arguments forwarded to super.
"""
super(Conll2000Config, self).__init__(**kwargs)


class Conll2000(datasets.GeneratorBasedBuilder):
"""Conll2000 dataset."""

BUILDER_CONFIGS = [
Conll2000Config(name="conll2000", version=datasets.Version("1.0.0"), description="Conll2000 dataset"),
]

def _info(self):
return datasets.DatasetInfo(
description=_DESCRIPTION,
1 change: 0 additions & 1 deletion datasets/conll2000/dataset_infos.json

This file was deleted.

12 changes: 11 additions & 1 deletion datasets/crime_and_punish/README.md
@@ -3,6 +3,16 @@ language:
- en
paperswithcode_id: null
pretty_name: CrimeAndPunish
dataset_info:
dataset_size: 1270540
download_size: 1201735
features:
- dtype: string
name: line
splits:
- name: train
num_bytes: 1270540
num_examples: 21969
---

# Dataset Card for "crime_and_punish"
@@ -144,4 +154,4 @@ The data fields are the same among all splits.

### Contributions

Thanks to [@patrickvonplaten](https://github.com/patrickvonplaten), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
46 changes: 7 additions & 39 deletions datasets/crime_and_punish/crime_and_punish.py
@@ -8,36 +8,7 @@
_DATA_URL = "https://raw.githubusercontent.com/patrickvonplaten/datasets/master/crime_and_punishment.txt"


class CrimeAndPunishConfig(datasets.BuilderConfig):
"""BuilderConfig for Crime and Punish."""

def __init__(self, data_url, **kwargs):
"""BuilderConfig for BlogAuthorship

Args:
data_url: `string`, url to the dataset (word or raw level)
**kwargs: keyword arguments forwarded to super.
"""
super(CrimeAndPunishConfig, self).__init__(
version=datasets.Version(
"1.0.0",
),
**kwargs,
)
self.data_url = data_url


class CrimeAndPunish(datasets.GeneratorBasedBuilder):

VERSION = datasets.Version("0.1.0")
BUILDER_CONFIGS = [
CrimeAndPunishConfig(
name="crime-and-punish",
data_url=_DATA_URL,
description="word level dataset. No processing is needed other than replacing newlines with <eos> tokens.",
),
]

def _info(self):
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
@@ -58,17 +29,14 @@ def _info(self):
def _split_generators(self, dl_manager):
"""Returns SplitGenerators."""

if self.config.name == "crime-and-punish":
data = dl_manager.download_and_extract(self.config.data_url)
data = dl_manager.download_and_extract(_DATA_URL)

return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={"data_file": data, "split": "train"},
),
]
else:
raise ValueError(f"{self.config.name} does not exist")
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={"data_file": data, "split": "train"},
),
]

def _generate_examples(self, data_file, split):

1 change: 0 additions & 1 deletion datasets/crime_and_punish/dataset_infos.json

This file was deleted.

85 changes: 84 additions & 1 deletion docs/source/dataset_card.mdx
@@ -24,4 +24,87 @@ Feel free to take a look at these dataset card examples to help you get started:
- [CNN / DailyMail](https://huggingface.co/datasets/cnn_dailymail)
- [Allociné](https://huggingface.co/datasets/allocine)

You can also check out the (similar) documentation about [dataset cards on the Hub side](https://huggingface.co/docs/hub/datasets-cards).

## More YAML tags

You can use the `dataset_info` YAML field to define additional metadata for the dataset. Here is an example for SQuAD:

```YAML
pretty_name: SQuAD
language:
- en
...
dataset_info:
features:
- name: id
dtype: string
- name: title
dtype: string
- name: context
dtype: string
- name: question
dtype: string
- name: answers
sequence:
- name: text
dtype: string
- name: answer_start
dtype: int32
splits:
- name: train
num_bytes: 79346360
num_examples: 87599
- name: validation
num_bytes: 10473040
num_examples: 10570
download_size: 35142551
dataset_size: 89819400
```

This metadata used to be included in the `dataset_infos.json` file, which is now deprecated.
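
As a rough sketch (not part of the original documentation), the same fields are exposed programmatically on [`DatasetInfo`], so you can inspect them without generating the dataset yourself:

```python
from datasets import load_dataset_builder

# Downloads the loading script and its metadata for SQuAD, but not the data itself
builder = load_dataset_builder("squad")
info = builder.info

print(info.features)  # corresponds to the `features` field above
for name, split in (info.splits or {}).items():  # guard in case no split metadata is available
    print(name, split.num_examples, split.num_bytes)  # `splits` field
print(info.download_size, info.dataset_size)  # size fields
```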

### Feature types

Using the `features` field, you can explicitly define the feature types of your dataset.
This is especially useful when the correct types are hard to infer from the data.
For example, if there's only one non-empty example in a 1TB dataset, type inference cannot determine the type of each column without going through the full dataset.
In this case, specifying the `features` field makes this much easier.
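
As an illustrative sketch (the file name and column names below are hypothetical), the `features` field mirrors the [`Features`] object you can pass when loading local files, which lets you set the column types explicitly:

```python
from datasets import Features, Value, load_dataset

# Hypothetical schema for a local CSV file with three columns
features = Features(
    {
        "id": Value("string"),
        "text": Value("string"),
        "label": Value("int64"),
    }
)

# Passing `features` sets the column types up front instead of relying on inference
dataset = load_dataset("csv", data_files="data.csv", features=features)
```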

### Split sizes

Specifying the split sizes with `num_examples` enables TQDM progress bars (otherwise the library doesn't know how many examples are left).
It also enables integrity verifications: if the generated dataset doesn't have the expected number of examples, an error is raised.

Additionally, you can add `num_bytes` to specify how big each split is.

### Dataset size

When [`load_dataset`] is called, it first downloads the dataset's raw data files, and then prepares the dataset in Arrow format.

You can specify how many bytes are required to download the raw data files with `download_size`, and use `dataset_size` for the size of the dataset in Arrow format.

### Multiple configurations

Certain datasets like `glue` have several configurations (`cola`, `sst2`, etc.) that can be loaded using `load_dataset("glue", "cola")` for example.

Each configuration can have different features, splits and sizes.
You can specify those fields per configuration using a YAML list:

```YAML
dataset_info:
- config_name: cola
features:
...
splits:
...
download_size: ...
dataset_size: ...
- config_name: sst2
features:
...
splits:
...
download_size: ...
dataset_size: ...
```
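
For instance (a sketch added for illustration, not from the original page), you can list the available configurations and load a specific one:

```python
from datasets import get_dataset_config_names, load_dataset

# Each configuration has its own features, splits and sizes
configs = get_dataset_config_names("glue")
print(configs)  # e.g. ['cola', 'sst2', ...]

dataset = load_dataset("glue", "cola")
print(dataset["train"].features)
```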
8 changes: 4 additions & 4 deletions docs/source/dataset_script.mdx
@@ -274,17 +274,17 @@ def _generate_examples(self, filepath):
}
```

## Generate dataset metadata
## (Optional) Generate dataset metadata

Adding dataset metadata is a great way to include information about your dataset. The metadata is stored in a `dataset_infos.json` file. It includes information like data file checksums, the number of examples required to confirm the dataset was correctly generated, and information about the dataset like its `features`.
Adding dataset metadata is a great way to include information about your dataset. The metadata is stored in the dataset card `README.md` in YAML. It includes information like the number of examples required to confirm the dataset was correctly generated, and information about the dataset like its `features`.

Run the following command to generate your dataset metadata in `dataset_infos.json` and make sure your new dataset loading script works correctly:
Run the following command to generate your dataset metadata in `README.md` and make sure your new dataset loading script works correctly:

```
datasets-cli test path/to/<your-dataset-loading-script> --save_infos --all_configs
```

If your dataset loading script passed the test, you should now have a `dataset_infos.json` file in your dataset folder.
If your dataset loading script passed the test, you should now have a `README.md` file in your dataset folder containing a `dataset_info` field with some metadata.
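
As a quick sanity check (a sketch, assuming the metadata was saved next to your loading script), you can reload the builder and confirm that the saved info is picked up:

```python
from datasets import load_dataset_builder

# Use the same path that was passed to `datasets-cli test`
builder = load_dataset_builder("path/to/<your-dataset-loading-script>")
print(builder.info.features)
print(builder.info.splits)
```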

## Upload to the Hub
