Merge branch 'main' of github.com:huggingface/datasets into drop-python36

mariosasko committed Jul 22, 2022
2 parents 4db3cf9 + 5088e95 commit c4b4cb6
Showing 37 changed files with 790 additions and 660 deletions.
19 changes: 4 additions & 15 deletions datasets/crd3/README.md
@@ -55,9 +55,6 @@ paperswithcode_id: crd3
- **Repository:** [CRD3 repository](https://github.com/RevanthRameshkumar/CRD3)
- **Paper:** [Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset](https://www.aclweb.org/anthology/2020.acl-main.459/)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- - **Size of downloaded dataset files:** 279.93 MB
- - **Size of the generated dataset:** 4020.33 MB
- - **Total amount of disk used:** 4300.25 MB

### Dataset Summary

@@ -69,6 +66,7 @@ collaboration and spoken interaction.
and semantic ties to the previous dialogues.

### Supported Tasks and Leaderboards

`summarization`: The dataset can be used to train a model for abstractive summarization. A [fast abstractive summarization-RL](https://github.com/ChenRocks/fast_abs_rl) model was presented as a baseline, which achieves ROUGE-L-F1 of 25.18.
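
As a rough illustration of the metric only (not the paper's evaluation code), a ROUGE-L score like the one reported can be computed with the `evaluate` package; the two strings below are made-up placeholders:

```python
# Minimal ROUGE-L sketch with the `evaluate` package
# (pip install evaluate rouge_score); toy strings only.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the party defeats the dragon after a long battle"],
    references=["after a long battle, the party slays the dragon"],
)
print(scores["rougeL"])  # ROUGE-L F-measure for the toy pair
```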

### Languages
@@ -79,13 +77,8 @@ The text in the dataset is in English, as spoken by actors on The Critical Role

### Data Instances

- #### default
-
- - **Size of downloaded dataset files:** 279.93 MB
- - **Size of the generated dataset:** 4020.33 MB
- - **Total amount of disk used:** 4300.25 MB

An example of 'train' looks as follows.

```
{
    "alignment_score": 3.679936647415161,
    ...
}
```
@@ -105,7 +98,6 @@ An example of 'train' looks as follows.

The data fields are the same among all splits.

- #### default
- `chunk`: a `string` feature.
- `chunk_id`: a `int32` feature.
- `turn_start`: a `int32` feature.
@@ -120,7 +112,7 @@ The data fields are the same among all splits.

| name | train |validation| test |
|-------|------:|---------:|------:|
-|default|26,232| 3,470|4,541|
+|default|38,969| 6,327|7,500|
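
The updated row can be sanity-checked with a quick (hypothetical) session, assuming the `datasets` library and network access:

```python
from datasets import load_dataset

# Load the dataset and compare split sizes against the table above.
dataset = load_dataset("crd3")
for split in ("train", "validation", "test"):
    print(split, dataset[split].num_rows)
# Expected per the updated row: train 38969, validation 6327, test 7500
```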

## Dataset Creation

@@ -180,19 +172,16 @@ This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

### Citation Information

```bibtex
@inproceedings{
title = {Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset},
author = {Rameshkumar, Revanth and Bailey, Peter},
year = {2020},
publisher = {Association for Computational Linguistics},
conference = {ACL}
}
```


### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lhoestq](https://github.com/lhoestq), [@mariamabarham](https://github.com/mariamabarham), [@lewtun](https://github.com/lewtun) for adding this dataset.
20 changes: 11 additions & 9 deletions datasets/crd3/crd3.py
@@ -45,11 +45,11 @@
and semantic ties to the previous dialogues.
"""

_URL = "https://github.com/RevanthRameshkumar/CRD3/archive/master.zip"
_URL = "https://huggingface.co/datasets/crd3/resolve/72bffe55b4d5bf19b530d3e417447b3384ba3673/data/aligned%20data.zip"


def get_train_test_dev_files(files, test_split, train_split, dev_split):
-    test_files = dev_files = train_files = []
+    test_files, dev_files, train_files = [], [], []
    for file in files:
        filename = os.path.split(file)[1].split("_")[0]
        if filename in test_split:
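
A short aside on the one-line fix above: chained assignment binds several names to one object, which was the bug here. A minimal sketch (the filename "C2E001" is made up):

```python
# Buggy: all three names point at the SAME list object,
# so a file appended for one split shows up in all three.
test_files = dev_files = train_files = []
test_files.append("C2E001")
print(dev_files is test_files)  # True
print(dev_files)                # ['C2E001']

# Fixed (as in this commit): tuple unpacking creates three
# independent lists.
test_files, dev_files, train_files = [], [], []
test_files.append("C2E001")
print(dev_files is test_files)  # False
print(dev_files)                # []
```
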
@@ -88,20 +88,22 @@ def _info(self):
        )

    def _split_generators(self, dl_manager):
-        path = dl_manager.download_and_extract(_URL)
-        test_file = os.path.join(path, "CRD3-master", "data", "aligned data", "test_files")
-        train_file = os.path.join(path, "CRD3-master", "data", "aligned data", "train_files")
-        dev_file = os.path.join(path, "CRD3-master", "data", "aligned data", "val_files")
+        root = dl_manager.download_and_extract(_URL)
+        path = os.path.join(root, "aligned data")
+
+        test_file = os.path.join(path, "test_files")
+        train_file = os.path.join(path, "train_files")
+        dev_file = os.path.join(path, "val_files")
        with open(test_file, encoding="utf-8") as f:
            test_splits = [file.replace("\n", "") for file in f.readlines()]

        with open(train_file, encoding="utf-8") as f:
            train_splits = [file.replace("\n", "") for file in f.readlines()]
        with open(dev_file, encoding="utf-8") as f:
            dev_splits = [file.replace("\n", "") for file in f.readlines()]
-        c2 = "CRD3-master/data/aligned data/c=2"
-        c3 = "CRD3-master/data/aligned data/c=3"
-        c4 = "CRD3-master/data/aligned data/c=4"
+        c2 = "c=2"
+        c3 = "c=3"
+        c4 = "c=4"
        files = [os.path.join(path, c2, file) for file in sorted(os.listdir(os.path.join(path, c2)))]
        files.extend([os.path.join(path, c3, file) for file in sorted(os.listdir(os.path.join(path, c3)))])
        files.extend([os.path.join(path, c4, file) for file in sorted(os.listdir(os.path.join(path, c4)))])
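
As an aside, the three `files = ...` / `files.extend(...)` lines could be collapsed into a loop; a sketch only (`gather_aligned_files` is a hypothetical helper, not part of this commit):

```python
import os

def gather_aligned_files(path):
    """Collect aligned-data files from the c=2, c=3 and c=4 subdirectories."""
    files = []
    for c in ("c=2", "c=3", "c=4"):
        subdir = os.path.join(path, c)
        files.extend(os.path.join(subdir, name) for name in sorted(os.listdir(subdir)))
    return files
```
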
2 changes: 1 addition & 1 deletion datasets/crd3/dataset_infos.json
@@ -1 +1 @@
{"default": {"description": "\nStorytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset.\nCritical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons, an open-ended role-playing game.\nThe dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding\nabstractive summaries collected from the Fandom wiki. The dataset is linguistically unique in that the narratives are generated entirely through player\ncollaboration and spoken interaction. For each dialogue, there are a large number of turns, multiple abstractive summaries with varying levels of detail,\nand semantic ties to the previous dialogues.\n", "citation": "\n@inproceedings{\ntitle = {Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset},\nauthor = {Rameshkumar, Revanth and Bailey, Peter},\nyear = {2020},\npublisher = {Association for Computational Linguistics},\nconference = {ACL}\n}\n ", "homepage": "https://github.com/RevanthRameshkumar/CRD3", "license": "", "features": {"chunk": {"dtype": "string", "id": null, "_type": "Value"}, "chunk_id": {"dtype": "int32", "id": null, "_type": "Value"}, "turn_start": {"dtype": "int32", "id": null, "_type": "Value"}, "turn_end": {"dtype": "int32", "id": null, "_type": "Value"}, "alignment_score": {"dtype": "float32", "id": null, "_type": "Value"}, "turns": {"feature": {"names": {"dtype": "string", "id": null, "_type": "Value"}, "utterances": {"dtype": "string", "id": null, "_type": "Value"}, "number": {"dtype": "int32", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "crd3", "config_name": "default", "version": {"version_str": "0.0.0", "description": null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 318560673, "num_examples": 52796, "dataset_name": "crd3"}, "test": {"name": "test", "num_bytes": 318560673, "num_examples": 52796, "dataset_name": "crd3"}, "validation": {"name": "validation", "num_bytes": 318560673, "num_examples": 52796, "dataset_name": "crd3"}}, "download_checksums": {"https://github.com/RevanthRameshkumar/CRD3/archive/master.zip": {"num_bytes": 294222220, "checksum": "c77a937394f265735ba54b32a7a051f77a97d264c74b0535dee77ef9791815b5"}}, "download_size": 294222220, "post_processing_size": null, "dataset_size": 955682019, "size_in_bytes": 1249904239}}
{"default": {"description": "\nStorytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset.\nCritical Role is an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons, an open-ended role-playing game.\nThe dataset is collected from 159 Critical Role episodes transcribed to text dialogues, consisting of 398,682 turns. It also includes corresponding\nabstractive summaries collected from the Fandom wiki. The dataset is linguistically unique in that the narratives are generated entirely through player\ncollaboration and spoken interaction. For each dialogue, there are a large number of turns, multiple abstractive summaries with varying levels of detail,\nand semantic ties to the previous dialogues.\n", "citation": "\n@inproceedings{\ntitle = {Storytelling with Dialogue: A Critical Role Dungeons and Dragons Dataset},\nauthor = {Rameshkumar, Revanth and Bailey, Peter},\nyear = {2020},\npublisher = {Association for Computational Linguistics},\nconference = {ACL}\n}\n ", "homepage": "https://github.com/RevanthRameshkumar/CRD3", "license": "", "features": {"chunk": {"dtype": "string", "id": null, "_type": "Value"}, "chunk_id": {"dtype": "int32", "id": null, "_type": "Value"}, "turn_start": {"dtype": "int32", "id": null, "_type": "Value"}, "turn_end": {"dtype": "int32", "id": null, "_type": "Value"}, "alignment_score": {"dtype": "float32", "id": null, "_type": "Value"}, "turns": [{"names": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "utterances": {"feature": {"dtype": "string", "id": null, "_type": "Value"}, "length": -1, "id": null, "_type": "Sequence"}, "number": {"dtype": "int32", "id": null, "_type": "Value"}}]}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "crd3", "config_name": "default", "version": {"version_str": "0.0.0", "description": null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 236605152, "num_examples": 38969, "dataset_name": "crd3"}, "test": {"name": "test", "num_bytes": 40269203, "num_examples": 7500, "dataset_name": "crd3"}, "validation": {"name": "validation", "num_bytes": 41543528, "num_examples": 6327, "dataset_name": "crd3"}}, "download_checksums": {"https://huggingface.co/datasets/crd3/resolve/72bffe55b4d5bf19b530d3e417447b3384ba3673/data/aligned%20data.zip": {"num_bytes": 117519820, "checksum": "c66bd9f7848bcd514a35c154edd2fc874f1a3076876d8bd7208bf3caf4b7fb0b"}}, "download_size": 117519820, "post_processing_size": null, "dataset_size": 318417883, "size_in_bytes": 435937703}}
Binary file modified datasets/crd3/dummy/0.0.0/dummy_data.zip
Binary file not shown.
3 changes: 2 additions & 1 deletion datasets/mlsum/README.md
@@ -20,12 +20,13 @@ source_datasets:
- extended|cnn_dailymail
- original
task_categories:
+- summarization
- translation
- text-classification
task_ids:
+- news-articles-summarization
- multi-class-classification
- multi-label-classification
-- summarization
- topic-classification
paperswithcode_id: mlsum
pretty_name: MLSUM

1 comment on commit c4b4cb6

@github-actions


PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---:|
| read_batch_formatted_as_numpy after write_array2d | 0.009117 / 0.011353 (-0.002236) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.002967 / 0.011008 (-0.008041) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.020847 / 0.038508 (-0.017661) |
| read_batch_unformated after write_array2d | 0.023432 / 0.023109 (0.000323) |
| read_batch_unformated after write_flattened_sequence | 0.256254 / 0.275898 (-0.019644) |
| read_batch_unformated after write_nested_sequence | 0.306597 / 0.323480 (-0.016883) |
| read_col_formatted_as_numpy after write_array2d | 0.004351 / 0.007986 (-0.003635) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003273 / 0.004328 (-0.001056) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005047 / 0.004250 (0.000796) |
| read_col_unformated after write_array2d | 0.038329 / 0.037052 (0.001277) |
| read_col_unformated after write_flattened_sequence | 0.252400 / 0.258489 (-0.006089) |
| read_col_unformated after write_nested_sequence | 0.295854 / 0.293841 (0.002013) |
| read_formatted_as_numpy after write_array2d | 0.021129 / 0.128546 (-0.107417) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007022 / 0.075646 (-0.068624) |
| read_formatted_as_numpy after write_nested_sequence | 0.200530 / 0.419271 (-0.218742) |
| read_unformated after write_array2d | 0.036558 / 0.043533 (-0.006975) |
| read_unformated after write_flattened_sequence | 0.246199 / 0.255139 (-0.008940) |
| read_unformated after write_nested_sequence | 0.276380 / 0.283200 (-0.006819) |
| write_array2d | 0.074232 / 0.141683 (-0.067451) |
| write_flattened_sequence | 1.247616 / 1.452155 (-0.204538) |
| write_nested_sequence | 1.207528 / 1.492716 (-0.285188) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---:|
| get_batch_of_1024_random_rows | 0.162736 / 0.018006 (0.144730) |
| get_batch_of_1024_rows | 0.463255 / 0.000490 (0.462765) |
| get_first_row | 0.000554 / 0.000200 (0.000354) |
| get_last_row | 0.000056 / 0.000054 (0.000002) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---:|
| select | 0.016406 / 0.037411 (-0.021005) |
| shard | 0.073740 / 0.014526 (0.059214) |
| shuffle | 0.080011 / 0.176557 (-0.096546) |
| sort | 0.114384 / 0.737135 (-0.622751) |
| train_test_split | 0.081351 / 0.296338 (-0.214987) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---:|
| read 5000 | 0.329767 / 0.215209 (0.114558) |
| read 50000 | 3.295624 / 2.077655 (1.217969) |
| read_batch 50000 10 | 1.511803 / 1.504120 (0.007683) |
| read_batch 50000 100 | 1.377860 / 1.541195 (-0.163335) |
| read_batch 50000 1000 | 1.393836 / 1.468490 (-0.074654) |
| read_formatted numpy 5000 | 0.337875 / 4.584777 (-4.246902) |
| read_formatted pandas 5000 | 2.918990 / 3.745712 (-0.826722) |
| read_formatted tensorflow 5000 | 1.468587 / 5.269862 (-3.801275) |
| read_formatted torch 5000 | 0.891992 / 4.565676 (-3.673685) |
| read_formatted_batch numpy 5000 10 | 0.044716 / 0.424275 (-0.379560) |
| read_formatted_batch numpy 5000 1000 | 0.008215 / 0.007607 (0.000608) |
| shuffled read 5000 | 0.415988 / 0.226044 (0.189943) |
| shuffled read 50000 | 4.189235 / 2.268929 (1.920307) |
| shuffled read_batch 50000 10 | 1.855979 / 55.444624 (-53.588645) |
| shuffled read_batch 50000 100 | 1.597131 / 6.876477 (-5.279345) |
| shuffled read_batch 50000 1000 | 1.720794 / 2.142072 (-0.421279) |
| shuffled read_formatted numpy 5000 | 0.435706 / 4.805227 (-4.369521) |
| shuffled read_formatted_batch numpy 5000 10 | 0.094202 / 6.500664 (-6.406462) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.045561 / 0.075469 (-0.029908) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---:|
| filter | 1.109329 / 1.841788 (-0.732458) |
| map fast-tokenizer batched | 13.370726 / 8.074308 (5.296418) |
| map identity | 19.016395 / 10.191392 (8.825003) |
| map identity batched | 0.656265 / 0.680424 (-0.024159) |
| map no-op batched | 0.454665 / 0.534201 (-0.079536) |
| map no-op batched numpy | 0.266558 / 0.579283 (-0.312725) |
| map no-op batched pandas | 0.329818 / 0.434364 (-0.104546) |
| map no-op batched pytorch | 0.177836 / 0.540337 (-0.362501) |
| map no-op batched tensorflow | 0.202651 / 1.386936 (-1.184285) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---:|
| read_batch_formatted_as_numpy after write_array2d | 0.004431 / 0.011353 (-0.006922) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003015 / 0.011008 (-0.007993) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.025303 / 0.038508 (-0.013205) |
| read_batch_unformated after write_array2d | 0.028572 / 0.023109 (0.005463) |
| read_batch_unformated after write_flattened_sequence | 0.306509 / 0.275898 (0.030611) |
| read_batch_unformated after write_nested_sequence | 0.338898 / 0.323480 (0.015418) |
| read_col_formatted_as_numpy after write_array2d | 0.002716 / 0.007986 (-0.005270) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003309 / 0.004328 (-0.001019) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.003499 / 0.004250 (-0.000751) |
| read_col_unformated after write_array2d | 0.029487 / 0.037052 (-0.007566) |
| read_col_unformated after write_flattened_sequence | 0.309897 / 0.258489 (0.051408) |
| read_col_unformated after write_nested_sequence | 0.323312 / 0.293841 (0.029472) |
| read_formatted_as_numpy after write_array2d | 0.019675 / 0.128546 (-0.108871) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007029 / 0.075646 (-0.068618) |
| read_formatted_as_numpy after write_nested_sequence | 0.198127 / 0.419271 (-0.221144) |
| read_unformated after write_array2d | 0.034980 / 0.043533 (-0.008553) |
| read_unformated after write_flattened_sequence | 0.305895 / 0.255139 (0.050756) |
| read_unformated after write_nested_sequence | 0.311157 / 0.283200 (0.027958) |
| write_array2d | 0.073236 / 0.141683 (-0.068446) |
| write_flattened_sequence | 1.192255 / 1.452155 (-0.259900) |
| write_nested_sequence | 1.191005 / 1.492716 (-0.301712) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---:|
| get_batch_of_1024_random_rows | 0.173402 / 0.018006 (0.155395) |
| get_batch_of_1024_rows | 0.379643 / 0.000490 (0.379153) |
| get_first_row | 0.008050 / 0.000200 (0.007850) |
| get_last_row | 0.000315 / 0.000054 (0.000261) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---:|
| select | 0.015549 / 0.037411 (-0.021862) |
| shard | 0.079889 / 0.014526 (0.065363) |
| shuffle | 0.081794 / 0.176557 (-0.094763) |
| sort | 0.115633 / 0.737135 (-0.621502) |
| train_test_split | 0.088899 / 0.296338 (-0.207439) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---:|
| read 5000 | 0.333646 / 0.215209 (0.118437) |
| read 50000 | 3.258509 / 2.077655 (1.180854) |
| read_batch 50000 10 | 1.615450 / 1.504120 (0.111330) |
| read_batch 50000 100 | 1.493750 / 1.541195 (-0.047444) |
| read_batch 50000 1000 | 1.575404 / 1.468490 (0.106914) |
| read_formatted numpy 5000 | 0.344097 / 4.584777 (-4.240680) |
| read_formatted pandas 5000 | 2.994121 / 3.745712 (-0.751591) |
| read_formatted tensorflow 5000 | 1.499913 / 5.269862 (-3.769948) |
| read_formatted torch 5000 | 0.881479 / 4.565676 (-3.684197) |
| read_formatted_batch numpy 5000 10 | 0.047387 / 0.424275 (-0.376889) |
| read_formatted_batch numpy 5000 1000 | 0.007802 / 0.007607 (0.000195) |
| shuffled read 5000 | 0.409022 / 0.226044 (0.182978) |
| shuffled read 50000 | 4.115645 / 2.268929 (1.846716) |
| shuffled read_batch 50000 10 | 1.907671 / 55.444624 (-53.536953) |
| shuffled read_batch 50000 100 | 1.617787 / 6.876477 (-5.258689) |
| shuffled read_batch 50000 1000 | 1.767767 / 2.142072 (-0.374305) |
| shuffled read_formatted numpy 5000 | 0.423246 / 4.805227 (-4.381981) |
| shuffled read_formatted_batch numpy 5000 10 | 0.092868 / 6.500664 (-6.407797) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.054294 / 0.075469 (-0.021175) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---:|
| filter | 1.211968 / 1.841788 (-0.629819) |
| map fast-tokenizer batched | 14.518249 / 8.074308 (6.443941) |
| map identity | 20.593619 / 10.191392 (10.402227) |
| map identity batched | 0.702371 / 0.680424 (0.021947) |
| map no-op batched | 0.459742 / 0.534201 (-0.074459) |
| map no-op batched numpy | 0.285457 / 0.579283 (-0.293826) |
| map no-op batched pandas | 0.328908 / 0.434364 (-0.105456) |
| map no-op batched pytorch | 0.189993 / 0.540337 (-0.350344) |
| map no-op batched tensorflow | 0.211712 / 1.386936 (-1.175224) |
