Fix docs about dataset_info in YAML (#5194)
Fix docs about dataset_info
albertvillanova committed Nov 3, 2022
1 parent 2cfb5d5 commit b218965
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/source/image_dataset.mdx
@@ -347,15 +347,15 @@ def _generate_examples(self, images, metadata_path):

### Generate the dataset metadata (optional)

-The dataset metadata you added earlier now needs to be generated and stored in a file called `datasets_infos.json`. In addition to information about a datasets features and description, this file also contains data file checksums to ensure integrity.
+The dataset metadata can be generated and stored in the dataset card (`README.md` file).

-Run the following command to generate your dataset metadata in `dataset_infos.json` and make sure your new loading script works correctly:
+Run the following command to generate your dataset metadata in `README.md` and make sure your new loading script works correctly:

```bash
datasets-cli test path/to/<your-dataset-loading-script> --save_info --all_configs
```

-If your loading script passed the test, you should now have a `dataset_infos.json` file in your dataset folder.
+If your loading script passed the test, you should now have the `dataset_info` YAML fields in the header of the `README.md` file in your dataset folder.

### Upload the dataset to the Hub

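To check the outcome the updated text describes, the sketch below looks for a top-level `dataset_info` key in the YAML header of a `README.md`. It is only an illustration and is not part of this commit: the helper names `read_front_matter` and `has_dataset_info` and the sample README content are hypothetical. It deliberately avoids a YAML parser and only inspects the text between the leading `---` delimiters.

```python
def read_front_matter(readme_text: str) -> str:
    """Return the text between the leading '---' delimiters, or '' if absent."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""  # opening delimiter without a closing one


def has_dataset_info(readme_text: str) -> bool:
    """Check for a top-level `dataset_info:` key in the header (plain text check)."""
    header = read_front_matter(readme_text)
    return any(line.startswith("dataset_info:") for line in header.splitlines())


# Hypothetical README.md content, for illustration only.
example_readme = """---
dataset_info:
  features:
  - name: image
    dtype: image
---
# My dataset
"""

print(has_dataset_info(example_readme))
```

In practice the text would come from the `README.md` in the dataset folder. A real YAML parser would be more robust than this line-based check.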

1 comment on commit b218965

@github-actions

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009011 | 0.011353 | -0.002342 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004731 | 0.011008 | -0.006277 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.101719 | 0.038508 | 0.063211 |
| read_batch_unformated after write_array2d | 0.030921 | 0.023109 | 0.007812 |
| read_batch_unformated after write_flattened_sequence | 0.301886 | 0.275898 | 0.025988 |
| read_batch_unformated after write_nested_sequence | 0.367264 | 0.323480 | 0.043784 |
| read_col_formatted_as_numpy after write_array2d | 0.007287 | 0.007986 | -0.000699 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004832 | 0.004328 | 0.000504 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.078387 | 0.004250 | 0.074137 |
| read_col_unformated after write_array2d | 0.042185 | 0.037052 | 0.005132 |
| read_col_unformated after write_flattened_sequence | 0.309680 | 0.258489 | 0.051191 |
| read_col_unformated after write_nested_sequence | 0.356915 | 0.293841 | 0.063074 |
| read_formatted_as_numpy after write_array2d | 0.038280 | 0.128546 | -0.090266 |
| read_formatted_as_numpy after write_flattened_sequence | 0.014687 | 0.075646 | -0.060960 |
| read_formatted_as_numpy after write_nested_sequence | 0.328816 | 0.419271 | -0.090455 |
| read_unformated after write_array2d | 0.045184 | 0.043533 | 0.001651 |
| read_unformated after write_flattened_sequence | 0.302045 | 0.255139 | 0.046906 |
| read_unformated after write_nested_sequence | 0.327469 | 0.283200 | 0.044270 |
| write_array2d | 0.093919 | 0.141683 | -0.047763 |
| write_flattened_sequence | 1.543839 | 1.452155 | 0.091684 |
| write_nested_sequence | 1.645486 | 1.492716 | 0.152770 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.200650 | 0.018006 | 0.182644 |
| get_batch_of_1024_rows | 0.436805 | 0.000490 | 0.436315 |
| get_first_row | 0.002049 | 0.000200 | 0.001849 |
| get_last_row | 0.000075 | 0.000054 | 0.000021 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.023863 | 0.037411 | -0.013549 |
| shard | 0.099837 | 0.014526 | 0.085311 |
| shuffle | 0.107043 | 0.176557 | -0.069514 |
| sort | 0.144829 | 0.737135 | -0.592306 |
| train_test_split | 0.109750 | 0.296338 | -0.186589 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.416715 | 0.215209 | 0.201506 |
| read 50000 | 4.154464 | 2.077655 | 2.076809 |
| read_batch 50000 10 | 1.916487 | 1.504120 | 0.412367 |
| read_batch 50000 100 | 1.723028 | 1.541195 | 0.181833 |
| read_batch 50000 1000 | 1.787333 | 1.468490 | 0.318843 |
| read_formatted numpy 5000 | 0.689326 | 4.584777 | -3.895451 |
| read_formatted pandas 5000 | 3.397413 | 3.745712 | -0.348299 |
| read_formatted tensorflow 5000 | 2.878218 | 5.269862 | -2.391644 |
| read_formatted torch 5000 | 1.508892 | 4.565676 | -3.056785 |
| read_formatted_batch numpy 5000 10 | 0.081027 | 0.424275 | -0.343248 |
| read_formatted_batch numpy 5000 1000 | 0.011655 | 0.007607 | 0.004048 |
| shuffled read 5000 | 0.519657 | 0.226044 | 0.293613 |
| shuffled read 50000 | 5.188629 | 2.268929 | 2.919700 |
| shuffled read_batch 50000 10 | 2.338709 | 55.444624 | -53.105915 |
| shuffled read_batch 50000 100 | 2.024907 | 6.876477 | -4.851570 |
| shuffled read_batch 50000 1000 | 2.149147 | 2.142072 | 0.007074 |
| shuffled read_formatted numpy 5000 | 0.800425 | 4.805227 | -4.004803 |
| shuffled read_formatted_batch numpy 5000 10 | 0.147963 | 6.500664 | -6.352701 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064950 | 0.075469 | -0.010519 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.531523 | 1.841788 | -0.310264 |
| map fast-tokenizer batched | 12.935816 | 8.074308 | 4.861507 |
| map identity | 26.480697 | 10.191392 | 16.289305 |
| map identity batched | 0.866167 | 0.680424 | 0.185743 |
| map no-op batched | 0.605501 | 0.534201 | 0.071300 |
| map no-op batched numpy | 0.390453 | 0.579283 | -0.188831 |
| map no-op batched pandas | 0.408905 | 0.434364 | -0.025459 |
| map no-op batched pytorch | 0.234077 | 0.540337 | -0.306260 |
| map no-op batched tensorflow | 0.238309 | 1.386936 | -1.148627 |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006957 | 0.011353 | -0.004395 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004605 | 0.011008 | -0.006403 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098035 | 0.038508 | 0.059527 |
| read_batch_unformated after write_array2d | 0.029306 | 0.023109 | 0.006196 |
| read_batch_unformated after write_flattened_sequence | 0.342802 | 0.275898 | 0.066904 |
| read_batch_unformated after write_nested_sequence | 0.406980 | 0.323480 | 0.083500 |
| read_col_formatted_as_numpy after write_array2d | 0.005477 | 0.007986 | -0.002509 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003460 | 0.004328 | -0.000869 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.074851 | 0.004250 | 0.070601 |
| read_col_unformated after write_array2d | 0.036672 | 0.037052 | -0.000381 |
| read_col_unformated after write_flattened_sequence | 0.361244 | 0.258489 | 0.102754 |
| read_col_unformated after write_nested_sequence | 0.386145 | 0.293841 | 0.092304 |
| read_formatted_as_numpy after write_array2d | 0.032465 | 0.128546 | -0.096082 |
| read_formatted_as_numpy after write_flattened_sequence | 0.011565 | 0.075646 | -0.064081 |
| read_formatted_as_numpy after write_nested_sequence | 0.323336 | 0.419271 | -0.095936 |
| read_unformated after write_array2d | 0.042098 | 0.043533 | -0.001435 |
| read_unformated after write_flattened_sequence | 0.348104 | 0.255139 | 0.092965 |
| read_unformated after write_nested_sequence | 0.370067 | 0.283200 | 0.086867 |
| write_array2d | 0.093157 | 0.141683 | -0.048526 |
| write_flattened_sequence | 1.526246 | 1.452155 | 0.074091 |
| write_nested_sequence | 1.557448 | 1.492716 | 0.064731 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.230241 | 0.018006 | 0.212235 |
| get_batch_of_1024_rows | 0.427375 | 0.000490 | 0.426885 |
| get_first_row | 0.000806 | 0.000200 | 0.000606 |
| get_last_row | 0.000074 | 0.000054 | 0.000019 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.025327 | 0.037411 | -0.012084 |
| shard | 0.103022 | 0.014526 | 0.088496 |
| shuffle | 0.111284 | 0.176557 | -0.065272 |
| sort | 0.153033 | 0.737135 | -0.584103 |
| train_test_split | 0.115090 | 0.296338 | -0.181248 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.443473 | 0.215209 | 0.228264 |
| read 50000 | 4.430010 | 2.077655 | 2.352355 |
| read_batch 50000 10 | 2.087250 | 1.504120 | 0.583131 |
| read_batch 50000 100 | 1.872372 | 1.541195 | 0.331177 |
| read_batch 50000 1000 | 1.927850 | 1.468490 | 0.459360 |
| read_formatted numpy 5000 | 0.703595 | 4.584777 | -3.881182 |
| read_formatted pandas 5000 | 3.435487 | 3.745712 | -0.310225 |
| read_formatted tensorflow 5000 | 2.916600 | 5.269862 | -2.353262 |
| read_formatted torch 5000 | 1.430101 | 4.565676 | -3.135575 |
| read_formatted_batch numpy 5000 10 | 0.082303 | 0.424275 | -0.341972 |
| read_formatted_batch numpy 5000 1000 | 0.012129 | 0.007607 | 0.004522 |
| shuffled read 5000 | 0.543308 | 0.226044 | 0.317264 |
| shuffled read 50000 | 5.553454 | 2.268929 | 3.284526 |
| shuffled read_batch 50000 10 | 2.627188 | 55.444624 | -52.817437 |
| shuffled read_batch 50000 100 | 2.277937 | 6.876477 | -4.598539 |
| shuffled read_batch 50000 1000 | 2.326230 | 2.142072 | 0.184157 |
| shuffled read_formatted numpy 5000 | 0.815562 | 4.805227 | -3.989665 |
| shuffled read_formatted_batch numpy 5000 10 | 0.152925 | 6.500664 | -6.347740 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.067239 | 0.075469 | -0.008230 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.570985 | 1.841788 | -0.270802 |
| map fast-tokenizer batched | 13.320444 | 8.074308 | 5.246136 |
| map identity | 12.527161 | 10.191392 | 2.335769 |
| map identity batched | 0.934565 | 0.680424 | 0.254141 |
| map no-op batched | 0.657684 | 0.534201 | 0.123483 |
| map no-op batched numpy | 0.376900 | 0.579283 | -0.202383 |
| map no-op batched pandas | 0.388582 | 0.434364 | -0.045782 |
| map no-op batched pytorch | 0.217907 | 0.540337 | -0.322431 |
| map no-op batched tensorflow | 0.222823 | 1.386936 | -1.164113 |
