
Commit

Update src/datasets/arrow_dataset.py
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
mariosasko and lhoestq committed Sep 16, 2022
1 parent 5b9aac6 commit 5ade8d3
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/arrow_dataset.py
@@ -950,7 +950,7 @@ def from_generator(
 """Create a Dataset from a generator.
 Args:
-generator (:obj:`Callable`): A callable object that returns an object that supports the `iter` protocol.
+generator (:obj:`Callable`): A generator function that `yields` examples.
 features (:class:`Features`, optional): Dataset features.
 cache_dir (:obj:`str`, optional, default ``"~/.cache/huggingface/datasets"``): Directory to cache data.
 keep_in_memory (:obj:`bool`, default ``False``): Whether to copy the data in-memory.
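The docstring change above narrows the description of `generator` from "a callable that returns an iterable" to "a generator function that yields examples". A minimal, standard-library-only sketch of that distinction follows; the function names and example rows are hypothetical and are not taken from the commit.

```python
import inspect


def example_generator():
    # A generator function: calling it returns a generator that yields
    # one example (here, a dict) at a time. Matches the new wording.
    yield {"id": 0, "text": "first"}
    yield {"id": 1, "text": "second"}


def returns_iterable():
    # A plain callable that returns an iterable object. This fits the old
    # wording, but it is not a generator function.
    return [{"id": 0, "text": "first"}, {"id": 1, "text": "second"}]


# Both produce the same rows when iterated...
rows_from_generator = list(example_generator())
rows_from_callable = list(returns_iterable())

# ...but only the first is a generator function in Python's sense.
is_gen_first = inspect.isgeneratorfunction(example_generator)
is_gen_second = inspect.isgeneratorfunction(returns_iterable)

print(rows_from_generator == rows_from_callable, is_gen_first, is_gen_second)
```

With the `datasets` library installed, a function shaped like `example_generator` is what one would pass as the `generator` argument of `Dataset.from_generator`; that call is left out here so the sketch runs without third-party packages.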

1 comment on commit 5ade8d3

@github-actions


Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.008457 / 0.011353 (-0.002895) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004230 / 0.011008 (-0.006778) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031846 / 0.038508 (-0.006662) |
| read_batch_unformated after write_array2d | 0.032122 / 0.023109 (0.009013) |
| read_batch_unformated after write_flattened_sequence | 0.322778 / 0.275898 (0.046880) |
| read_batch_unformated after write_nested_sequence | 0.397894 / 0.323480 (0.074414) |
| read_col_formatted_as_numpy after write_array2d | 0.006188 / 0.007986 (-0.001797) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003510 / 0.004328 (-0.000819) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007761 / 0.004250 (0.003511) |
| read_col_unformated after write_array2d | 0.042719 / 0.037052 (0.005667) |
| read_col_unformated after write_flattened_sequence | 0.334711 / 0.258489 (0.076222) |
| read_col_unformated after write_nested_sequence | 0.376435 / 0.293841 (0.082594) |
| read_formatted_as_numpy after write_array2d | 0.041099 / 0.128546 (-0.087447) |
| read_formatted_as_numpy after write_flattened_sequence | 0.013237 / 0.075646 (-0.062410) |
| read_formatted_as_numpy after write_nested_sequence | 0.277159 / 0.419271 (-0.142112) |
| read_unformated after write_array2d | 0.058253 / 0.043533 (0.014720) |
| read_unformated after write_flattened_sequence | 0.346509 / 0.255139 (0.091370) |
| read_unformated after write_nested_sequence | 0.355365 / 0.283200 (0.072165) |
| write_array2d | 0.095184 / 0.141683 (-0.046499) |
| write_flattened_sequence | 1.562243 / 1.452155 (0.110088) |
| write_nested_sequence | 1.594233 / 1.492716 (0.101517) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.196088 / 0.018006 (0.178082) |
| get_batch_of_1024_rows | 0.551735 / 0.000490 (0.551246) |
| get_first_row | 0.008144 / 0.000200 (0.007944) |
| get_last_row | 0.000243 / 0.000054 (0.000188) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.022943 / 0.037411 (-0.014469) |
| shard | 0.112779 / 0.014526 (0.098253) |
| shuffle | 0.131141 / 0.176557 (-0.045415) |
| sort | 0.174315 / 0.737135 (-0.562820) |
| train_test_split | 0.132009 / 0.296338 (-0.164329) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.506496 / 0.215209 (0.291287) |
| read 50000 | 5.232314 / 2.077655 (3.154659) |
| read_batch 50000 10 | 2.077418 / 1.504120 (0.573298) |
| read_batch 50000 100 | 1.754518 / 1.541195 (0.213323) |
| read_batch 50000 1000 | 1.813869 / 1.468490 (0.345379) |
| read_formatted numpy 5000 | 0.670642 / 4.584777 (-3.914135) |
| read_formatted pandas 5000 | 4.788414 / 3.745712 (1.042702) |
| read_formatted tensorflow 5000 | 4.810641 / 5.269862 (-0.459220) |
| read_formatted torch 5000 | 2.447734 / 4.565676 (-2.117942) |
| read_formatted_batch numpy 5000 10 | 0.079200 / 0.424275 (-0.345075) |
| read_formatted_batch numpy 5000 1000 | 0.012692 / 0.007607 (0.005085) |
| shuffled read 5000 | 0.681546 / 0.226044 (0.455501) |
| shuffled read 50000 | 6.851220 / 2.268929 (4.582291) |
| shuffled read_batch 50000 10 | 2.709401 / 55.444624 (-52.735224) |
| shuffled read_batch 50000 100 | 2.055042 / 6.876477 (-4.821435) |
| shuffled read_batch 50000 1000 | 2.183155 / 2.142072 (0.041083) |
| shuffled read_formatted numpy 5000 | 0.839237 / 4.805227 (-3.965991) |
| shuffled read_formatted_batch numpy 5000 10 | 0.164828 / 6.500664 (-6.335836) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.063728 / 0.075469 (-0.011741) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.686089 / 1.841788 (-0.155699) |
| map fast-tokenizer batched | 14.108562 / 8.074308 (6.034254) |
| map identity | 37.361939 / 10.191392 (27.170547) |
| map identity batched | 1.084532 / 0.680424 (0.404108) |
| map no-op batched | 0.656515 / 0.534201 (0.122314) |
| map no-op batched numpy | 0.441555 / 0.579283 (-0.137728) |
| map no-op batched pandas | 0.540464 / 0.434364 (0.106100) |
| map no-op batched pytorch | 0.304398 / 0.540337 (-0.235939) |
| map no-op batched tensorflow | 0.312587 / 1.386936 (-1.074349) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
| --- | --- |
| read_batch_formatted_as_numpy after write_array2d | 0.006633 / 0.011353 (-0.004719) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004029 / 0.011008 (-0.006979) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031612 / 0.038508 (-0.006896) |
| read_batch_unformated after write_array2d | 0.030206 / 0.023109 (0.007096) |
| read_batch_unformated after write_flattened_sequence | 0.379044 / 0.275898 (0.103146) |
| read_batch_unformated after write_nested_sequence | 0.431605 / 0.323480 (0.108125) |
| read_col_formatted_as_numpy after write_array2d | 0.004382 / 0.007986 (-0.003604) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003486 / 0.004328 (-0.000842) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.005198 / 0.004250 (0.000948) |
| read_col_unformated after write_array2d | 0.040412 / 0.037052 (0.003360) |
| read_col_unformated after write_flattened_sequence | 0.386297 / 0.258489 (0.127808) |
| read_col_unformated after write_nested_sequence | 0.448502 / 0.293841 (0.154661) |
| read_formatted_as_numpy after write_array2d | 0.049857 / 0.128546 (-0.078689) |
| read_formatted_as_numpy after write_flattened_sequence | 0.013867 / 0.075646 (-0.061780) |
| read_formatted_as_numpy after write_nested_sequence | 0.282989 / 0.419271 (-0.136282) |
| read_unformated after write_array2d | 0.060918 / 0.043533 (0.017386) |
| read_unformated after write_flattened_sequence | 0.377689 / 0.255139 (0.122550) |
| read_unformated after write_nested_sequence | 0.401837 / 0.283200 (0.118637) |
| write_array2d | 0.096541 / 0.141683 (-0.045142) |
| write_flattened_sequence | 1.654954 / 1.452155 (0.202799) |
| write_nested_sequence | 1.684325 / 1.492716 (0.191609) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
| --- | --- |
| get_batch_of_1024_random_rows | 0.283906 / 0.018006 (0.265900) |
| get_batch_of_1024_rows | 0.483818 / 0.000490 (0.483328) |
| get_first_row | 0.032465 / 0.000200 (0.032265) |
| get_last_row | 0.000402 / 0.000054 (0.000347) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
| --- | --- |
| select | 0.021728 / 0.037411 (-0.015683) |
| shard | 0.103854 / 0.014526 (0.089328) |
| shuffle | 0.112196 / 0.176557 (-0.064361) |
| sort | 0.151400 / 0.737135 (-0.585735) |
| train_test_split | 0.132381 / 0.296338 (-0.163958) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
| --- | --- |
| read 5000 | 0.567652 / 0.215209 (0.352443) |
| read 50000 | 5.541936 / 2.077655 (3.464282) |
| read_batch 50000 10 | 2.252733 / 1.504120 (0.748613) |
| read_batch 50000 100 | 1.992490 / 1.541195 (0.451295) |
| read_batch 50000 1000 | 2.021411 / 1.468490 (0.552921) |
| read_formatted numpy 5000 | 0.692559 / 4.584777 (-3.892218) |
| read_formatted pandas 5000 | 4.900426 / 3.745712 (1.154713) |
| read_formatted tensorflow 5000 | 4.507241 / 5.269862 (-0.762620) |
| read_formatted torch 5000 | 2.172903 / 4.565676 (-2.392773) |
| read_formatted_batch numpy 5000 10 | 0.081531 / 0.424275 (-0.342745) |
| read_formatted_batch numpy 5000 1000 | 0.015326 / 0.007607 (0.007719) |
| shuffled read 5000 | 0.716518 / 0.226044 (0.490474) |
| shuffled read 50000 | 7.114480 / 2.268929 (4.845551) |
| shuffled read_batch 50000 10 | 2.934200 / 55.444624 (-52.510424) |
| shuffled read_batch 50000 100 | 2.329946 / 6.876477 (-4.546531) |
| shuffled read_batch 50000 1000 | 2.399162 / 2.142072 (0.257089) |
| shuffled read_formatted numpy 5000 | 0.835884 / 4.805227 (-3.969344) |
| shuffled read_formatted_batch numpy 5000 10 | 0.166524 / 6.500664 (-6.334140) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065528 / 0.075469 (-0.009941) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
| --- | --- |
| filter | 1.757316 / 1.841788 (-0.084471) |
| map fast-tokenizer batched | 14.210715 / 8.074308 (6.136407) |
| map identity | 38.272212 / 10.191392 (28.080820) |
| map identity batched | 1.150805 / 0.680424 (0.470381) |
| map no-op batched | 0.728331 / 0.534201 (0.194130) |
| map no-op batched numpy | 0.437929 / 0.579283 (-0.141354) |
| map no-op batched pandas | 0.548410 / 0.434364 (0.114046) |
| map no-op batched pytorch | 0.315366 / 0.540337 (-0.224971) |
| map no-op batched tensorflow | 0.313558 / 1.386936 (-1.073378) |
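
Each "new / old (diff)" entry in the benchmarks above appears to report the new measurement, the old measurement, and their difference (new minus old). The report does not state this, so treat it as an assumption. The sketch below recomputes the difference for three entries copied from the PyArrow==6.0.0 report; the helper name and the tolerance are hypothetical.

```python
# Entries copied from the PyArrow==6.0.0 report:
# (metric, new, old, reported diff).
ENTRIES = [
    ("get_first_row", 0.008144, 0.000200, 0.007944),
    ("select", 0.022943, 0.037411, -0.014469),
    ("map identity", 37.361939, 10.191392, 27.170547),
]

# Assumption: each printed figure is rounded to six decimals independently,
# so a recomputed diff may differ from the printed one in the last digit.
TOLERANCE = 2e-6


def recompute(entries):
    """Return (metric, new - old, matches_report) for each entry."""
    results = []
    for metric, new, old, reported in entries:
        diff = new - old
        matches = abs(diff - reported) <= TOLERANCE
        results.append((metric, round(diff, 6), matches))
    return results


results = recompute(ENTRIES)
for metric, diff, matches in results:
    print(f"{metric}: {diff:+.6f} (matches report: {matches})")
```

Under this reading, a negative diff means the new value is smaller than the old one. Whether smaller is better depends on the unit, which the report does not give.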

