Add from_generator to package reference
mariosasko committed Sep 13, 2022
1 parent b216158 commit e5d7fad
Showing 2 changed files with 2 additions and 1 deletion.
1 change: 1 addition & 0 deletions docs/source/package_reference/main_classes.mdx
@@ -16,6 +16,7 @@ The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
     - from_buffer
     - from_pandas
     - from_dict
+    - from_generator
     - data
     - cache_files
     - num_columns
2 changes: 1 addition & 1 deletion src/datasets/arrow_dataset.py
@@ -950,7 +950,7 @@ def from_generator(
         """Create a Dataset from a generator.
         Args:
-            generator (:obj:`Callable`): A callable object that returns an object that supports the `iter()` protocol.
+            generator (:obj:`Callable`): A callable object that returns an object that supports the `iter` protocol.
             features (:class:`Features`, optional): Dataset features.
             cache_dir (:obj:`str`, optional, default ``"~/.cache/huggingface/datasets"``): Directory to cache data.
             keep_in_memory (:obj:`bool`, default ``False``): Whether to copy the data in-memory.
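For context, the docstring above describes `Dataset.from_generator`, which builds an Arrow-backed dataset from a callable whose iterator yields example dicts. A minimal usage sketch based on that docstring (the column names and values here are invented for illustration):

```python
from datasets import Dataset

def gen():
    # The callable passed to from_generator must return an object that
    # supports iteration and yields example dicts.
    for i in range(3):
        yield {"id": i, "text": f"example {i}"}

# Builds (and caches) a Dataset from the generator's output.
ds = Dataset.from_generator(gen)
print(ds[0])  # {'id': 0, 'text': 'example 0'}
```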

1 comment on commit e5d7fad

@github-actions

PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009498 / 0.011353 (-0.001855) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004374 / 0.011008 (-0.006635) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037486 / 0.038508 (-0.001022) |
| read_batch_unformated after write_array2d | 0.038661 / 0.023109 (0.015552) |
| read_batch_unformated after write_flattened_sequence | 0.368415 / 0.275898 (0.092517) |
| read_batch_unformated after write_nested_sequence | 0.414148 / 0.323480 (0.090668) |
| read_col_formatted_as_numpy after write_array2d | 0.006387 / 0.007986 (-0.001599) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003488 / 0.004328 (-0.000840) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007987 / 0.004250 (0.003737) |
| read_col_unformated after write_array2d | 0.049731 / 0.037052 (0.012679) |
| read_col_unformated after write_flattened_sequence | 0.369489 / 0.258489 (0.111000) |
| read_col_unformated after write_nested_sequence | 0.410342 / 0.293841 (0.116501) |
| read_formatted_as_numpy after write_array2d | 0.044211 / 0.128546 (-0.084335) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015001 / 0.075646 (-0.060645) |
| read_formatted_as_numpy after write_nested_sequence | 0.295113 / 0.419271 (-0.124159) |
| read_unformated after write_array2d | 0.061500 / 0.043533 (0.017967) |
| read_unformated after write_flattened_sequence | 0.374156 / 0.255139 (0.119017) |
| read_unformated after write_nested_sequence | 0.376152 / 0.283200 (0.092952) |
| write_array2d | 0.100080 / 0.141683 (-0.041603) |
| write_flattened_sequence | 1.783287 / 1.452155 (0.331133) |
| write_nested_sequence | 1.775241 / 1.492716 (0.282525) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.222908 / 0.018006 (0.204901) |
| get_batch_of_1024_rows | 0.532282 / 0.000490 (0.531793) |
| get_first_row | 0.012225 / 0.000200 (0.012025) |
| get_last_row | 0.000571 / 0.000054 (0.000517) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024997 / 0.037411 (-0.012414) |
| shard | 0.103363 / 0.014526 (0.088837) |
| shuffle | 0.122138 / 0.176557 (-0.054419) |
| sort | 0.169263 / 0.737135 (-0.567872) |
| train_test_split | 0.130980 / 0.296338 (-0.165359) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.575573 / 0.215209 (0.360364) |
| read 50000 | 5.837951 / 2.077655 (3.760296) |
| read_batch 50000 10 | 2.326067 / 1.504120 (0.821947) |
| read_batch 50000 100 | 1.920003 / 1.541195 (0.378808) |
| read_batch 50000 1000 | 1.910331 / 1.468490 (0.441841) |
| read_formatted numpy 5000 | 0.706331 / 4.584777 (-3.878446) |
| read_formatted pandas 5000 | 5.128536 / 3.745712 (1.382824) |
| read_formatted tensorflow 5000 | 2.884717 / 5.269862 (-2.385145) |
| read_formatted torch 5000 | 2.032073 / 4.565676 (-2.533603) |
| read_formatted_batch numpy 5000 10 | 0.083812 / 0.424275 (-0.340463) |
| read_formatted_batch numpy 5000 1000 | 0.012551 / 0.007607 (0.004944) |
| shuffled read 5000 | 0.750667 / 0.226044 (0.524623) |
| shuffled read 50000 | 7.475021 / 2.268929 (5.206092) |
| shuffled read_batch 50000 10 | 3.041478 / 55.444624 (-52.403146) |
| shuffled read_batch 50000 100 | 2.626149 / 6.876477 (-4.250327) |
| shuffled read_batch 50000 1000 | 2.607943 / 2.142072 (0.465870) |
| shuffled read_formatted numpy 5000 | 0.917134 / 4.805227 (-3.888093) |
| shuffled read_formatted_batch numpy 5000 10 | 0.175828 / 6.500664 (-6.324836) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072527 / 0.075469 (-0.002943) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.816851 / 1.841788 (-0.024936) |
| map fast-tokenizer batched | 14.912606 / 8.074308 (6.838298) |
| map identity | 39.289427 / 10.191392 (29.098035) |
| map identity batched | 1.088708 / 0.680424 (0.408284) |
| map no-op batched | 0.655583 / 0.534201 (0.121382) |
| map no-op batched numpy | 0.481082 / 0.579283 (-0.098201) |
| map no-op batched pandas | 0.560042 / 0.434364 (0.125678) |
| map no-op batched pytorch | 0.321259 / 0.540337 (-0.219078) |
| map no-op batched tensorflow | 0.329232 / 1.386936 (-1.057704) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007365 / 0.011353 (-0.003988) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004562 / 0.011008 (-0.006446) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.038216 / 0.038508 (-0.000292) |
| read_batch_unformated after write_array2d | 0.032873 / 0.023109 (0.009764) |
| read_batch_unformated after write_flattened_sequence | 0.419691 / 0.275898 (0.143793) |
| read_batch_unformated after write_nested_sequence | 0.446148 / 0.323480 (0.122668) |
| read_col_formatted_as_numpy after write_array2d | 0.004593 / 0.007986 (-0.003393) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004080 / 0.004328 (-0.000248) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006290 / 0.004250 (0.002040) |
| read_col_unformated after write_array2d | 0.042603 / 0.037052 (0.005551) |
| read_col_unformated after write_flattened_sequence | 0.396321 / 0.258489 (0.137832) |
| read_col_unformated after write_nested_sequence | 0.432459 / 0.293841 (0.138618) |
| read_formatted_as_numpy after write_array2d | 0.045507 / 0.128546 (-0.083039) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015789 / 0.075646 (-0.059857) |
| read_formatted_as_numpy after write_nested_sequence | 0.374109 / 0.419271 (-0.045163) |
| read_unformated after write_array2d | 0.069369 / 0.043533 (0.025836) |
| read_unformated after write_flattened_sequence | 0.393663 / 0.255139 (0.138524) |
| read_unformated after write_nested_sequence | 0.421196 / 0.283200 (0.137996) |
| write_array2d | 0.118327 / 0.141683 (-0.023356) |
| write_flattened_sequence | 1.737895 / 1.452155 (0.285740) |
| write_nested_sequence | 1.724792 / 1.492716 (0.232076) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.170820 / 0.018006 (0.152814) |
| get_batch_of_1024_rows | 0.488472 / 0.000490 (0.487982) |
| get_first_row | 0.005414 / 0.000200 (0.005215) |
| get_last_row | 0.000118 / 0.000054 (0.000064) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024582 / 0.037411 (-0.012830) |
| shard | 0.111456 / 0.014526 (0.096930) |
| shuffle | 0.116420 / 0.176557 (-0.060136) |
| sort | 0.164795 / 0.737135 (-0.572341) |
| train_test_split | 0.121213 / 0.296338 (-0.175125) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.647560 / 0.215209 (0.432351) |
| read 50000 | 6.420298 / 2.077655 (4.342643) |
| read_batch 50000 10 | 2.755315 / 1.504120 (1.251195) |
| read_batch 50000 100 | 2.314761 / 1.541195 (0.773566) |
| read_batch 50000 1000 | 2.291083 / 1.468490 (0.822593) |
| read_formatted numpy 5000 | 0.759935 / 4.584777 (-3.824842) |
| read_formatted pandas 5000 | 5.215763 / 3.745712 (1.470051) |
| read_formatted tensorflow 5000 | 2.907277 / 5.269862 (-2.362585) |
| read_formatted torch 5000 | 1.874673 / 4.565676 (-2.691004) |
| read_formatted_batch numpy 5000 10 | 0.096813 / 0.424275 (-0.327462) |
| read_formatted_batch numpy 5000 1000 | 0.014196 / 0.007607 (0.006589) |
| shuffled read 5000 | 0.786263 / 0.226044 (0.560219) |
| shuffled read 50000 | 8.235938 / 2.268929 (5.967010) |
| shuffled read_batch 50000 10 | 3.454411 / 55.444624 (-51.990213) |
| shuffled read_batch 50000 100 | 2.800485 / 6.876477 (-4.075992) |
| shuffled read_batch 50000 1000 | 2.846855 / 2.142072 (0.704783) |
| shuffled read_formatted numpy 5000 | 0.966463 / 4.805227 (-3.838764) |
| shuffled read_formatted_batch numpy 5000 10 | 0.198301 / 6.500664 (-6.302363) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.073263 / 0.075469 (-0.002206) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.871674 / 1.841788 (0.029887) |
| map fast-tokenizer batched | 15.195977 / 8.074308 (7.121669) |
| map identity | 41.062931 / 10.191392 (30.871539) |
| map identity batched | 1.089583 / 0.680424 (0.409159) |
| map no-op batched | 0.703173 / 0.534201 (0.168972) |
| map no-op batched numpy | 0.471508 / 0.579283 (-0.107775) |
| map no-op batched pandas | 0.603336 / 0.434364 (0.168972) |
| map no-op batched pytorch | 0.343581 / 0.540337 (-0.196757) |
| map no-op batched tensorflow | 0.364782 / 1.386936 (-1.022155) |
