Commit

mention pyarrow>=8 in docstring as well
lhoestq committed Oct 14, 2022
1 parent e4ae9cd commit ed7631d
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/datasets/table.py
@@ -2152,6 +2152,8 @@ def _visit(array, feature):

def table_iter(pa_table: pa.Table, batch_size: int, drop_last_batch=False):
    """Iterate over sub-tables of size `batch_size`.

    Requires pyarrow>=8.0.0

    Args:
        table (:obj:`pyarrow.Table`): PyArrow table to iterate over
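The docstring above describes splitting a table into consecutive sub-tables of `batch_size` rows, with an option to drop a short final batch. A minimal sketch of that behaviour follows. It is not the implementation in `src/datasets/table.py`: the helper name `iter_sub_tables` is hypothetical, and the sketch assumes only that the table exposes `num_rows` and `slice(offset, length)`, which `pyarrow.Table` does.

```python
from typing import Any, Iterator


def iter_sub_tables(table: Any, batch_size: int, drop_last_batch: bool = False) -> Iterator[Any]:
    """Yield consecutive slices of `table`, each with at most `batch_size` rows.

    Assumption: `table` has a `num_rows` attribute and a `slice(offset, length)`
    method, as `pyarrow.Table` does. Hypothetical helper, for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    num_rows = table.num_rows
    for offset in range(0, num_rows, batch_size):
        # The last slice may be shorter than batch_size.
        length = min(batch_size, num_rows - offset)
        if length < batch_size and drop_last_batch:
            return
        yield table.slice(offset, length)
```

With pyarrow installed, `iter_sub_tables(pa.table({"x": list(range(10))}), batch_size=4)` would be expected to yield sub-tables of 4, 4 and 2 rows, or only the first two when `drop_last_batch=True`.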

1 comment on commit ed7631d

@github-actions
Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.012008 / 0.011353 (0.000656) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005935 / 0.011008 (-0.005073) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.142792 / 0.038508 (0.104284) |
| read_batch_unformated after write_array2d | 0.039626 / 0.023109 (0.016517) |
| read_batch_unformated after write_flattened_sequence | 0.389481 / 0.275898 (0.113583) |
| read_batch_unformated after write_nested_sequence | 0.475048 / 0.323480 (0.151568) |
| read_col_formatted_as_numpy after write_array2d | 0.009141 / 0.007986 (0.001155) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005133 / 0.004328 (0.000805) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.100700 / 0.004250 (0.096450) |
| read_col_unformated after write_array2d | 0.043904 / 0.037052 (0.006851) |
| read_col_unformated after write_flattened_sequence | 0.404437 / 0.258489 (0.145948) |
| read_col_unformated after write_nested_sequence | 0.482382 / 0.293841 (0.188541) |
| read_formatted_as_numpy after write_array2d | 0.064255 / 0.128546 (-0.064291) |
| read_formatted_as_numpy after write_flattened_sequence | 0.025560 / 0.075646 (-0.050086) |
| read_formatted_as_numpy after write_nested_sequence | 0.447716 / 0.419271 (0.028445) |
| read_unformated after write_array2d | 0.067548 / 0.043533 (0.024016) |
| read_unformated after write_flattened_sequence | 0.386062 / 0.255139 (0.130923) |
| read_unformated after write_nested_sequence | 0.459227 / 0.283200 (0.176027) |
| write_array2d | 0.108326 / 0.141683 (-0.033357) |
| write_flattened_sequence | 1.912930 / 1.452155 (0.460776) |
| write_nested_sequence | 1.934000 / 1.492716 (0.441284) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.013607 / 0.018006 (-0.004400) |
| get_batch_of_1024_rows | 0.566212 / 0.000490 (0.565722) |
| get_first_row | 0.005051 / 0.000200 (0.004851) |
| get_last_row | 0.000142 / 0.000054 (0.000087) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.026652 / 0.037411 (-0.010760) |
| shard | 0.110755 / 0.014526 (0.096230) |
| shuffle | 0.128200 / 0.176557 (-0.048357) |
| sort | 0.189353 / 0.737135 (-0.547783) |
| train_test_split | 0.129373 / 0.296338 (-0.166966) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.607751 / 0.215209 (0.392542) |
| read 50000 | 6.043378 / 2.077655 (3.965724) |
| read_batch 50000 10 | 2.333934 / 1.504120 (0.829815) |
| read_batch 50000 100 | 2.064700 / 1.541195 (0.523506) |
| read_batch 50000 1000 | 2.072919 / 1.468490 (0.604429) |
| read_formatted numpy 5000 | 1.193340 / 4.584777 (-3.391436) |
| read_formatted pandas 5000 | 5.377566 / 3.745712 (1.631854) |
| read_formatted tensorflow 5000 | 5.293720 / 5.269862 (0.023859) |
| read_formatted torch 5000 | 2.809408 / 4.565676 (-1.756269) |
| read_formatted_batch numpy 5000 10 | 0.143372 / 0.424275 (-0.280903) |
| read_formatted_batch numpy 5000 1000 | 0.014890 / 0.007607 (0.007283) |
| shuffled read 5000 | 0.782114 / 0.226044 (0.556070) |
| shuffled read 50000 | 8.036556 / 2.268929 (5.767628) |
| shuffled read_batch 50000 10 | 3.235485 / 55.444624 (-52.209139) |
| shuffled read_batch 50000 100 | 2.397937 / 6.876477 (-4.478540) |
| shuffled read_batch 50000 1000 | 2.587957 / 2.142072 (0.445885) |
| shuffled read_formatted numpy 5000 | 1.487716 / 4.805227 (-3.317511) |
| shuffled read_formatted_batch numpy 5000 10 | 0.248913 / 6.500664 (-6.251751) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.076431 / 0.075469 (0.000962) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.933337 / 1.841788 (0.091550) |
| map fast-tokenizer batched | 16.929323 / 8.074308 (8.855015) |
| map identity | 42.954324 / 10.191392 (32.762932) |
| map identity batched | 1.194935 / 0.680424 (0.514512) |
| map no-op batched | 0.744476 / 0.534201 (0.210275) |
| map no-op batched numpy | 0.569671 / 0.579283 (-0.009612) |
| map no-op batched pandas | 0.612883 / 0.434364 (0.178519) |
| map no-op batched pytorch | 0.375548 / 0.540337 (-0.164790) |
| map no-op batched tensorflow | 0.370791 / 1.386936 (-1.016145) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009907 / 0.011353 (-0.001446) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006251 / 0.011008 (-0.004758) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.141341 / 0.038508 (0.102833) |
| read_batch_unformated after write_array2d | 0.037926 / 0.023109 (0.014817) |
| read_batch_unformated after write_flattened_sequence | 0.426508 / 0.275898 (0.150610) |
| read_batch_unformated after write_nested_sequence | 0.469651 / 0.323480 (0.146171) |
| read_col_formatted_as_numpy after write_array2d | 0.007121 / 0.007986 (-0.000864) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005334 / 0.004328 (0.001005) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.100143 / 0.004250 (0.095893) |
| read_col_unformated after write_array2d | 0.043246 / 0.037052 (0.006193) |
| read_col_unformated after write_flattened_sequence | 0.405898 / 0.258489 (0.147409) |
| read_col_unformated after write_nested_sequence | 0.524118 / 0.293841 (0.230277) |
| read_formatted_as_numpy after write_array2d | 0.055741 / 0.128546 (-0.072805) |
| read_formatted_as_numpy after write_flattened_sequence | 0.019975 / 0.075646 (-0.055671) |
| read_formatted_as_numpy after write_nested_sequence | 0.429455 / 0.419271 (0.010184) |
| read_unformated after write_array2d | 0.070062 / 0.043533 (0.026529) |
| read_unformated after write_flattened_sequence | 0.427629 / 0.255139 (0.172490) |
| read_unformated after write_nested_sequence | 0.447975 / 0.283200 (0.164775) |
| write_array2d | 0.113934 / 0.141683 (-0.027749) |
| write_flattened_sequence | 1.868658 / 1.452155 (0.416503) |
| write_nested_sequence | 1.971118 / 1.492716 (0.478402) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.233530 / 0.018006 (0.215523) |
| get_batch_of_1024_rows | 0.499936 / 0.000490 (0.499447) |
| get_first_row | 0.007720 / 0.000200 (0.007521) |
| get_last_row | 0.000126 / 0.000054 (0.000072) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.029096 / 0.037411 (-0.008315) |
| shard | 0.120903 / 0.014526 (0.106377) |
| shuffle | 0.131141 / 0.176557 (-0.045416) |
| sort | 0.190602 / 0.737135 (-0.546533) |
| train_test_split | 0.138792 / 0.296338 (-0.157547) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.624073 / 0.215209 (0.408864) |
| read 50000 | 6.367776 / 2.077655 (4.290122) |
| read_batch 50000 10 | 2.645369 / 1.504120 (1.141249) |
| read_batch 50000 100 | 2.363106 / 1.541195 (0.821912) |
| read_batch 50000 1000 | 2.309194 / 1.468490 (0.840704) |
| read_formatted numpy 5000 | 1.248456 / 4.584777 (-3.336321) |
| read_formatted pandas 5000 | 5.445290 / 3.745712 (1.699578) |
| read_formatted tensorflow 5000 | 5.263392 / 5.269862 (-0.006469) |
| read_formatted torch 5000 | 2.276895 / 4.565676 (-2.288781) |
| read_formatted_batch numpy 5000 10 | 0.147192 / 0.424275 (-0.277083) |
| read_formatted_batch numpy 5000 1000 | 0.014434 / 0.007607 (0.006827) |
| shuffled read 5000 | 0.769608 / 0.226044 (0.543563) |
| shuffled read 50000 | 7.800331 / 2.268929 (5.531402) |
| shuffled read_batch 50000 10 | 3.396933 / 55.444624 (-52.047692) |
| shuffled read_batch 50000 100 | 2.768541 / 6.876477 (-4.107935) |
| shuffled read_batch 50000 1000 | 2.823788 / 2.142072 (0.681715) |
| shuffled read_formatted numpy 5000 | 1.481027 / 4.805227 (-3.324201) |
| shuffled read_formatted_batch numpy 5000 10 | 0.249265 / 6.500664 (-6.251399) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.077160 / 0.075469 (0.001691) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.162945 / 1.841788 (0.321158) |
| map fast-tokenizer batched | 16.884475 / 8.074308 (8.810167) |
| map identity | 19.862182 / 10.191392 (9.670790) |
| map identity batched | 1.199678 / 0.680424 (0.519254) |
| map no-op batched | 0.781392 / 0.534201 (0.247191) |
| map no-op batched numpy | 0.526558 / 0.579283 (-0.052725) |
| map no-op batched pandas | 0.581085 / 0.434364 (0.146721) |
| map no-op batched pytorch | 0.340345 / 0.540337 (-0.199992) |
| map no-op batched tensorflow | 0.363794 / 1.386936 (-1.023142) |
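
Each benchmark cell above has the form `new / old (diff)`, and in the reported numbers the diff matches `new - old` up to rounding in the last digit. The sketch below parses such cells from a row of text so the values can be compared programmatically. The names `Cell`, `parse_cells` and `ratio` are hypothetical helpers, not part of the benchmark tooling, and the sample row is copied from the `benchmark_getitem_100B.json` results for PyArrow==6.0.0.

```python
import re
from typing import List, NamedTuple

# Matches one "new / old (diff)" cell, e.g. "0.013607 / 0.018006 (-0.004400)".
CELL_PATTERN = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


class Cell(NamedTuple):
    new: float
    old: float
    diff: float


def parse_cells(row: str) -> List[Cell]:
    """Extract every "new / old (diff)" cell found in `row`, in order."""
    return [Cell(*(float(g) for g in m.groups())) for m in CELL_PATTERN.finditer(row)]


def ratio(cell: Cell) -> float:
    """Return new / old; values above 1 mean the new measurement is larger."""
    return cell.new / cell.old


# Sample row: benchmark_getitem_100B.json, PyArrow==6.0.0.
sample_row = (
    "0.013607 / 0.018006 (-0.004400) 0.566212 / 0.000490 (0.565722) "
    "0.005051 / 0.000200 (0.004851) 0.000142 / 0.000054 (0.000087)"
)
for cell in parse_cells(sample_row):
    # Reported diffs agree with new - old up to rounding of the printed digits.
    assert abs((cell.new - cell.old) - cell.diff) < 5e-6
    print(f"new={cell.new:.6f} old={cell.old:.6f} ratio={ratio(cell):.2f}")
```

Looking at the ratio rather than the absolute diff makes it easier to compare metrics whose baselines differ by orders of magnitude.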
