Commit

add error message for old versions of pyarrow
lhoestq committed Oct 14, 2022
1 parent 1856601 commit e4ae9cd
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/datasets/table.py
@@ -2158,6 +2158,8 @@ def table_iter(pa_table: pa.Table, batch_size: int, drop_last_batch=False):
         batch_size (:obj:`int`): size of each sub-table to yield
         drop_last_batch (:obj:`bool`, default `False`): Drop the last batch if it is smaller than `batch_size`
     """
+    if config.PYARROW_VERSION.major < 8:
+        raise RuntimeError(f"pyarrow>=8.0.0 is needed to use table_iter but you have {config.PYARROW_VERSION}")
     chunks_buffer = []
     chunks_buffer_size = 0
     for chunk in pa_table.to_reader(max_chunksize=batch_size):
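The two added lines guard `table_iter` with a major-version check so that users on an old pyarrow get an actionable error instead of a failure deeper in the function. The pattern can be sketched in isolation as follows. This is a minimal illustration, not the library's code: `SimpleVersion`, `parse_version`, and `require_pyarrow_major` are hypothetical names, and the only property assumed of `config.PYARROW_VERSION` is the one the diff itself relies on, namely that it exposes `.major` and formats as a version string.

```python
from typing import NamedTuple


class SimpleVersion(NamedTuple):
    """Hypothetical stand-in for a parsed version object exposing `.major`."""

    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse_version(text: str) -> SimpleVersion:
    """Parse a plain 'X.Y.Z' string (no pre-release or local suffix handling)."""
    major, minor, patch = (int(part) for part in text.split(".")[:3])
    return SimpleVersion(major, minor, patch)


def require_pyarrow_major(installed: SimpleVersion, minimum_major: int = 8) -> None:
    """Fail early with a clear message if the installed version is too old."""
    if installed.major < minimum_major:
        raise RuntimeError(
            f"pyarrow>={minimum_major}.0.0 is needed to use table_iter "
            f"but you have {installed}"
        )


# Usage: a recent version passes silently, an old one raises RuntimeError.
require_pyarrow_major(parse_version("8.0.0"))
try:
    require_pyarrow_major(parse_version("6.0.0"))
except RuntimeError as err:
    print(err)
```

Placing the check at the top of the function keeps the failure close to the call site, which is the same choice the commit makes.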

1 comment on commit e4ae9cd

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.011758 / 0.011353 (0.000405) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006385 / 0.011008 (-0.004623) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.120806 / 0.038508 (0.082297) |
| read_batch_unformated after write_array2d | 0.048410 / 0.023109 (0.025301) |
| read_batch_unformated after write_flattened_sequence | 0.360369 / 0.275898 (0.084471) |
| read_batch_unformated after write_nested_sequence | 0.450943 / 0.323480 (0.127463) |
| read_col_formatted_as_numpy after write_array2d | 0.009930 / 0.007986 (0.001944) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004992 / 0.004328 (0.000664) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.090395 / 0.004250 (0.086145) |
| read_col_unformated after write_array2d | 0.057800 / 0.037052 (0.020747) |
| read_col_unformated after write_flattened_sequence | 0.377389 / 0.258489 (0.118900) |
| read_col_unformated after write_nested_sequence | 0.419244 / 0.293841 (0.125403) |
| read_formatted_as_numpy after write_array2d | 0.051155 / 0.128546 (-0.077391) |
| read_formatted_as_numpy after write_flattened_sequence | 0.018344 / 0.075646 (-0.057302) |
| read_formatted_as_numpy after write_nested_sequence | 0.417090 / 0.419271 (-0.002181) |
| read_unformated after write_array2d | 0.062470 / 0.043533 (0.018937) |
| read_unformated after write_flattened_sequence | 0.362932 / 0.255139 (0.107793) |
| read_unformated after write_nested_sequence | 0.388823 / 0.283200 (0.105623) |
| write_array2d | 0.129828 / 0.141683 (-0.011855) |
| write_flattened_sequence | 1.798310 / 1.452155 (0.346155) |
| write_nested_sequence | 1.860014 / 1.492716 (0.367297) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.018009 / 0.018006 (0.000003) |
| get_batch_of_1024_rows | 0.511491 / 0.000490 (0.511001) |
| get_first_row | 0.007726 / 0.000200 (0.007526) |
| get_last_row | 0.000413 / 0.000054 (0.000358) |
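Each benchmark cell has the form `new / old (diff)`, where the reported diff is `new - old`. A minimal sketch for parsing a cell and checking that arithmetic is given below. The helper names `parse_cell` and `is_consistent` are hypothetical, and the tolerance is an assumption: the printed values are rounded to six decimals, so a recomputed diff can be off by about one unit in the last digit (for example `get_last_row`, where 0.000413 - 0.000054 gives 0.000359 but the comment shows 0.000358).

```python
import re

# Matches cells such as "0.018009 / 0.018006 (0.000003)".
CELL_PATTERN = re.compile(r"([\d.]+) / ([\d.]+) \((-?[\d.]+)\)")


def parse_cell(cell: str) -> tuple[float, float, float]:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell: str, tolerance: float = 2e-6) -> bool:
    """Check that diff equals new - old, allowing for six-decimal rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tolerance


# Usage with two cells copied from the comment above.
print(parse_cell("0.018009 / 0.018006 (0.000003)"))
print(is_consistent("0.000413 / 0.000054 (0.000358)"))
```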

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.027553 / 0.037411 (-0.009858) |
| shard | 0.131377 / 0.014526 (0.116852) |
| shuffle | 0.134927 / 0.176557 (-0.041630) |
| sort | 0.185637 / 0.737135 (-0.551499) |
| train_test_split | 0.140567 / 0.296338 (-0.155772) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.475194 / 0.215209 (0.259985) |
| read 50000 | 4.739654 / 2.077655 (2.661999) |
| read_batch 50000 10 | 2.128776 / 1.504120 (0.624657) |
| read_batch 50000 100 | 1.941426 / 1.541195 (0.400231) |
| read_batch 50000 1000 | 1.983305 / 1.468490 (0.514815) |
| read_formatted numpy 5000 | 0.826919 / 4.584777 (-3.757858) |
| read_formatted pandas 5000 | 5.822983 / 3.745712 (2.077271) |
| read_formatted tensorflow 5000 | 4.625166 / 5.269862 (-0.644696) |
| read_formatted torch 5000 | 2.381200 / 4.565676 (-2.184476) |
| read_formatted_batch numpy 5000 10 | 0.100240 / 0.424275 (-0.324035) |
| read_formatted_batch numpy 5000 1000 | 0.014355 / 0.007607 (0.006748) |
| shuffled read 5000 | 0.600123 / 0.226044 (0.374078) |
| shuffled read 50000 | 5.978263 / 2.268929 (3.709335) |
| shuffled read_batch 50000 10 | 2.644816 / 55.444624 (-52.799809) |
| shuffled read_batch 50000 100 | 2.290527 / 6.876477 (-4.585950) |
| shuffled read_batch 50000 1000 | 2.478866 / 2.142072 (0.336793) |
| shuffled read_formatted numpy 5000 | 1.001288 / 4.805227 (-3.803940) |
| shuffled read_formatted_batch numpy 5000 10 | 0.197398 / 6.500664 (-6.303266) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.077607 / 0.075469 (0.002137) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.916571 / 1.841788 (0.074783) |
| map fast-tokenizer batched | 17.914321 / 8.074308 (9.840013) |
| map identity | 29.654710 / 10.191392 (19.463318) |
| map identity batched | 1.081130 / 0.680424 (0.400707) |
| map no-op batched | 0.669590 / 0.534201 (0.135389) |
| map no-op batched numpy | 0.521637 / 0.579283 (-0.057646) |
| map no-op batched pandas | 0.626378 / 0.434364 (0.192014) |
| map no-op batched pytorch | 0.343690 / 0.540337 (-0.196647) |
| map no-op batched tensorflow | 0.327022 / 1.386936 (-1.059914) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009209 / 0.011353 (-0.002144) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.006373 / 0.011008 (-0.004635) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.118609 / 0.038508 (0.080101) |
| read_batch_unformated after write_array2d | 0.044376 / 0.023109 (0.021267) |
| read_batch_unformated after write_flattened_sequence | 0.407420 / 0.275898 (0.131522) |
| read_batch_unformated after write_nested_sequence | 0.449781 / 0.323480 (0.126301) |
| read_col_formatted_as_numpy after write_array2d | 0.007245 / 0.007986 (-0.000740) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004986 / 0.004328 (0.000658) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.088800 / 0.004250 (0.084549) |
| read_col_unformated after write_array2d | 0.050379 / 0.037052 (0.013327) |
| read_col_unformated after write_flattened_sequence | 0.408016 / 0.258489 (0.149527) |
| read_col_unformated after write_nested_sequence | 0.464417 / 0.293841 (0.170576) |
| read_formatted_as_numpy after write_array2d | 0.045184 / 0.128546 (-0.083363) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014588 / 0.075646 (-0.061058) |
| read_formatted_as_numpy after write_nested_sequence | 0.406400 / 0.419271 (-0.012871) |
| read_unformated after write_array2d | 0.060022 / 0.043533 (0.016489) |
| read_unformated after write_flattened_sequence | 0.408606 / 0.255139 (0.153467) |
| read_unformated after write_nested_sequence | 0.432743 / 0.283200 (0.149544) |
| write_array2d | 0.125060 / 0.141683 (-0.016623) |
| write_flattened_sequence | 1.738584 / 1.452155 (0.286430) |
| write_nested_sequence | 1.858661 / 1.492716 (0.365945) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.255108 / 0.018006 (0.237102) |
| get_batch_of_1024_rows | 0.548993 / 0.000490 (0.548503) |
| get_first_row | 0.005081 / 0.000200 (0.004881) |
| get_last_row | 0.000164 / 0.000054 (0.000109) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.028643 / 0.037411 (-0.008768) |
| shard | 0.128304 / 0.014526 (0.113778) |
| shuffle | 0.140914 / 0.176557 (-0.035642) |
| sort | 0.191713 / 0.737135 (-0.545423) |
| train_test_split | 0.142864 / 0.296338 (-0.153475) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.526383 / 0.215209 (0.311173) |
| read 50000 | 5.224932 / 2.077655 (3.147278) |
| read_batch 50000 10 | 2.699534 / 1.504120 (1.195414) |
| read_batch 50000 100 | 2.473016 / 1.541195 (0.931821) |
| read_batch 50000 1000 | 2.503907 / 1.468490 (1.035417) |
| read_formatted numpy 5000 | 0.850267 / 4.584777 (-3.734509) |
| read_formatted pandas 5000 | 5.445174 / 3.745712 (1.699462) |
| read_formatted tensorflow 5000 | 2.580761 / 5.269862 (-2.689101) |
| read_formatted torch 5000 | 1.612903 / 4.565676 (-2.952774) |
| read_formatted_batch numpy 5000 10 | 0.102778 / 0.424275 (-0.321497) |
| read_formatted_batch numpy 5000 1000 | 0.014731 / 0.007607 (0.007124) |
| shuffled read 5000 | 0.640783 / 0.226044 (0.414739) |
| shuffled read 50000 | 6.375608 / 2.268929 (4.106680) |
| shuffled read_batch 50000 10 | 3.211288 / 55.444624 (-52.233336) |
| shuffled read_batch 50000 100 | 2.905443 / 6.876477 (-3.971034) |
| shuffled read_batch 50000 1000 | 2.979154 / 2.142072 (0.837082) |
| shuffled read_formatted numpy 5000 | 1.010131 / 4.805227 (-3.795096) |
| shuffled read_formatted_batch numpy 5000 10 | 0.200443 / 6.500664 (-6.300221) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.076758 / 0.075469 (0.001289) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.997714 / 1.841788 (0.155927) |
| map fast-tokenizer batched | 18.297336 / 8.074308 (10.223028) |
| map identity | 14.106595 / 10.191392 (3.915203) |
| map identity batched | 1.188648 / 0.680424 (0.508224) |
| map no-op batched | 0.768206 / 0.534201 (0.234005) |
| map no-op batched numpy | 0.500726 / 0.579283 (-0.078557) |
| map no-op batched pandas | 0.670397 / 0.434364 (0.236033) |
| map no-op batched pytorch | 0.303039 / 0.540337 (-0.237298) |
| map no-op batched tensorflow | 0.323321 / 1.386936 (-1.063615) |