Commit

Update src/datasets/table.py
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
lhoestq and albertvillanova committed Oct 14, 2022
1 parent 4ac0c1b commit d5f243c
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/table.py
@@ -2151,7 +2151,7 @@ def _visit(array, feature):


 def table_iter(pa_table: pa.Table, batch_size: int, drop_last_batch=False):
-    """Iterate ober sub-tables of size `batch_size`.
+    """Iterate over sub-tables of size `batch_size`.
     Requires pyarrow>=8.0.0

1 comment on commit d5f243c

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009951 / 0.011353 (-0.001402) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005865 / 0.011008 (-0.005143) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.103218 / 0.038508 (0.064710) |
| read_batch_unformated after write_array2d | 0.040663 / 0.023109 (0.017554) |
| read_batch_unformated after write_flattened_sequence | 0.299380 / 0.275898 (0.023482) |
| read_batch_unformated after write_nested_sequence | 0.376883 / 0.323480 (0.053403) |
| read_col_formatted_as_numpy after write_array2d | 0.008762 / 0.007986 (0.000776) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006303 / 0.004328 (0.001975) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.075945 / 0.004250 (0.071695) |
| read_col_unformated after write_array2d | 0.047911 / 0.037052 (0.010859) |
| read_col_unformated after write_flattened_sequence | 0.307615 / 0.258489 (0.049125) |
| read_col_unformated after write_nested_sequence | 0.359996 / 0.293841 (0.066155) |
| read_formatted_as_numpy after write_array2d | 0.045043 / 0.128546 (-0.083503) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015424 / 0.075646 (-0.060223) |
| read_formatted_as_numpy after write_nested_sequence | 0.340073 / 0.419271 (-0.079198) |
| read_unformated after write_array2d | 0.052613 / 0.043533 (0.009080) |
| read_unformated after write_flattened_sequence | 0.299641 / 0.255139 (0.044502) |
| read_unformated after write_nested_sequence | 0.321613 / 0.283200 (0.038413) |
| write_array2d | 0.109341 / 0.141683 (-0.032342) |
| write_flattened_sequence | 1.509194 / 1.452155 (0.057039) |
| write_nested_sequence | 1.536263 / 1.492716 (0.043546) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.011005 / 0.018006 (-0.007001) |
| get_batch_of_1024_rows | 0.516317 / 0.000490 (0.515828) |
| get_first_row | 0.004545 / 0.000200 (0.004345) |
| get_last_row | 0.000101 / 0.000054 (0.000047) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.026332 / 0.037411 (-0.011080) |
| shard | 0.111260 / 0.014526 (0.096735) |
| shuffle | 0.119010 / 0.176557 (-0.057546) |
| sort | 0.167847 / 0.737135 (-0.569289) |
| train_test_split | 0.124462 / 0.296338 (-0.171877) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.398937 / 0.215209 (0.183728) |
| read 50000 | 3.969053 / 2.077655 (1.891399) |
| read_batch 50000 10 | 1.803239 / 1.504120 (0.299119) |
| read_batch 50000 100 | 1.620947 / 1.541195 (0.079752) |
| read_batch 50000 1000 | 1.719039 / 1.468490 (0.250548) |
| read_formatted numpy 5000 | 0.681812 / 4.584777 (-3.902965) |
| read_formatted pandas 5000 | 3.749315 / 3.745712 (0.003603) |
| read_formatted tensorflow 5000 | 2.231583 / 5.269862 (-3.038279) |
| read_formatted torch 5000 | 1.516008 / 4.565676 (-3.049669) |
| read_formatted_batch numpy 5000 10 | 0.084250 / 0.424275 (-0.340025) |
| read_formatted_batch numpy 5000 1000 | 0.011911 / 0.007607 (0.004304) |
| shuffled read 5000 | 0.500982 / 0.226044 (0.274938) |
| shuffled read 50000 | 5.050484 / 2.268929 (2.781555) |
| shuffled read_batch 50000 10 | 2.219168 / 55.444624 (-53.225457) |
| shuffled read_batch 50000 100 | 1.918346 / 6.876477 (-4.958131) |
| shuffled read_batch 50000 1000 | 2.135077 / 2.142072 (-0.006996) |
| shuffled read_formatted numpy 5000 | 0.835382 / 4.805227 (-3.969845) |
| shuffled read_formatted_batch numpy 5000 10 | 0.164094 / 6.500664 (-6.336570) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064006 / 0.075469 (-0.011463) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.516452 / 1.841788 (-0.325336) |
| map fast-tokenizer batched | 15.036813 / 8.074308 (6.962505) |
| map identity | 25.509199 / 10.191392 (15.317807) |
| map identity batched | 0.833807 / 0.680424 (0.153383) |
| map no-op batched | 0.540849 / 0.534201 (0.006648) |
| map no-op batched numpy | 0.436984 / 0.579283 (-0.142299) |
| map no-op batched pandas | 0.424905 / 0.434364 (-0.009458) |
| map no-op batched pytorch | 0.263872 / 0.540337 (-0.276466) |
| map no-op batched tensorflow | 0.273112 / 1.386936 (-1.113824) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008535 / 0.011353 (-0.002818) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.005679 / 0.011008 (-0.005329) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.102523 / 0.038508 (0.064015) |
| read_batch_unformated after write_array2d | 0.042456 / 0.023109 (0.019346) |
| read_batch_unformated after write_flattened_sequence | 0.401381 / 0.275898 (0.125483) |
| read_batch_unformated after write_nested_sequence | 0.432271 / 0.323480 (0.108791) |
| read_col_formatted_as_numpy after write_array2d | 0.006936 / 0.007986 (-0.001049) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004524 / 0.004328 (0.000196) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.076019 / 0.004250 (0.071769) |
| read_col_unformated after write_array2d | 0.046677 / 0.037052 (0.009625) |
| read_col_unformated after write_flattened_sequence | 0.408773 / 0.258489 (0.150284) |
| read_col_unformated after write_nested_sequence | 0.444488 / 0.293841 (0.150647) |
| read_formatted_as_numpy after write_array2d | 0.038277 / 0.128546 (-0.090269) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012525 / 0.075646 (-0.063122) |
| read_formatted_as_numpy after write_nested_sequence | 0.339638 / 0.419271 (-0.079634) |
| read_unformated after write_array2d | 0.051976 / 0.043533 (0.008444) |
| read_unformated after write_flattened_sequence | 0.396384 / 0.255139 (0.141245) |
| read_unformated after write_nested_sequence | 0.412914 / 0.283200 (0.129714) |
| write_array2d | 0.116045 / 0.141683 (-0.025637) |
| write_flattened_sequence | 1.441337 / 1.452155 (-0.010818) |
| write_nested_sequence | 1.582525 / 1.492716 (0.089808) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.241009 / 0.018006 (0.223003) |
| get_batch_of_1024_rows | 0.498498 / 0.000490 (0.498008) |
| get_first_row | 0.003052 / 0.000200 (0.002852) |
| get_last_row | 0.000107 / 0.000054 (0.000052) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025989 / 0.037411 (-0.011422) |
| shard | 0.110070 / 0.014526 (0.095545) |
| shuffle | 0.119657 / 0.176557 (-0.056900) |
| sort | 0.169630 / 0.737135 (-0.567505) |
| train_test_split | 0.127521 / 0.296338 (-0.168817) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.431001 / 0.215209 (0.215791) |
| read 50000 | 4.236730 / 2.077655 (2.159075) |
| read_batch 50000 10 | 2.050476 / 1.504120 (0.546356) |
| read_batch 50000 100 | 1.866657 / 1.541195 (0.325462) |
| read_batch 50000 1000 | 1.878918 / 1.468490 (0.410428) |
| read_formatted numpy 5000 | 0.725949 / 4.584777 (-3.858828) |
| read_formatted pandas 5000 | 3.821068 / 3.745712 (0.075356) |
| read_formatted tensorflow 5000 | 3.086914 / 5.269862 (-2.182948) |
| read_formatted torch 5000 | 1.828434 / 4.565676 (-2.737243) |
| read_formatted_batch numpy 5000 10 | 0.088641 / 0.424275 (-0.335634) |
| read_formatted_batch numpy 5000 1000 | 0.012693 / 0.007607 (0.005086) |
| shuffled read 5000 | 0.551564 / 0.226044 (0.325519) |
| shuffled read 50000 | 5.401535 / 2.268929 (3.132607) |
| shuffled read_batch 50000 10 | 2.555928 / 55.444624 (-52.888697) |
| shuffled read_batch 50000 100 | 2.199399 / 6.876477 (-4.677078) |
| shuffled read_batch 50000 1000 | 2.450676 / 2.142072 (0.308604) |
| shuffled read_formatted numpy 5000 | 0.865674 / 4.805227 (-3.939554) |
| shuffled read_formatted_batch numpy 5000 10 | 0.171116 / 6.500664 (-6.329548) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.066196 / 0.075469 (-0.009273) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.570107 / 1.841788 (-0.271680) |
| map fast-tokenizer batched | 15.009869 / 8.074308 (6.935561) |
| map identity | 11.771047 / 10.191392 (1.579655) |
| map identity batched | 0.909175 / 0.680424 (0.228751) |
| map no-op batched | 0.612011 / 0.534201 (0.077810) |
| map no-op batched numpy | 0.424333 / 0.579283 (-0.154950) |
| map no-op batched pandas | 0.416765 / 0.434364 (-0.017599) |
| map no-op batched pytorch | 0.256028 / 0.540337 (-0.284309) |
| map no-op batched tensorflow | 0.264803 / 1.386936 (-1.122133) |
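
Every benchmark value above has the form `new / old (diff)`, and the numbers are consistent with diff = new − old up to rounding in the last printed digit. A small sketch for extracting and checking those triples from a pasted row follows; the helper names `parse_cells` and `diff_is_consistent` and the tolerance are assumptions made for illustration, not part of the benchmark tooling.

```python
import re
from typing import List, Tuple

# Matches one cell such as "0.011005 / 0.018006 (-0.007001)".
CELL_RE = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str) -> List[Tuple[float, float, float]]:
    """Return (new, old, diff) triples for every cell found in `row`."""
    return [
        (float(new), float(old), float(diff))
        for new, old, diff in CELL_RE.findall(row)
    ]


def diff_is_consistent(new: float, old: float, diff: float, tol: float = 2e-6) -> bool:
    """Check that diff equals new - old, allowing for rounding of printed values."""
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    row = "new / old (diff) 0.011005 / 0.018006 (-0.007001) 0.516317 / 0.000490 (0.515828)"
    for new, old, diff in parse_cells(row):
        print(new, old, diff, diff_is_consistent(new, old, diff))
```

A positive diff means the new measurement is larger than the old one.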
