style
lhoestq committed Sep 8, 2022
1 parent 78a97ec commit 08ed0a2
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion tests/test_dataset_cards.py
@@ -40,7 +40,8 @@ def get_changed_datasets(repo_path: Path) -> List[Path]:
     changed_datasets = {
         f.resolve().relative_to(datasets_dir_path).parts[0]
         for f in changed_files
-        if f.exists() and str(f.resolve()).startswith(str(datasets_dir_path))
+        if f.exists()
+        and str(f.resolve()).startswith(str(datasets_dir_path))
         and len(f.resolve().relative_to(datasets_dir_path).parts) > 1
         and f.name != "dummy_data.zip"
     }
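For readers who want to try the filter in isolation, here is a minimal, self-contained sketch of the set comprehension touched by this commit. The function name `changed_dataset_names`, the explicit `datasets_dir_path` parameter, and the sorted-list return value are assumptions made for illustration; the real `get_changed_datasets` takes `repo_path`, and the rest of its body is not shown in this diff.

```python
from pathlib import Path
from typing import Iterable, List


def changed_dataset_names(changed_files: Iterable[Path], datasets_dir_path: Path) -> List[str]:
    """Hypothetical helper: names of dataset directories touched by changed_files."""
    # Resolve once so the string-prefix check compares absolute, symlink-free paths.
    datasets_dir_path = datasets_dir_path.resolve()
    names = {
        # First path component below the datasets directory, e.g. "squad".
        f.resolve().relative_to(datasets_dir_path).parts[0]
        for f in changed_files
        if f.exists()  # skip files that no longer exist (e.g. deleted)
        and str(f.resolve()).startswith(str(datasets_dir_path))  # only files under datasets/
        and len(f.resolve().relative_to(datasets_dir_path).parts) > 1  # skip files directly in datasets/
        and f.name != "dummy_data.zip"  # ignore dummy data archives
    }
    return sorted(names)
```

The conditions are evaluated left to right and short-circuit, so `relative_to` is only called on existing files whose resolved path begins with the datasets directory. Splitting the first condition over two lines, as the commit does, does not change that order.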

1 comment on commit 08ed0a2

@github-actions

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007555 / 0.011353 (-0.003798) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003657 / 0.011008 (-0.007351) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028649 / 0.038508 (-0.009859) |
| read_batch_unformated after write_array2d | 0.029280 / 0.023109 (0.006170) |
| read_batch_unformated after write_flattened_sequence | 0.304433 / 0.275898 (0.028535) |
| read_batch_unformated after write_nested_sequence | 0.369049 / 0.323480 (0.045569) |
| read_col_formatted_as_numpy after write_array2d | 0.005433 / 0.007986 (-0.002552) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003067 / 0.004328 (-0.001261) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006685 / 0.004250 (0.002434) |
| read_col_unformated after write_array2d | 0.041210 / 0.037052 (0.004158) |
| read_col_unformated after write_flattened_sequence | 0.314035 / 0.258489 (0.055546) |
| read_col_unformated after write_nested_sequence | 0.358836 / 0.293841 (0.064995) |
| read_formatted_as_numpy after write_array2d | 0.028805 / 0.128546 (-0.099741) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009379 / 0.075646 (-0.066267) |
| read_formatted_as_numpy after write_nested_sequence | 0.246916 / 0.419271 (-0.172356) |
| read_unformated after write_array2d | 0.044243 / 0.043533 (0.000710) |
| read_unformated after write_flattened_sequence | 0.303957 / 0.255139 (0.048818) |
| read_unformated after write_nested_sequence | 0.333752 / 0.283200 (0.050552) |
| write_array2d | 0.088297 / 0.141683 (-0.053385) |
| write_flattened_sequence | 1.519589 / 1.452155 (0.067434) |
| write_nested_sequence | 1.556536 / 1.492716 (0.063819) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.200443 / 0.018006 (0.182436) |
| get_batch_of_1024_rows | 0.416460 / 0.000490 (0.415971) |
| get_first_row | 0.005619 / 0.000200 (0.005419) |
| get_last_row | 0.000113 / 0.000054 (0.000059) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.021958 / 0.037411 (-0.015453) |
| shard | 0.095322 / 0.014526 (0.080797) |
| shuffle | 0.105741 / 0.176557 (-0.070815) |
| sort | 0.148679 / 0.737135 (-0.588456) |
| train_test_split | 0.106090 / 0.296338 (-0.190249) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.414340 / 0.215209 (0.199131) |
| read 50000 | 4.110973 / 2.077655 (2.033318) |
| read_batch 50000 10 | 1.852557 / 1.504120 (0.348437) |
| read_batch 50000 100 | 1.656017 / 1.541195 (0.114822) |
| read_batch 50000 1000 | 1.707204 / 1.468490 (0.238714) |
| read_formatted numpy 5000 | 0.440132 / 4.584777 (-4.144644) |
| read_formatted pandas 5000 | 3.365069 / 3.745712 (-0.380644) |
| read_formatted tensorflow 5000 | 3.513558 / 5.269862 (-1.756304) |
| read_formatted torch 5000 | 1.610853 / 4.565676 (-2.954823) |
| read_formatted_batch numpy 5000 10 | 0.052381 / 0.424275 (-0.371894) |
| read_formatted_batch numpy 5000 1000 | 0.010977 / 0.007607 (0.003370) |
| shuffled read 5000 | 0.518677 / 0.226044 (0.292633) |
| shuffled read 50000 | 5.233301 / 2.268929 (2.964372) |
| shuffled read_batch 50000 10 | 2.271349 / 55.444624 (-53.173276) |
| shuffled read_batch 50000 100 | 1.941073 / 6.876477 (-4.935403) |
| shuffled read_batch 50000 1000 | 2.067671 / 2.142072 (-0.074401) |
| shuffled read_formatted numpy 5000 | 0.560291 / 4.805227 (-4.244936) |
| shuffled read_formatted_batch numpy 5000 10 | 0.118077 / 6.500664 (-6.382588) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064583 / 0.075469 (-0.010886) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.480322 / 1.841788 (-0.361466) |
| map fast-tokenizer batched | 12.727661 / 8.074308 (4.653353) |
| map identity | 26.142530 / 10.191392 (15.951138) |
| map identity batched | 0.840028 / 0.680424 (0.159604) |
| map no-op batched | 0.552248 / 0.534201 (0.018047) |
| map no-op batched numpy | 0.346099 / 0.579283 (-0.233184) |
| map no-op batched pandas | 0.400296 / 0.434364 (-0.034068) |
| map no-op batched pytorch | 0.228678 / 0.540337 (-0.311660) |
| map no-op batched tensorflow | 0.236458 / 1.386936 (-1.150478) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005848 / 0.011353 (-0.005504) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003696 / 0.011008 (-0.007312) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.026884 / 0.038508 (-0.011624) |
| read_batch_unformated after write_array2d | 0.028110 / 0.023109 (0.005001) |
| read_batch_unformated after write_flattened_sequence | 0.369042 / 0.275898 (0.093144) |
| read_batch_unformated after write_nested_sequence | 0.438402 / 0.323480 (0.114922) |
| read_col_formatted_as_numpy after write_array2d | 0.003582 / 0.007986 (-0.004403) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004202 / 0.004328 (-0.000127) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.004600 / 0.004250 (0.000349) |
| read_col_unformated after write_array2d | 0.036965 / 0.037052 (-0.000088) |
| read_col_unformated after write_flattened_sequence | 0.382425 / 0.258489 (0.123936) |
| read_col_unformated after write_nested_sequence | 0.422686 / 0.293841 (0.128845) |
| read_formatted_as_numpy after write_array2d | 0.027486 / 0.128546 (-0.101060) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009468 / 0.075646 (-0.066178) |
| read_formatted_as_numpy after write_nested_sequence | 0.249536 / 0.419271 (-0.169735) |
| read_unformated after write_array2d | 0.046550 / 0.043533 (0.003018) |
| read_unformated after write_flattened_sequence | 0.371457 / 0.255139 (0.116318) |
| read_unformated after write_nested_sequence | 0.399320 / 0.283200 (0.116121) |
| write_array2d | 0.091942 / 0.141683 (-0.049741) |
| write_flattened_sequence | 1.571177 / 1.452155 (0.119022) |
| write_nested_sequence | 1.638587 / 1.492716 (0.145871) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.225872 / 0.018006 (0.207865) |
| get_batch_of_1024_rows | 0.411863 / 0.000490 (0.411373) |
| get_first_row | 0.004884 / 0.000200 (0.004684) |
| get_last_row | 0.000079 / 0.000054 (0.000024) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.020378 / 0.037411 (-0.017033) |
| shard | 0.091012 / 0.014526 (0.076486) |
| shuffle | 0.104311 / 0.176557 (-0.072245) |
| sort | 0.143105 / 0.737135 (-0.594030) |
| train_test_split | 0.105906 / 0.296338 (-0.190432) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.474329 / 0.215209 (0.259120) |
| read 50000 | 4.712588 / 2.077655 (2.634933) |
| read_batch 50000 10 | 2.473104 / 1.504120 (0.968984) |
| read_batch 50000 100 | 2.271169 / 1.541195 (0.729974) |
| read_batch 50000 1000 | 2.332151 / 1.468490 (0.863661) |
| read_formatted numpy 5000 | 0.448298 / 4.584777 (-4.136479) |
| read_formatted pandas 5000 | 3.331272 / 3.745712 (-0.414440) |
| read_formatted tensorflow 5000 | 3.215612 / 5.269862 (-2.054250) |
| read_formatted torch 5000 | 1.407275 / 4.565676 (-3.158401) |
| read_formatted_batch numpy 5000 10 | 0.052516 / 0.424275 (-0.371759) |
| read_formatted_batch numpy 5000 1000 | 0.010964 / 0.007607 (0.003356) |
| shuffled read 5000 | 0.574122 / 0.226044 (0.348077) |
| shuffled read 50000 | 5.773919 / 2.268929 (3.504991) |
| shuffled read_batch 50000 10 | 2.883370 / 55.444624 (-52.561254) |
| shuffled read_batch 50000 100 | 2.548876 / 6.876477 (-4.327601) |
| shuffled read_batch 50000 1000 | 2.662069 / 2.142072 (0.519997) |
| shuffled read_formatted numpy 5000 | 0.559566 / 4.805227 (-4.245661) |
| shuffled read_formatted_batch numpy 5000 10 | 0.119077 / 6.500664 (-6.381587) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064432 / 0.075469 (-0.011037) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.558426 / 1.841788 (-0.283362) |
| map fast-tokenizer batched | 12.896520 / 8.074308 (4.822212) |
| map identity | 26.120173 / 10.191392 (15.928781) |
| map identity batched | 0.907873 / 0.680424 (0.227449) |
| map no-op batched | 0.634480 / 0.534201 (0.100279) |
| map no-op batched numpy | 0.343160 / 0.579283 (-0.236124) |
| map no-op batched pandas | 0.394609 / 0.434364 (-0.039755) |
| map no-op batched pytorch | 0.230029 / 0.540337 (-0.310308) |
| map no-op batched tensorflow | 0.236263 / 1.386936 (-1.150673) |
