Commit

Fix languages of X-CSQA configs in xcsr dataset (#5022)
* Fix languages of X-CSQA configs in xcsr dataset

* Update metadata JSON
albertvillanova committed Sep 26, 2022
1 parent 365773b commit 5eefb56
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion datasets/xcsr/dataset_infos.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion datasets/xcsr/xcsr.py
@@ -104,7 +104,7 @@ class Xcsr(datasets.GeneratorBasedBuilder):
     BUILDER_CONFIGS = [
         XcsrConfig(
             name="X-CSQA-" + lang,
-            language="en",
+            language=lang,
             version=datasets.Version("1.1.0", ""),
             description=f"Plain text import of X-CSQA for the {lang} language",
         )
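
The effect of this one-line change can be sketched without importing `datasets`. In the sketch below, `FakeXcsrConfig`, `LANGS`, and `build_configs` are hypothetical stand-ins introduced for illustration: the real `XcsrConfig` and the real language list live in `xcsr.py` and are not visible in this hunk. Only the `name` and `language` arguments mirror the diff.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical stand-in for XcsrConfig; the real class takes more
# arguments (version, description, ...) than are modelled here.
@dataclass
class FakeXcsrConfig:
    name: str
    language: str


# Illustrative subset only (an assumption); the real list is in xcsr.py.
LANGS = ("en", "de", "fr")


def build_configs(fixed: bool) -> List[FakeXcsrConfig]:
    """Build one X-CSQA config per language, before or after the fix."""
    return [
        FakeXcsrConfig(
            name="X-CSQA-" + lang,
            # Before the fix every config was tagged "en"; after it,
            # each config carries its own language code.
            language=lang if fixed else "en",
        )
        for lang in LANGS
    ]


before = {c.name: c.language for c in build_configs(fixed=False)}
after = {c.name: c.language for c in build_configs(fixed=True)}
print("before:", before)
print("after: ", after)
```

With the hard-coded value, a config such as `X-CSQA-de` would report `en` as its language even though its name says otherwise; passing `lang` through keeps the two in agreement.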

1 comment on commit 5eefb56

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.010722 / 0.011353 (-0.000631) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.008432 / 0.011008 (-0.002576) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.039361 / 0.038508 (0.000853) |
| read_batch_unformated after write_array2d | 0.040167 / 0.023109 (0.017058) |
| read_batch_unformated after write_flattened_sequence | 0.390446 / 0.275898 (0.114548) |
| read_batch_unformated after write_nested_sequence | 0.483708 / 0.323480 (0.160228) |
| read_col_formatted_as_numpy after write_array2d | 0.007860 / 0.007986 (-0.000126) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004859 / 0.004328 (0.000530) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008675 / 0.004250 (0.004424) |
| read_col_unformated after write_array2d | 0.057286 / 0.037052 (0.020234) |
| read_col_unformated after write_flattened_sequence | 0.432917 / 0.258489 (0.174428) |
| read_col_unformated after write_nested_sequence | 0.487662 / 0.293841 (0.193821) |
| read_formatted_as_numpy after write_array2d | 0.063470 / 0.128546 (-0.065077) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014971 / 0.075646 (-0.060675) |
| read_formatted_as_numpy after write_nested_sequence | 0.347317 / 0.419271 (-0.071955) |
| read_unformated after write_array2d | 0.069854 / 0.043533 (0.026321) |
| read_unformated after write_flattened_sequence | 0.419256 / 0.255139 (0.164117) |
| read_unformated after write_nested_sequence | 0.451463 / 0.283200 (0.168263) |
| write_array2d | 0.115219 / 0.141683 (-0.026464) |
| write_flattened_sequence | 1.860623 / 1.452155 (0.408469) |
| write_nested_sequence | 1.925123 / 1.492716 (0.432407) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.285945 / 0.018006 (0.267939) |
| get_batch_of_1024_rows | 0.669433 / 0.000490 (0.668944) |
| get_first_row | 0.004513 / 0.000200 (0.004313) |
| get_last_row | 0.000118 / 0.000054 (0.000064) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.028659 / 0.037411 (-0.008753) |
| shard | 0.129561 / 0.014526 (0.115035) |
| shuffle | 0.139843 / 0.176557 (-0.036713) |
| sort | 0.203436 / 0.737135 (-0.533699) |
| train_test_split | 0.138619 / 0.296338 (-0.157719) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.578289 / 0.215209 (0.363080) |
| read 50000 | 6.185475 / 2.077655 (4.107821) |
| read_batch 50000 10 | 2.480871 / 1.504120 (0.976751) |
| read_batch 50000 100 | 2.122677 / 1.541195 (0.581482) |
| read_batch 50000 1000 | 2.275624 / 1.468490 (0.807134) |
| read_formatted numpy 5000 | 0.827050 / 4.584777 (-3.757727) |
| read_formatted pandas 5000 | 5.858199 / 3.745712 (2.112486) |
| read_formatted tensorflow 5000 | 5.121554 / 5.269862 (-0.148308) |
| read_formatted torch 5000 | 2.745623 / 4.565676 (-1.820054) |
| read_formatted_batch numpy 5000 10 | 0.086347 / 0.424275 (-0.337928) |
| read_formatted_batch numpy 5000 1000 | 0.014482 / 0.007607 (0.006875) |
| shuffled read 5000 | 0.769707 / 0.226044 (0.543663) |
| shuffled read 50000 | 7.741476 / 2.268929 (5.472547) |
| shuffled read_batch 50000 10 | 3.179407 / 55.444624 (-52.265217) |
| shuffled read_batch 50000 100 | 2.524397 / 6.876477 (-4.352080) |
| shuffled read_batch 50000 1000 | 2.701572 / 2.142072 (0.559500) |
| shuffled read_formatted numpy 5000 | 0.916216 / 4.805227 (-3.889011) |
| shuffled read_formatted_batch numpy 5000 10 | 0.190689 / 6.500664 (-6.309975) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.078526 / 0.075469 (0.003057) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.912122 / 1.841788 (0.070334) |
| map fast-tokenizer batched | 17.452193 / 8.074308 (9.377885) |
| map identity | 41.253855 / 10.191392 (31.062463) |
| map identity batched | 1.212924 / 0.680424 (0.532500) |
| map no-op batched | 0.736850 / 0.534201 (0.202650) |
| map no-op batched numpy | 0.528273 / 0.579283 (-0.051010) |
| map no-op batched pandas | 0.654740 / 0.434364 (0.220376) |
| map no-op batched pytorch | 0.363218 / 0.540337 (-0.177120) |
| map no-op batched tensorflow | 0.380473 / 1.386936 (-1.006463) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009205 / 0.011353 (-0.002148) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004814 / 0.011008 (-0.006195) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.037287 / 0.038508 (-0.001221) |
| read_batch_unformated after write_array2d | 0.037173 / 0.023109 (0.014064) |
| read_batch_unformated after write_flattened_sequence | 0.473832 / 0.275898 (0.197934) |
| read_batch_unformated after write_nested_sequence | 0.533160 / 0.323480 (0.209680) |
| read_col_formatted_as_numpy after write_array2d | 0.004929 / 0.007986 (-0.003057) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004349 / 0.004328 (0.000021) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006132 / 0.004250 (0.001882) |
| read_col_unformated after write_array2d | 0.053393 / 0.037052 (0.016340) |
| read_col_unformated after write_flattened_sequence | 0.469588 / 0.258489 (0.211099) |
| read_col_unformated after write_nested_sequence | 0.526039 / 0.293841 (0.232198) |
| read_formatted_as_numpy after write_array2d | 0.059497 / 0.128546 (-0.069049) |
| read_formatted_as_numpy after write_flattened_sequence | 0.018961 / 0.075646 (-0.056685) |
| read_formatted_as_numpy after write_nested_sequence | 0.353608 / 0.419271 (-0.065664) |
| read_unformated after write_array2d | 0.075946 / 0.043533 (0.032413) |
| read_unformated after write_flattened_sequence | 0.464790 / 0.255139 (0.209651) |
| read_unformated after write_nested_sequence | 0.504253 / 0.283200 (0.221053) |
| write_array2d | 0.121285 / 0.141683 (-0.020398) |
| write_flattened_sequence | 1.844816 / 1.452155 (0.392661) |
| write_nested_sequence | 1.909801 / 1.492716 (0.417084) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.348622 / 0.018006 (0.330616) |
| get_batch_of_1024_rows | 0.587967 / 0.000490 (0.587477) |
| get_first_row | 0.022483 / 0.000200 (0.022283) |
| get_last_row | 0.000187 / 0.000054 (0.000133) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.028461 / 0.037411 (-0.008950) |
| shard | 0.121828 / 0.014526 (0.107303) |
| shuffle | 0.147503 / 0.176557 (-0.029054) |
| sort | 0.182750 / 0.737135 (-0.554385) |
| train_test_split | 0.139871 / 0.296338 (-0.156468) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.658486 / 0.215209 (0.443277) |
| read 50000 | 6.619348 / 2.077655 (4.541694) |
| read_batch 50000 10 | 3.019685 / 1.504120 (1.515565) |
| read_batch 50000 100 | 2.638397 / 1.541195 (1.097203) |
| read_batch 50000 1000 | 2.651036 / 1.468490 (1.182546) |
| read_formatted numpy 5000 | 0.731144 / 4.584777 (-3.853633) |
| read_formatted pandas 5000 | 5.672239 / 3.745712 (1.926527) |
| read_formatted tensorflow 5000 | 5.409598 / 5.269862 (0.139737) |
| read_formatted torch 5000 | 2.715551 / 4.565676 (-1.850125) |
| read_formatted_batch numpy 5000 10 | 0.087235 / 0.424275 (-0.337040) |
| read_formatted_batch numpy 5000 1000 | 0.014001 / 0.007607 (0.006394) |
| shuffled read 5000 | 0.833041 / 0.226044 (0.606997) |
| shuffled read 50000 | 8.379768 / 2.268929 (6.110839) |
| shuffled read_batch 50000 10 | 3.737860 / 55.444624 (-51.706764) |
| shuffled read_batch 50000 100 | 3.113777 / 6.876477 (-3.762700) |
| shuffled read_batch 50000 1000 | 3.270169 / 2.142072 (1.128096) |
| shuffled read_formatted numpy 5000 | 0.906191 / 4.805227 (-3.899037) |
| shuffled read_formatted_batch numpy 5000 10 | 0.194648 / 6.500664 (-6.306016) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.091740 / 0.075469 (0.016271) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 2.076695 / 1.841788 (0.234907) |
| map fast-tokenizer batched | 17.069404 / 8.074308 (8.995096) |
| map identity | 42.681320 / 10.191392 (32.489928) |
| map identity batched | 1.202551 / 0.680424 (0.522127) |
| map no-op batched | 0.826841 / 0.534201 (0.292640) |
| map no-op batched numpy | 0.494548 / 0.579283 (-0.084735) |
| map no-op batched pandas | 0.634286 / 0.434364 (0.199922) |
| map no-op batched pytorch | 0.351615 / 0.540337 (-0.188722) |
| map no-op batched tensorflow | 0.371583 / 1.386936 (-1.015353) |
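
Each benchmark cell has the form `new / old (diff)`, and the numbers suggest that `diff` is `new - old`, rounded in the last digit. The sketch below parses one such cell and checks that relationship. `parse_cell` is a hypothetical helper written for this note, not part of the benchmark bot, and the `new - old` reading is an inference from the values rather than something the comment states.

```python
import re
from typing import Tuple

# Hypothetical helper: matches cells such as
# "0.010722 / 0.011353 (-0.000631)".
CELL = re.compile(r"^\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)\s*\((-?[\d.]+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split a 'new / old (diff)' cell into (new, old, diff) floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


new, old, diff = parse_cell("0.010722 / 0.011353 (-0.000631)")
# Assumption being checked: diff is new - old, up to rounding
# in the sixth decimal place.
assert abs((new - old) - diff) < 2e-6
```

Under that reading, a negative `diff` means the new run took less time than the old one for that metric.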
