Commit

hf:// -> hf-legacy://
lhoestq committed Oct 11, 2022
1 parent bef23be commit f43aeb0
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datasets/filesystems/hffilesystem.py
@@ -12,7 +12,7 @@ class HfFileSystem(AbstractFileSystem):
     """Interface to files in a Hugging face repository"""
 
     root_marker = ""
-    protocol = "hf"
+    protocol = "hf-legacy" # "hf://"" is reserved for hffs
 
     def __init__(
         self,
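The rename matters because a URL scheme can only point at one filesystem class at a time. The following is a minimal sketch of that idea, assuming a toy dictionary registry; it is not fsspec's actual registry code, and `register`, `resolve`, `NewHubFileSystem` and `LegacyHfFileSystem` are hypothetical names used only for illustration.

```python
# Toy registry (an assumption for illustration, not fsspec's implementation):
# maps the scheme in front of "://" to a filesystem class.
REGISTRY = {}


def register(cls):
    """Register cls under its `protocol`, refusing to overwrite an existing entry."""
    if cls.protocol in REGISTRY:
        owner = REGISTRY[cls.protocol].__name__
        raise ValueError(f"protocol {cls.protocol!r} is already taken by {owner}")
    REGISTRY[cls.protocol] = cls
    return cls


def resolve(url):
    """Return the class registered for the scheme of `url`."""
    protocol, sep, _path = url.partition("://")
    if not sep:
        raise ValueError(f"no protocol in {url!r}")
    return REGISTRY[protocol]


@register
class NewHubFileSystem:  # hypothetical stand-in for the filesystem that hffs provides
    protocol = "hf"


@register
class LegacyHfFileSystem:  # hypothetical stand-in for the HfFileSystem touched by this commit
    protocol = "hf-legacy"


# Each scheme now resolves to exactly one class.
print(resolve("hf-legacy://some/file.txt").__name__)
print(resolve("hf://some/file.txt").__name__)
```

With both classes claiming `"hf"`, the second `register` call in this sketch would raise; moving the older class to `"hf-legacy"` avoids the collision.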

1 comment on commit f43aeb0

@github-actions
PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008612 | 0.011353 | -0.002740 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004659 | 0.011008 | -0.006349 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.099285 | 0.038508 | 0.060777 |
| read_batch_unformated after write_array2d | 0.029887 | 0.023109 | 0.006778 |
| read_batch_unformated after write_flattened_sequence | 0.309747 | 0.275898 | 0.033849 |
| read_batch_unformated after write_nested_sequence | 0.367266 | 0.323480 | 0.043786 |
| read_col_formatted_as_numpy after write_array2d | 0.007078 | 0.007986 | -0.000907 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003523 | 0.004328 | -0.000806 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.077836 | 0.004250 | 0.073586 |
| read_col_unformated after write_array2d | 0.038440 | 0.037052 | 0.001388 |
| read_col_unformated after write_flattened_sequence | 0.312745 | 0.258489 | 0.054256 |
| read_col_unformated after write_nested_sequence | 0.352506 | 0.293841 | 0.058665 |
| read_formatted_as_numpy after write_array2d | 0.036839 | 0.128546 | -0.091707 |
| read_formatted_as_numpy after write_flattened_sequence | 0.014287 | 0.075646 | -0.061359 |
| read_formatted_as_numpy after write_nested_sequence | 0.327674 | 0.419271 | -0.091597 |
| read_unformated after write_array2d | 0.043052 | 0.043533 | -0.000481 |
| read_unformated after write_flattened_sequence | 0.305842 | 0.255139 | 0.050703 |
| read_unformated after write_nested_sequence | 0.335234 | 0.283200 | 0.052034 |
| write_array2d | 0.087364 | 0.141683 | -0.054319 |
| write_flattened_sequence | 1.498320 | 1.452155 | 0.046166 |
| write_nested_sequence | 1.517677 | 1.492716 | 0.024960 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.198427 | 0.018006 | 0.180421 |
| get_batch_of_1024_rows | 0.418512 | 0.000490 | 0.418022 |
| get_first_row | 0.002924 | 0.000200 | 0.002724 |
| get_last_row | 0.000071 | 0.000054 | 0.000016 |
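Judging from the numbers in these benchmark rows, the parenthesised value appears to be `new - old`, up to rounding in the last digit. A minimal sketch for pulling the triples out of a raw `new / old (diff)` row and checking that relationship; `parse_row` and the `CELL` pattern are hypothetical helpers, not part of the benchmark tooling.

```python
import re

# One cell looks like "0.198427 / 0.018006 (0.180421)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_row(row):
    """Return a list of (new, old, diff) float triples found in a benchmark row."""
    return [tuple(float(x) for x in m.groups()) for m in CELL.finditer(row)]


# Two cells copied from the benchmark_getitem_100B.json row for PyArrow==6.0.0.
row = "new / old (diff) 0.198427 / 0.018006 (0.180421) 0.418512 / 0.000490 (0.418022)"

for new, old, diff in parse_row(row):
    # The reported diff should match new - old within rounding of the sixth decimal.
    assert abs((new - old) - diff) < 2e-6
```

A positive diff therefore means the new value is larger than the old one for that metric.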

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.021149 | 0.037411 | -0.016262 |
| shard | 0.091547 | 0.014526 | 0.077021 |
| shuffle | 0.105675 | 0.176557 | -0.070881 |
| sort | 0.148781 | 0.737135 | -0.588355 |
| train_test_split | 0.105889 | 0.296338 | -0.190449 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.412667 | 0.215209 | 0.197458 |
| read 50000 | 4.132622 | 2.077655 | 2.054967 |
| read_batch 50000 10 | 1.859357 | 1.504120 | 0.355237 |
| read_batch 50000 100 | 1.653137 | 1.541195 | 0.111942 |
| read_batch 50000 1000 | 1.718332 | 1.468490 | 0.249842 |
| read_formatted numpy 5000 | 0.691844 | 4.584777 | -3.892933 |
| read_formatted pandas 5000 | 3.285676 | 3.745712 | -0.460036 |
| read_formatted tensorflow 5000 | 1.896259 | 5.269862 | -3.373602 |
| read_formatted torch 5000 | 1.271801 | 4.565676 | -3.293875 |
| read_formatted_batch numpy 5000 10 | 0.080857 | 0.424275 | -0.343418 |
| read_formatted_batch numpy 5000 1000 | 0.011826 | 0.007607 | 0.004219 |
| shuffled read 5000 | 0.523584 | 0.226044 | 0.297539 |
| shuffled read 50000 | 5.245043 | 2.268929 | 2.976115 |
| shuffled read_batch 50000 10 | 2.324504 | 55.444624 | -53.120121 |
| shuffled read_batch 50000 100 | 1.982280 | 6.876477 | -4.894196 |
| shuffled read_batch 50000 1000 | 2.049921 | 2.142072 | -0.092151 |
| shuffled read_formatted numpy 5000 | 0.804411 | 4.805227 | -4.000816 |
| shuffled read_formatted_batch numpy 5000 10 | 0.148474 | 6.500664 | -6.352190 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065238 | 0.075469 | -0.010231 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.480472 | 1.841788 | -0.361316 |
| map fast-tokenizer batched | 12.555973 | 8.074308 | 4.481665 |
| map identity | 26.196465 | 10.191392 | 16.005073 |
| map identity batched | 0.848259 | 0.680424 | 0.167835 |
| map no-op batched | 0.569504 | 0.534201 | 0.035303 |
| map no-op batched numpy | 0.382734 | 0.579283 | -0.196550 |
| map no-op batched pandas | 0.395532 | 0.434364 | -0.038832 |
| map no-op batched pytorch | 0.228793 | 0.540337 | -0.311544 |
| map no-op batched tensorflow | 0.233263 | 1.386936 | -1.153673 |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new | old | diff |
|---|---|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007146 | 0.011353 | -0.004207 |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004718 | 0.011008 | -0.006290 |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098265 | 0.038508 | 0.059757 |
| read_batch_unformated after write_array2d | 0.028612 | 0.023109 | 0.005503 |
| read_batch_unformated after write_flattened_sequence | 0.340322 | 0.275898 | 0.064424 |
| read_batch_unformated after write_nested_sequence | 0.380395 | 0.323480 | 0.056915 |
| read_col_formatted_as_numpy after write_array2d | 0.005450 | 0.007986 | -0.002536 |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003633 | 0.004328 | -0.000695 |
| read_col_formatted_as_numpy after write_nested_sequence | 0.074659 | 0.004250 | 0.070409 |
| read_col_unformated after write_array2d | 0.036548 | 0.037052 | -0.000505 |
| read_col_unformated after write_flattened_sequence | 0.340107 | 0.258489 | 0.081618 |
| read_col_unformated after write_nested_sequence | 0.387493 | 0.293841 | 0.093652 |
| read_formatted_as_numpy after write_array2d | 0.034104 | 0.128546 | -0.094442 |
| read_formatted_as_numpy after write_flattened_sequence | 0.011907 | 0.075646 | -0.063739 |
| read_formatted_as_numpy after write_nested_sequence | 0.326112 | 0.419271 | -0.093159 |
| read_unformated after write_array2d | 0.042786 | 0.043533 | -0.000747 |
| read_unformated after write_flattened_sequence | 0.339166 | 0.255139 | 0.084027 |
| read_unformated after write_nested_sequence | 0.365531 | 0.283200 | 0.082332 |
| write_array2d | 0.089519 | 0.141683 | -0.052164 |
| write_flattened_sequence | 1.469403 | 1.452155 | 0.017248 |
| write_nested_sequence | 1.545417 | 1.492716 | 0.052701 |

Benchmark: benchmark_getitem_100B.json

| metric | new | old | diff |
|---|---|---|---|
| get_batch_of_1024_random_rows | 0.168085 | 0.018006 | 0.150079 |
| get_batch_of_1024_rows | 0.421928 | 0.000490 | 0.421438 |
| get_first_row | 0.002752 | 0.000200 | 0.002552 |
| get_last_row | 0.000079 | 0.000054 | 0.000025 |

Benchmark: benchmark_indices_mapping.json

| metric | new | old | diff |
|---|---|---|---|
| select | 0.021043 | 0.037411 | -0.016368 |
| shard | 0.096061 | 0.014526 | 0.081535 |
| shuffle | 0.105153 | 0.176557 | -0.071403 |
| sort | 0.145074 | 0.737135 | -0.592062 |
| train_test_split | 0.109522 | 0.296338 | -0.186816 |

Benchmark: benchmark_iterating.json

| metric | new | old | diff |
|---|---|---|---|
| read 5000 | 0.435719 | 0.215209 | 0.220510 |
| read 50000 | 4.338263 | 2.077655 | 2.260608 |
| read_batch 50000 10 | 2.065203 | 1.504120 | 0.561084 |
| read_batch 50000 100 | 1.860084 | 1.541195 | 0.318889 |
| read_batch 50000 1000 | 1.906243 | 1.468490 | 0.437752 |
| read_formatted numpy 5000 | 0.698185 | 4.584777 | -3.886592 |
| read_formatted pandas 5000 | 3.312660 | 3.745712 | -0.433053 |
| read_formatted tensorflow 5000 | 1.882801 | 5.269862 | -3.387060 |
| read_formatted torch 5000 | 1.171235 | 4.565676 | -3.394442 |
| read_formatted_batch numpy 5000 10 | 0.081490 | 0.424275 | -0.342786 |
| read_formatted_batch numpy 5000 1000 | 0.011523 | 0.007607 | 0.003916 |
| shuffled read 5000 | 0.534655 | 0.226044 | 0.308610 |
| shuffled read 50000 | 5.365464 | 2.268929 | 3.096536 |
| shuffled read_batch 50000 10 | 2.483321 | 55.444624 | -52.961304 |
| shuffled read_batch 50000 100 | 2.150492 | 6.876477 | -4.725985 |
| shuffled read_batch 50000 1000 | 2.275889 | 2.142072 | 0.133816 |
| shuffled read_formatted numpy 5000 | 0.803000 | 4.805227 | -4.002227 |
| shuffled read_formatted_batch numpy 5000 10 | 0.147664 | 6.500664 | -6.353000 |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064363 | 0.075469 | -0.011106 |

Benchmark: benchmark_map_filter.json

| metric | new | old | diff |
|---|---|---|---|
| filter | 1.574588 | 1.841788 | -0.267200 |
| map fast-tokenizer batched | 13.070501 | 8.074308 | 4.996193 |
| map identity | 12.305020 | 10.191392 | 2.113628 |
| map identity batched | 0.904229 | 0.680424 | 0.223805 |
| map no-op batched | 0.631096 | 0.534201 | 0.096896 |
| map no-op batched numpy | 0.372811 | 0.579283 | -0.206473 |
| map no-op batched pandas | 0.373162 | 0.434364 | -0.061202 |
| map no-op batched pytorch | 0.228054 | 0.540337 | -0.312283 |
| map no-op batched tensorflow | 0.232625 | 1.386936 | -1.154311 |
