Bump minimum hfh to 0.2.0 and test minimum version
lhoestq committed Sep 27, 2022
1 parent 49bb38f commit 99dde63
Showing 2 changed files with 9 additions and 8 deletions.
14 changes: 7 additions & 7 deletions .github/workflows/ci.yml
@@ -38,7 +38,7 @@ jobs:
       matrix:
         test: ['unit', 'integration']
         os: [ubuntu-latest, windows-latest]
-        pyarrow_version: [latest, 6.0.1]
+        deps_versions: [latest, minimum]
     continue-on-error: ${{ matrix.test == 'integration' }}
     runs-on: ${{ matrix.os }}
     steps:
@@ -63,12 +63,12 @@ jobs:
         run: |
           pip install .[tests]
           pip install -r additional-tests-requirements.txt --no-deps
-      - name: Install latest PyArrow
-        if: ${{ matrix.pyarrow_version == 'latest' }}
-        run: pip install pyarrow --upgrade
-      - name: Install PyArrow ${{ matrix.pyarrow_version }}
-        if: ${{ matrix.pyarrow_version != 'latest' }}
-        run: pip install pyarrow==${{ matrix.pyarrow_version }}
+      - name: Install dependencies (latest versions)
+        if: ${{ matrix.deps_versions == 'latest' }}
+        run: pip install --upgrade pyarrow huggingface-hub
+      - name: Install dependencies (minimum versions)
+        if: ${{ matrix.deps_versions != 'latest' }}
+        run: pip install pyarrow==6.0.1 huggingface-hub==0.2.0
       - name: Test with pytest
         run: |
           python -m pytest -rfExX -m ${{ matrix.test }} -n 2 --dist loadfile -sv ./tests/
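To make the effect of the matrix change concrete, here is a minimal Python sketch of how the three matrix axes expand into jobs and which install command each job would run. GitHub Actions performs this expansion itself; the names `MATRIX`, `expand`, `install_command`, and `may_fail` are hypothetical and exist only for illustration. The axis values and pip commands are copied from the workflow diff above.

```python
from itertools import product

# Matrix axes as they appear in ci.yml after this commit.
MATRIX = {
    "test": ["unit", "integration"],
    "os": ["ubuntu-latest", "windows-latest"],
    "deps_versions": ["latest", "minimum"],
}


def expand(matrix):
    """Return one dict per job: the Cartesian product of all axes."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*matrix.values())]


def install_command(job):
    """Mirror the two `if:` conditions on the dependency-install steps."""
    if job["deps_versions"] == "latest":
        return "pip install --upgrade pyarrow huggingface-hub"
    return "pip install pyarrow==6.0.1 huggingface-hub==0.2.0"


def may_fail(job):
    """Mirror `continue-on-error: ${{ matrix.test == 'integration' }}`."""
    return job["test"] == "integration"


jobs = expand(MATRIX)
for job in jobs:
    print(job, "->", install_command(job), "| may fail:", may_fail(job))
```

Under these assumptions the matrix yields eight jobs, half of which install the pinned minimum versions of both `pyarrow` and `huggingface-hub`.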
3 changes: 2 additions & 1 deletion setup.py
@@ -89,7 +89,8 @@
     # for data streaming via http
     "aiohttp",
     # To get datasets from the Datasets Hub on huggingface.co
-    "huggingface-hub>=0.1.0,<1.0.0",
+    # minimum 0.2.0 for set_access_token
+    "huggingface-hub>=0.2.0,<1.0.0",
     # Utilities from PyPA to e.g., compare versions
     "packaging",
     "responses<0.19",
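The lower bound in setup.py and the pin in the "minimum versions" CI step have to be kept in sync by hand. The following is a rough sketch of a check that derives the `==` pin from a requirement string and confirms that the CI command installs it. `minimum_pin` and `ci_tests_minimum` are hypothetical helpers, not part of the repository, and the parser is deliberately simplified: it assumes a plain `name>=X,<Y` requirement with no extras or environment markers.

```python
import re

# Requirement string copied from the setup.py hunk above.
REQUIREMENT = "huggingface-hub>=0.2.0,<1.0.0"
# Command copied from the "minimum versions" CI step above.
CI_MINIMUM_INSTALL = "pip install pyarrow==6.0.1 huggingface-hub==0.2.0"


def minimum_pin(requirement: str) -> str:
    """Turn 'name>=X,<Y' into 'name==X' (simplified: no extras or markers)."""
    match = re.fullmatch(r"\s*([A-Za-z0-9_.-]+)\s*(.*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    name, spec = match.groups()
    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            return f"{name}=={clause[2:].strip()}"
    raise ValueError(f"no lower bound in: {requirement!r}")


def ci_tests_minimum(requirement: str, install_command: str) -> bool:
    """True when the CI command installs exactly the declared lower bound."""
    return minimum_pin(requirement) in install_command.split()


print(minimum_pin(REQUIREMENT))
print(ci_tests_minimum(REQUIREMENT, CI_MINIMUM_INSTALL))
```

A real project would more likely parse requirements with the `packaging` library listed in the dependencies above; the regular expression here only keeps the sketch free of third-party imports.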

1 comment on commit 99dde63

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009925 / 0.011353 (-0.001428) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004971 / 0.011008 (-0.006037) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.043093 / 0.038508 (0.004584) |
| read_batch_unformated after write_array2d | 0.036444 / 0.023109 (0.013335) |
| read_batch_unformated after write_flattened_sequence | 0.358367 / 0.275898 (0.082469) |
| read_batch_unformated after write_nested_sequence | 0.447614 / 0.323480 (0.124134) |
| read_col_formatted_as_numpy after write_array2d | 0.006879 / 0.007986 (-0.001106) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.005922 / 0.004328 (0.001594) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008303 / 0.004250 (0.004053) |
| read_col_unformated after write_array2d | 0.054370 / 0.037052 (0.017318) |
| read_col_unformated after write_flattened_sequence | 0.401721 / 0.258489 (0.143232) |
| read_col_unformated after write_nested_sequence | 0.449124 / 0.293841 (0.155283) |
| read_formatted_as_numpy after write_array2d | 0.046084 / 0.128546 (-0.082462) |
| read_formatted_as_numpy after write_flattened_sequence | 0.014578 / 0.075646 (-0.061068) |
| read_formatted_as_numpy after write_nested_sequence | 0.340334 / 0.419271 (-0.078937) |
| read_unformated after write_array2d | 0.072998 / 0.043533 (0.029465) |
| read_unformated after write_flattened_sequence | 0.359041 / 0.255139 (0.103902) |
| read_unformated after write_nested_sequence | 0.401148 / 0.283200 (0.117948) |
| write_array2d | 0.105676 / 0.141683 (-0.036007) |
| write_flattened_sequence | 1.768849 / 1.452155 (0.316694) |
| write_nested_sequence | 1.800244 / 1.492716 (0.307527) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.225232 / 0.018006 (0.207226) |
| get_batch_of_1024_rows | 0.534390 / 0.000490 (0.533900) |
| get_first_row | 0.004568 / 0.000200 (0.004368) |
| get_last_row | 0.000100 / 0.000054 (0.000046) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.023966 / 0.037411 (-0.013445) |
| shard | 0.109510 / 0.014526 (0.094984) |
| shuffle | 0.133983 / 0.176557 (-0.042574) |
| sort | 0.169182 / 0.737135 (-0.567953) |
| train_test_split | 0.131558 / 0.296338 (-0.164781) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.591239 / 0.215209 (0.376030) |
| read 50000 | 5.863246 / 2.077655 (3.785591) |
| read_batch 50000 10 | 2.320971 / 1.504120 (0.816851) |
| read_batch 50000 100 | 2.008839 / 1.541195 (0.467645) |
| read_batch 50000 1000 | 1.973544 / 1.468490 (0.505054) |
| read_formatted numpy 5000 | 0.716105 / 4.584777 (-3.868672) |
| read_formatted pandas 5000 | 5.452224 / 3.745712 (1.706512) |
| read_formatted tensorflow 5000 | 2.938994 / 5.269862 (-2.330867) |
| read_formatted torch 5000 | 2.163566 / 4.565676 (-2.402110) |
| read_formatted_batch numpy 5000 10 | 0.084283 / 0.424275 (-0.339992) |
| read_formatted_batch numpy 5000 1000 | 0.012537 / 0.007607 (0.004930) |
| shuffled read 5000 | 0.757722 / 0.226044 (0.531677) |
| shuffled read 50000 | 7.397843 / 2.268929 (5.128915) |
| shuffled read_batch 50000 10 | 3.025123 / 55.444624 (-52.419502) |
| shuffled read_batch 50000 100 | 2.342577 / 6.876477 (-4.533900) |
| shuffled read_batch 50000 1000 | 2.367723 / 2.142072 (0.225650) |
| shuffled read_formatted numpy 5000 | 0.911210 / 4.805227 (-3.894017) |
| shuffled read_formatted_batch numpy 5000 10 | 0.178047 / 6.500664 (-6.322617) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072350 / 0.075469 (-0.003119) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.844835 / 1.841788 (0.003047) |
| map fast-tokenizer batched | 15.927697 / 8.074308 (7.853389) |
| map identity | 44.099836 / 10.191392 (33.908444) |
| map identity batched | 1.112426 / 0.680424 (0.432002) |
| map no-op batched | 0.663189 / 0.534201 (0.128988) |
| map no-op batched numpy | 0.469113 / 0.579283 (-0.110170) |
| map no-op batched pandas | 0.606109 / 0.434364 (0.171745) |
| map no-op batched pytorch | 0.351586 / 0.540337 (-0.188752) |
| map no-op batched tensorflow | 0.364605 / 1.386936 (-1.022331) |

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007649 / 0.011353 (-0.003704) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004964 / 0.011008 (-0.006044) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.039234 / 0.038508 (0.000726) |
| read_batch_unformated after write_array2d | 0.034317 / 0.023109 (0.011208) |
| read_batch_unformated after write_flattened_sequence | 0.431229 / 0.275898 (0.155331) |
| read_batch_unformated after write_nested_sequence | 0.490248 / 0.323480 (0.166768) |
| read_col_formatted_as_numpy after write_array2d | 0.004999 / 0.007986 (-0.002987) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004042 / 0.004328 (-0.000287) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006093 / 0.004250 (0.001843) |
| read_col_unformated after write_array2d | 0.045166 / 0.037052 (0.008114) |
| read_col_unformated after write_flattened_sequence | 0.437696 / 0.258489 (0.179206) |
| read_col_unformated after write_nested_sequence | 0.518814 / 0.293841 (0.224973) |
| read_formatted_as_numpy after write_array2d | 0.046060 / 0.128546 (-0.082486) |
| read_formatted_as_numpy after write_flattened_sequence | 0.015012 / 0.075646 (-0.060634) |
| read_formatted_as_numpy after write_nested_sequence | 0.318860 / 0.419271 (-0.100411) |
| read_unformated after write_array2d | 0.070462 / 0.043533 (0.026929) |
| read_unformated after write_flattened_sequence | 0.422449 / 0.255139 (0.167310) |
| read_unformated after write_nested_sequence | 0.445746 / 0.283200 (0.162547) |
| write_array2d | 0.111066 / 0.141683 (-0.030617) |
| write_flattened_sequence | 1.777043 / 1.452155 (0.324888) |
| write_nested_sequence | 1.762258 / 1.492716 (0.269541) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.302424 / 0.018006 (0.284418) |
| get_batch_of_1024_rows | 0.511372 / 0.000490 (0.510882) |
| get_first_row | 0.016962 / 0.000200 (0.016762) |
| get_last_row | 0.000191 / 0.000054 (0.000137) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.025538 / 0.037411 (-0.011873) |
| shard | 0.109345 / 0.014526 (0.094819) |
| shuffle | 0.121863 / 0.176557 (-0.054693) |
| sort | 0.169837 / 0.737135 (-0.567298) |
| train_test_split | 0.125511 / 0.296338 (-0.170828) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.648693 / 0.215209 (0.433483) |
| read 50000 | 6.608426 / 2.077655 (4.530772) |
| read_batch 50000 10 | 2.920987 / 1.504120 (1.416867) |
| read_batch 50000 100 | 2.278167 / 1.541195 (0.736972) |
| read_batch 50000 1000 | 2.353479 / 1.468490 (0.884989) |
| read_formatted numpy 5000 | 0.767275 / 4.584777 (-3.817502) |
| read_formatted pandas 5000 | 5.586597 / 3.745712 (1.840885) |
| read_formatted tensorflow 5000 | 2.972360 / 5.269862 (-2.297501) |
| read_formatted torch 5000 | 1.959986 / 4.565676 (-2.605691) |
| read_formatted_batch numpy 5000 10 | 0.090834 / 0.424275 (-0.333441) |
| read_formatted_batch numpy 5000 1000 | 0.014065 / 0.007607 (0.006458) |
| shuffled read 5000 | 0.827715 / 0.226044 (0.601671) |
| shuffled read 50000 | 8.163700 / 2.268929 (5.894771) |
| shuffled read_batch 50000 10 | 3.481433 / 55.444624 (-51.963192) |
| shuffled read_batch 50000 100 | 2.711858 / 6.876477 (-4.164618) |
| shuffled read_batch 50000 1000 | 2.823233 / 2.142072 (0.681160) |
| shuffled read_formatted numpy 5000 | 0.944363 / 4.805227 (-3.860864) |
| shuffled read_formatted_batch numpy 5000 10 | 0.190546 / 6.500664 (-6.310118) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.076274 / 0.075469 (0.000805) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.919629 / 1.841788 (0.077841) |
| map fast-tokenizer batched | 15.699204 / 8.074308 (7.624896) |
| map identity | 43.651733 / 10.191392 (33.460341) |
| map identity batched | 1.140524 / 0.680424 (0.460100) |
| map no-op batched | 0.737715 / 0.534201 (0.203514) |
| map no-op batched numpy | 0.478499 / 0.579283 (-0.100784) |
| map no-op batched pandas | 0.606225 / 0.434364 (0.171861) |
| map no-op batched pytorch | 0.360363 / 0.540337 (-0.179974) |
| map no-op batched tensorflow | 0.359212 / 1.386936 (-1.027724) |
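
Each benchmark cell above has the form `new / old (diff)`, where the diff is `new - old`, so a positive value means the new run was slower. As a small sketch, the cells can be parsed and sanity-checked as follows. `CELL`, `parse_cells`, and `is_consistent` are hypothetical names, and the tolerance is an assumption: the reported diffs appear to be computed before rounding (for example, 0.043093 - 0.038508 is 0.004585, but the report shows 0.004584).

```python
import re

# One cell looks like "0.009925 / 0.011353 (-0.001428)".
CELL = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cells(row: str):
    """Return a list of (new, old, diff) float triples found in a row."""
    return [tuple(float(g) for g in m.groups()) for m in CELL.finditer(row)]


def is_consistent(new: float, old: float, diff: float, tol: float = 2e-6) -> bool:
    """Check diff == new - old, allowing for rounding in the last digit."""
    return abs((new - old) - diff) <= tol


# First three cells of benchmark_array_xd.json under PyArrow==6.0.0.
row = (
    "0.009925 / 0.011353 (-0.001428) "
    "0.004971 / 0.011008 (-0.006037) "
    "0.043093 / 0.038508 (0.004584)"
)
cells = parse_cells(row)
regressions = [c for c in cells if c[2] > 0]  # cells where new is slower than old
print(len(cells), "cells,", len(regressions), "slower than baseline")
```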
