Drop Python 3.6 support (#4460)
* Remove python 3.6 code

* Update requirements

* Style

* Update audio gh action

* Benchmarks fix attempt #1

* Benchmarks fix attempt no.2

* Use newer image

* Remove backticks

* Add suggested command to benchmark action

* Avoid some FutureWarnings and DeprecationWarnings

* Disable test

* Remove 3.6 pickling test

* CI test

* Use python 3.7 in ubuntu-latest

* Disable s3 test on Linux

* Remove weird json file

* Remove cloudpickle stuff

* Use lower torchaudio version

* Try to fix s3 errors

* Another attempt

* Disable test
mariosasko committed Jul 26, 2022
1 parent 10b1355 commit 75e6b74
Showing 15 changed files with 55 additions and 168 deletions.
1 change: 0 additions & 1 deletion .github/hub/update_hub_repositories.py
@@ -1,4 +1,3 @@
import base64
import distutils.dir_util
import logging
import os
5 changes: 4 additions & 1 deletion .github/workflows/benchmarks.yaml
@@ -3,13 +3,16 @@ on: [push]
 jobs:
   run:
     runs-on: [ubuntu-latest]
-    container: docker://dvcorg/cml-py3:latest
+    container: docker://dvcorg/cml:latest
     steps:
       - uses: actions/checkout@v2
       - name: cml_run
         env:
           repo_token: ${{ secrets.GITHUB_TOKEN }}
         run: |
+          # See https://github.com/actions/checkout/issues/760
+          git config --global --add safe.directory /__w/datasets/datasets
           # Your ML workflow goes here
           pip install --upgrade pip
10 changes: 2 additions & 8 deletions .github/workflows/ci.yml
@@ -21,7 +21,7 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v4
         with:
-          python-version: "3.6"
+          python-version: "3.7"
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
@@ -49,21 +49,15 @@ jobs:
       - uses: actions/checkout@v3
         with:
           fetch-depth: 0
-      - name: Set up Python 3.6
-        if: ${{ matrix.os == 'ubuntu-latest' }}
-        uses: actions/setup-python@v4
-        with:
-          python-version: 3.6
       - name: Set up Python 3.7
-        if: ${{ matrix.os == 'windows-latest' }}
         uses: actions/setup-python@v4
         with:
           python-version: 3.7
       - name: Upgrade pip
         run: python -m pip install --upgrade pip
       - name: Pin setuptools-scm
         if: ${{ matrix.os == 'ubuntu-latest' }}
-        run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.6" && pip install "setuptools-scm==6.4.2"
+        run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2"
       - name: Install dependencies
         run: |
           pip install .[tests]
4 changes: 2 additions & 2 deletions Makefile
@@ -3,14 +3,14 @@
 # Check that source code meets quality standards

 quality:
-	black --check --line-length 119 --target-version py36 tests src benchmarks datasets/**/*.py metrics
+	black --check --line-length 119 --target-version py37 tests src benchmarks datasets/**/*.py metrics
 	isort --check-only tests src benchmarks datasets/**/*.py metrics
 	flake8 tests src benchmarks datasets/**/*.py metrics

 # Format source code automatically

 style:
-	black --line-length 119 --target-version py36 tests src benchmarks datasets/**/*.py metrics
+	black --line-length 119 --target-version py37 tests src benchmarks datasets/**/*.py metrics
 	isort tests src benchmarks datasets/**/*.py metrics

 # Run tests for the library
2 changes: 1 addition & 1 deletion additional-tests-requirements.txt
@@ -1,4 +1,4 @@
-unbabel-comet>=1.0.0;python_version>'3.6'
+unbabel-comet>=1.0.0
git+https://github.com/google-research/bleurt.git
git+https://github.com/ns-moosavi/coval.git
git+https://github.com/hendrycks/math.git
2 changes: 1 addition & 1 deletion docs/source/installation.md
@@ -1,6 +1,6 @@
# Installation

-Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.6+**.
+Before you start, you'll need to setup your environment and install the appropriate packages. 🤗 Datasets is tested on **Python 3.7+**.

<Tip>

11 changes: 3 additions & 8 deletions setup.py
@@ -55,7 +55,6 @@
 Then push the change with a message 'set dev version'
 """

-import os

 from setuptools import find_packages, setup

@@ -74,8 +73,6 @@
     "requests>=2.19.0",
     # progress bars in download and scripts
     "tqdm>=4.62.1",
-    # dataclasses for Python versions that don't have it
-    "dataclasses;python_version<'3.7'",
     # for fast hashing
     "xxhash",
     # for better multiprocessing
@@ -105,7 +102,7 @@
 BENCHMARKS_REQUIRE = [
     "numpy==1.18.5",
     "tensorflow==2.3.0",
-    "torch==1.6.0",
+    "torch==1.7.1",
     "transformers==3.0.2",
 ]

@@ -128,7 +125,7 @@
     "s3fs>=2021.11.1", # aligned with fsspec[http]>=2021.11.1
     "tensorflow>=2.3,!=2.6.0,!=2.6.1",
     "torch",
-    "torchaudio",
+    "torchaudio<0.12.0",
     "soundfile",
     "transformers",
     # datasets dependencies
@@ -165,8 +162,6 @@
     "texttable>=1.6.3",
     "Werkzeug>=1.0.1",
     "six~=1.15.0",
-    # metadata validation
-    "importlib_resources;python_version<'3.7'",
 ]

 TESTS_REQUIRE.extend(VISION_REQURE)
@@ -214,6 +209,7 @@
     packages=find_packages("src"),
     package_data={"datasets": ["py.typed", "scripts/templates/*"], "datasets.utils.resources": ["*.json", "*.yaml", "*.tsv"]},
     entry_points={"console_scripts": ["datasets-cli=datasets.commands.datasets_cli:main"]},
+    python_requires=">=3.7.0",
     install_requires=REQUIRED_PKGS,
     extras_require=EXTRAS_REQUIRE,
     classifiers=[
@@ -224,7 +220,6 @@
         "License :: OSI Approved :: Apache Software License",
         "Operating System :: OS Independent",
         "Programming Language :: Python :: 3",
-        "Programming Language :: Python :: 3.6",
         "Programming Language :: Python :: 3.7",
         "Programming Language :: Python :: 3.8",
         "Programming Language :: Python :: 3.9",
8 changes: 8 additions & 0 deletions src/datasets/__init__.py
@@ -19,10 +19,17 @@

 __version__ = "2.4.1.dev0"

+import platform
+
 import pyarrow
 from packaging import version


+if version.parse(platform.python_version()) < version.parse("3.7"):
+    raise ImportWarning(
+        "To use `datasets`, Python>=3.7 is required, and the current version of Python doesn't match this condition."
+    )
+
 if version.parse(pyarrow.__version__).major < 6:
     raise ImportWarning(
         "To use `datasets`, the module `pyarrow>=6.0.0` is required, and the current version of `pyarrow` doesn't match this condition.\n"
@@ -31,6 +38,7 @@

 SCRIPTS_VERSION = "main" if version.parse(__version__).is_devrelease else __version__

+del platform
 del pyarrow
 del version

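The import-time guard added to `src/datasets/__init__.py` can also be sketched with the standard library alone. The block below is an illustrative stand-in, not the code from this commit: it assumes `sys.version_info` in place of `packaging.version`, and `MIN_PYTHON`, `is_supported`, and `ensure_supported` are hypothetical names.

```python
import sys

# Assumption: mirrors the ">=3.7" floor that this commit sets.
MIN_PYTHON = (3, 7)


def is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when (major, minor) of the interpreter meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def ensure_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Raise ImportWarning, like the guard in the diff, when the version is too old."""
    if not is_supported(version_info, minimum):
        required = ".".join(str(part) for part in minimum)
        raise ImportWarning(f"Python>={required} is required.")


# Runs the check against the current interpreter.
ensure_supported()
```

Comparing tuples orders `(3, 10)` after `(3, 7)`, which a plain string comparison of `"3.10"` and `"3.7"` would get wrong.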
2 changes: 1 addition & 1 deletion src/datasets/features/features.py
@@ -824,7 +824,7 @@ def __getitem__(self, item: Union[int, slice, np.ndarray]) -> Union[np.ndarray,
     def take(
         self, indices: Sequence_[int], allow_fill: bool = False, fill_value: bool = None
     ) -> "PandasArrayExtensionArray":
-        indices: np.ndarray = np.asarray(indices, dtype=np.int)
+        indices: np.ndarray = np.asarray(indices, dtype=int)
         if allow_fill:
             fill_value = (
                 self.dtype.na_value if fill_value is None else np.asarray(fill_value, dtype=self.dtype.value_type)
58 changes: 1 addition & 57 deletions src/datasets/utils/py_utils.py
@@ -22,17 +22,15 @@
 import functools
 import itertools
 import os
-import pickle
 import re
-import sys
 import types
 from contextlib import contextmanager
 from dataclasses import fields, is_dataclass
 from io import BytesIO as StringIO
 from multiprocessing import Pool, RLock
 from shutil import disk_usage
 from types import CodeType, FunctionType
-from typing import Callable, ClassVar, Dict, Generic, List, Optional, Tuple, Union
+from typing import Dict, List, Optional, Tuple, Union
 from urllib.parse import urlparse

 import dill
@@ -552,19 +550,6 @@ class Pickler(dill.Pickler):

     dispatch = dill._dill.MetaCatchingDict(dill.Pickler.dispatch.copy())

-    def save_global(self, obj, name=None):
-        if sys.version_info[:2] < (3, 7) and _CloudPickleTypeHintFix._is_parametrized_type_hint(
-            obj
-        ): # noqa # pragma: no branch
-            # Parametrized typing constructs in Python < 3.7 are not compatible
-            # with type checks and ``isinstance`` semantics. For this reason,
-            # it is easier to detect them using a duck-typing-based check
-            # (``_is_parametrized_type_hint``) than to populate the Pickler's
-            # dispatch with type-specific savers.
-            _CloudPickleTypeHintFix._save_parametrized_type_hint(self, obj)
-        else:
-            dill.Pickler.save_global(self, obj, name=name)
-
     def memoize(self, obj):
         # don't memoize strings since two identical strings can have different python ids
         if type(obj) != str:
@@ -610,47 +595,6 @@ def proxy(func):
     return proxy


-class _CloudPickleTypeHintFix:
-    """
-    Type hints can't be properly pickled in python < 3.7
-    CloudPickle provided a way to make it work in older versions.
-    This class provide utilities to fix pickling of type hints in older versions.
-    from https://github.com/cloudpipe/cloudpickle/pull/318/files
-    """
-
-    def _is_parametrized_type_hint(obj):
-        # This is very cheap but might generate false positives.
-        origin = getattr(obj, "__origin__", None) # typing Constructs
-        values = getattr(obj, "__values__", None) # typing_extensions.Literal
-        type_ = getattr(obj, "__type__", None) # typing_extensions.Final
-        return origin is not None or values is not None or type_ is not None
-
-    def _create_parametrized_type_hint(origin, args):
-        return origin[args]
-
-    def _save_parametrized_type_hint(pickler, obj):
-        # The distorted type check sematic for typing construct becomes:
-        # ``type(obj) is type(TypeHint)``, which means "obj is a
-        # parametrized TypeHint"
-        if type(obj) is type(Literal): # pragma: no branch
-            initargs = (Literal, obj.__values__)
-        elif type(obj) is type(Final): # pragma: no branch
-            initargs = (Final, obj.__type__)
-        elif type(obj) is type(ClassVar):
-            initargs = (ClassVar, obj.__type__)
-        elif type(obj) in [type(Union), type(Tuple), type(Generic)]:
-            initargs = (obj.__origin__, obj.__args__)
-        elif type(obj) is type(Callable):
-            args = obj.__args__
-            if args[0] is Ellipsis:
-                initargs = (obj.__origin__, args)
-            else:
-                initargs = (obj.__origin__, (list(args[:-1]), args[-1]))
-        else: # pragma: no cover
-            raise pickle.PicklingError(f"Datasets pickle Error: Unknown type {type(obj)}")
-        pickler.save_reduce(_CloudPickleTypeHintFix._create_parametrized_type_hint, initargs, obj=obj)
-
-
 @pklregister(CodeType)
 def _save_code(pickler, obj):
     """
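The removed `_CloudPickleTypeHintFix` existed only because parametrized type hints could not be pickled before Python 3.7. A minimal sketch of why it is no longer needed, assuming Python 3.7+ and using the standard-library `pickle` as a stand-in for the dill-based `Pickler` in `py_utils.py` (`roundtrip`, `HINTS`, and `RESULTS` are hypothetical names for illustration):

```python
import pickle
from typing import Dict, List, Optional


def roundtrip(obj):
    """Serialize and deserialize an object with the standard pickle module."""
    return pickle.loads(pickle.dumps(obj))


# Parametrized hints of the kind the removed workaround had to special-case.
HINTS = [List[int], Dict[str, int], Optional[int]]

# Each entry is True when the hint survives a pickle round trip unchanged.
RESULTS = [roundtrip(hint) == hint for hint in HINTS]
```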
51 changes: 15 additions & 36 deletions tests/commands/test_dummy_data.py
@@ -1,45 +1,24 @@
 import os
 from collections import namedtuple
-from dataclasses import dataclass

-from packaging import version
-
-from datasets import config
 from datasets.commands.dummy_data import DummyDataCommand


-if config.PY_VERSION >= version.parse("3.7"):
-    DummyDataCommandArgs = namedtuple(
-        "DummyDataCommandArgs",
-        [
-            "path_to_dataset",
-            "auto_generate",
-            "n_lines",
-            "json_field",
-            "xml_tag",
-            "match_text_files",
-            "keep_uncompressed",
-            "cache_dir",
-            "encoding",
-        ],
-        defaults=[False, 5, None, None, None, False, None, None],
-    )
-else:
-
-    @dataclass
-    class DummyDataCommandArgs:
-        path_to_dataset: str
-        auto_generate: bool = False
-        n_lines: int = 5
-        json_field: str = None
-        xml_tag: str = None
-        match_text_files: str = None
-        keep_uncompressed: bool = False
-        cache_dir: str = None
-        encoding: str = None
-
-        def __iter__(self):
-            return iter(self.__dict__.values())
+DummyDataCommandArgs = namedtuple(
+    "DummyDataCommandArgs",
+    [
+        "path_to_dataset",
+        "auto_generate",
+        "n_lines",
+        "json_field",
+        "xml_tag",
+        "match_text_files",
+        "keep_uncompressed",
+        "cache_dir",
+        "encoding",
+    ],
+    defaults=[False, 5, None, None, None, False, None, None],
+)


 class MockDummyDataCommand(DummyDataCommand):
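The test now relies on the `defaults` argument of `namedtuple`, which is only available from Python 3.7 onward. A shortened sketch of how those defaults line up with the field list (`ExampleArgs` and its three fields are a hypothetical reduction of the nine-field tuple above):

```python
from collections import namedtuple

# Hypothetical three-field tuple, shortened from the nine fields used in the test.
ExampleArgs = namedtuple(
    "ExampleArgs",
    ["path_to_dataset", "auto_generate", "n_lines"],
    # Defaults are matched to the rightmost fields, so the first field stays required.
    defaults=[False, 5],
)

args = ExampleArgs("some/dataset/path")

# A namedtuple is a tuple, so it iterates over its values in field order.
unpacked = list(args)
```

Because the tuple iterates over its values directly, it needs no custom `__iter__`, which the removed dataclass fallback had to provide.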
50 changes: 15 additions & 35 deletions tests/commands/test_test.py
@@ -1,46 +1,26 @@
 import json
 import os
 from collections import namedtuple
-from dataclasses import dataclass

-from packaging import version
-
 from datasets import config
 from datasets.commands.test import TestCommand


-if config.PY_VERSION >= version.parse("3.7"):
-    _TestCommandArgs = namedtuple(
-        "_TestCommandArgs",
-        [
-            "dataset",
-            "name",
-            "cache_dir",
-            "data_dir",
-            "all_configs",
-            "save_infos",
-            "ignore_verifications",
-            "force_redownload",
-            "clear_cache",
-        ],
-        defaults=[None, None, None, False, False, False, False, False],
-    )
-else:
-
-    @dataclass
-    class _TestCommandArgs:
-        dataset: str
-        name: str = None
-        cache_dir: str = None
-        data_dir: str = None
-        all_configs: bool = False
-        save_infos: bool = False
-        ignore_verifications: bool = False
-        force_redownload: bool = False
-        clear_cache: bool = False
-
-        def __iter__(self):
-            return iter(self.__dict__.values())
+_TestCommandArgs = namedtuple(
+    "_TestCommandArgs",
+    [
+        "dataset",
+        "name",
+        "cache_dir",
+        "data_dir",
+        "all_configs",
+        "save_infos",
+        "ignore_verifications",
+        "force_redownload",
+        "clear_cache",
+    ],
+    defaults=[None, None, None, False, False, False, False, False],
+)


 def test_test_command(dataset_loading_script_dir):
2 changes: 1 addition & 1 deletion tests/test_arrow_dataset.py
@@ -3118,7 +3118,7 @@ def test_pickle_dataset_after_transforming_the_table(in_memory, method_and_param


 @pytest.mark.skipif(
-    os.name == "nt" and (os.getenv("CIRCLECI") == "true" or os.getenv("GITHUB_ACTIONS") == "true"),
+    os.name in ["nt", "posix"] and (os.getenv("CIRCLECI") == "true" or os.getenv("GITHUB_ACTIONS") == "true"),
     reason='On Windows CircleCI or GitHub Actions, it raises botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5555/test"',
 ) # TODO: find what's wrong with CircleCI / GitHub Actions
 @require_s3
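The widened skip condition can be read as a small predicate. The sketch below restates it as a plain function so its effect is easy to check; `should_skip_s3_test` is a hypothetical helper, and the environment is passed in as a dictionary instead of being read through `os.getenv`.

```python
def should_skip_s3_test(os_name, env):
    """Restate the new skipif condition: a Windows or POSIX runner on CircleCI or GitHub Actions."""
    on_ci = env.get("CIRCLECI") == "true" or env.get("GITHUB_ACTIONS") == "true"
    return os_name in ["nt", "posix"] and on_ci


# Before this commit only "nt" matched; Linux runners report "posix".
linux_on_actions = should_skip_s3_test("posix", {"GITHUB_ACTIONS": "true"})
linux_local = should_skip_s3_test("posix", {})
```

Since Linux and macOS both report `os.name == "posix"`, the test is now skipped on those CI runners as well and still runs on a local machine.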

1 comment on commit 75e6b74

@github-actions
Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.007443 / 0.011353 (-0.003910) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003637 / 0.011008 (-0.007371) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028211 / 0.038508 (-0.010297) |
| read_batch_unformated after write_array2d | 0.029301 / 0.023109 (0.006192) |
| read_batch_unformated after write_flattened_sequence | 0.306236 / 0.275898 (0.030337) |
| read_batch_unformated after write_nested_sequence | 0.338746 / 0.323480 (0.015266) |
| read_col_formatted_as_numpy after write_array2d | 0.005442 / 0.007986 (-0.002543) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003048 / 0.004328 (-0.001280) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.006498 / 0.004250 (0.002248) |
| read_col_unformated after write_array2d | 0.035263 / 0.037052 (-0.001790) |
| read_col_unformated after write_flattened_sequence | 0.317752 / 0.258489 (0.059263) |
| read_col_unformated after write_nested_sequence | 0.387229 / 0.293841 (0.093388) |
| read_formatted_as_numpy after write_array2d | 0.028755 / 0.128546 (-0.099792) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009316 / 0.075646 (-0.066331) |
| read_formatted_as_numpy after write_nested_sequence | 0.244598 / 0.419271 (-0.174674) |
| read_unformated after write_array2d | 0.045146 / 0.043533 (0.001613) |
| read_unformated after write_flattened_sequence | 0.311775 / 0.255139 (0.056636) |
| read_unformated after write_nested_sequence | 0.337944 / 0.283200 (0.054744) |
| write_array2d | 0.088855 / 0.141683 (-0.052828) |
| write_flattened_sequence | 1.496429 / 1.452155 (0.044274) |
| write_nested_sequence | 1.531159 / 1.492716 (0.038443) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.201073 / 0.018006 (0.183066) |
| get_batch_of_1024_rows | 0.411767 / 0.000490 (0.411277) |
| get_first_row | 0.003673 / 0.000200 (0.003473) |
| get_last_row | 0.000076 / 0.000054 (0.000021) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.021532 / 0.037411 (-0.015880) |
| shard | 0.091954 / 0.014526 (0.077428) |
| shuffle | 0.104948 / 0.176557 (-0.071608) |
| sort | 0.147945 / 0.737135 (-0.589190) |
| train_test_split | 0.106196 / 0.296338 (-0.190143) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.412284 / 0.215209 (0.197075) |
| read 50000 | 4.105855 / 2.077655 (2.028200) |
| read_batch 50000 10 | 1.850861 / 1.504120 (0.346741) |
| read_batch 50000 100 | 1.656411 / 1.541195 (0.115216) |
| read_batch 50000 1000 | 1.706361 / 1.468490 (0.237871) |
| read_formatted numpy 5000 | 0.451462 / 4.584777 (-4.133315) |
| read_formatted pandas 5000 | 3.358766 / 3.745712 (-0.386946) |
| read_formatted tensorflow 5000 | 3.350529 / 5.269862 (-1.919332) |
| read_formatted torch 5000 | 1.667967 / 4.565676 (-2.897709) |
| read_formatted_batch numpy 5000 10 | 0.053176 / 0.424275 (-0.371099) |
| read_formatted_batch numpy 5000 1000 | 0.011193 / 0.007607 (0.003585) |
| shuffled read 5000 | 0.516491 / 0.226044 (0.290446) |
| shuffled read 50000 | 5.153469 / 2.268929 (2.884540) |
| shuffled read_batch 50000 10 | 2.276831 / 55.444624 (-53.167794) |
| shuffled read_batch 50000 100 | 1.928199 / 6.876477 (-4.948278) |
| shuffled read_batch 50000 1000 | 2.011626 / 2.142072 (-0.130446) |
| shuffled read_formatted numpy 5000 | 0.563538 / 4.805227 (-4.241689) |
| shuffled read_formatted_batch numpy 5000 10 | 0.117250 / 6.500664 (-6.383414) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.062300 / 0.075469 (-0.013169) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.507849 / 1.841788 (-0.333939) |
| map fast-tokenizer batched | 12.639719 / 8.074308 (4.565411) |
| map identity | 25.790324 / 10.191392 (15.598932) |
| map identity batched | 0.881571 / 0.680424 (0.201148) |
| map no-op batched | 0.616830 / 0.534201 (0.082629) |
| map no-op batched numpy | 0.345950 / 0.579283 (-0.233333) |
| map no-op batched pandas | 0.392830 / 0.434364 (-0.041534) |
| map no-op batched pytorch | 0.237009 / 0.540337 (-0.303328) |
| map no-op batched tensorflow | 0.235008 / 1.386936 (-1.151928) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005218 / 0.011353 (-0.006135) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003429 / 0.011008 (-0.007580) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.027082 / 0.038508 (-0.011426) |
| read_batch_unformated after write_array2d | 0.026926 / 0.023109 (0.003817) |
| read_batch_unformated after write_flattened_sequence | 0.300759 / 0.275898 (0.024861) |
| read_batch_unformated after write_nested_sequence | 0.362158 / 0.323480 (0.038678) |
| read_col_formatted_as_numpy after write_array2d | 0.003145 / 0.007986 (-0.004840) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004044 / 0.004328 (-0.000284) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.004589 / 0.004250 (0.000338) |
| read_col_unformated after write_array2d | 0.033428 / 0.037052 (-0.003625) |
| read_col_unformated after write_flattened_sequence | 0.307868 / 0.258489 (0.049379) |
| read_col_unformated after write_nested_sequence | 0.356863 / 0.293841 (0.063022) |
| read_formatted_as_numpy after write_array2d | 0.026071 / 0.128546 (-0.102476) |
| read_formatted_as_numpy after write_flattened_sequence | 0.009285 / 0.075646 (-0.066362) |
| read_formatted_as_numpy after write_nested_sequence | 0.249764 / 0.419271 (-0.169508) |
| read_unformated after write_array2d | 0.045794 / 0.043533 (0.002261) |
| read_unformated after write_flattened_sequence | 0.306907 / 0.255139 (0.051768) |
| read_unformated after write_nested_sequence | 0.338089 / 0.283200 (0.054890) |
| write_array2d | 0.086854 / 0.141683 (-0.054829) |
| write_flattened_sequence | 1.472315 / 1.452155 (0.020161) |
| write_nested_sequence | 1.515112 / 1.492716 (0.022395) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.214860 / 0.018006 (0.196854) |
| get_batch_of_1024_rows | 0.407910 / 0.000490 (0.407420) |
| get_first_row | 0.002522 / 0.000200 (0.002322) |
| get_last_row | 0.000068 / 0.000054 (0.000014) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022538 / 0.037411 (-0.014873) |
| shard | 0.093905 / 0.014526 (0.079379) |
| shuffle | 0.106768 / 0.176557 (-0.069788) |
| sort | 0.145679 / 0.737135 (-0.591456) |
| train_test_split | 0.107904 / 0.296338 (-0.188435) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.419733 / 0.215209 (0.204524) |
| read 50000 | 4.186076 / 2.077655 (2.108422) |
| read_batch 50000 10 | 1.909746 / 1.504120 (0.405626) |
| read_batch 50000 100 | 1.712498 / 1.541195 (0.171304) |
| read_batch 50000 1000 | 1.698208 / 1.468490 (0.229718) |
| read_formatted numpy 5000 | 0.449030 / 4.584777 (-4.135747) |
| read_formatted pandas 5000 | 3.379613 / 3.745712 (-0.366099) |
| read_formatted tensorflow 5000 | 1.804818 / 5.269862 (-3.465044) |
| read_formatted torch 5000 | 1.086305 / 4.565676 (-3.479371) |
| read_formatted_batch numpy 5000 10 | 0.053008 / 0.424275 (-0.371267) |
| read_formatted_batch numpy 5000 1000 | 0.010687 / 0.007607 (0.003080) |
| shuffled read 5000 | 0.522182 / 0.226044 (0.296137) |
| shuffled read 50000 | 5.239198 / 2.268929 (2.970269) |
| shuffled read_batch 50000 10 | 2.351364 / 55.444624 (-53.093260) |
| shuffled read_batch 50000 100 | 2.007474 / 6.876477 (-4.869002) |
| shuffled read_batch 50000 1000 | 2.064110 / 2.142072 (-0.077963) |
| shuffled read_formatted numpy 5000 | 0.557781 / 4.805227 (-4.247447) |
| shuffled read_formatted_batch numpy 5000 10 | 0.119885 / 6.500664 (-6.380779) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.064894 / 0.075469 (-0.010575) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.518801 / 1.841788 (-0.322987) |
| map fast-tokenizer batched | 12.380097 / 8.074308 (4.305789) |
| map identity | 26.143738 / 10.191392 (15.952346) |
| map identity batched | 0.863504 / 0.680424 (0.183080) |
| map no-op batched | 0.600542 / 0.534201 (0.066341) |
| map no-op batched numpy | 0.347263 / 0.579283 (-0.232021) |
| map no-op batched pandas | 0.401762 / 0.434364 (-0.032602) |
| map no-op batched pytorch | 0.243870 / 0.540337 (-0.296468) |
| map no-op batched tensorflow | 0.249442 / 1.386936 (-1.137494) |
