Skip to content

Commit

Permalink
[SPARK-43348][PYTHON] Support Python 3.8 in PyPy3
Browse files Browse the repository at this point in the history
### What changes were proposed in this pull request?

This PR aims two goals.
1. Make PySpark support Python 3.8+ with PyPy3
2. Upgrade PyPy3 to Python 3.8 in our GitHub Action Infra Image to enable test coverage

Note that there was one failure at `test_create_dataframe_from_pandas_with_day_time_interval` test case. This PR skips the test case and SPARK-43354 will recover it after further investigation.

### Why are the changes needed?

Previously, PySpark fails at PyPy3 `Python 3.8` environment.
```
pypy3 version is: Python 3.8.16 (a9dbdca6fc3286b0addd2240f11d97d8e8de187a, Dec 29 2022, 11:45:13)
[PyPy 7.3.11 with GCC 10.2.1 20210130 (Red Hat 10.2.1-11)]
Starting test(pypy3): pyspark.sql.tests.pandas.test_pandas_cogrouped_map (temp output: /__w/spark/spark/python/target/f1cacde7-d369-48cf-a8ea-724c42872020/pypy3__pyspark.sql.tests.pandas.test_pandas_cogrouped_map__rxih6dqu.log)
Traceback (most recent call last):
  File "/usr/local/pypy/pypy3.8/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/local/pypy/pypy3.8/lib/pypy3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in <module>
    from pyspark.java_gateway import local_connect_and_auth
  File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in <module>
    from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
  File "/__w/spark/spark/python/pyspark/serializers.py", line 69, in <module>
    from pyspark import cloudpickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 1, in <module>
    from pyspark.cloudpickle.cloudpickle import *  # noqa
  File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 56, in <module>
    from .compat import pickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in <module>
    from _pickle import Pickler  # noqa: F401
ModuleNotFoundError: No module named '_pickle'
```

To support Python 3.8 in PyPy3.
- From PyPy3.8, `_pickle` is removed.
  - cloudpipe/cloudpickle#458
- We need this change.
  - cloudpipe/cloudpickle#469

### Does this PR introduce _any_ user-facing change?

This is an additional support.

### How was this patch tested?

Pass the CIs.

Closes apache#41024 from dongjoon-hyun/SPARK-43348.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
  • Loading branch information
dongjoon-hyun authored and LuciferYang committed May 10, 2023
1 parent bb683b4 commit ca910a3
Show file tree
Hide file tree
Showing 3 changed files with 8 additions and 15 deletions.
8 changes: 4 additions & 4 deletions dev/infra/Dockerfile
Expand Up @@ -38,10 +38,10 @@ RUN apt update
RUN $APT_INSTALL gfortran libopenblas-dev liblapack-dev
RUN $APT_INSTALL build-essential

RUN mkdir -p /usr/local/pypy/pypy3.7 && \
curl -sqL https://downloads.python.org/pypy/pypy3.7-v7.3.7-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.7 --strip-components=1 && \
ln -sf /usr/local/pypy/pypy3.7/bin/pypy /usr/local/bin/pypy3.7 && \
ln -sf /usr/local/pypy/pypy3.7/bin/pypy /usr/local/bin/pypy3
RUN mkdir -p /usr/local/pypy/pypy3.8 && \
curl -sqL https://downloads.python.org/pypy/pypy3.8-v7.3.11-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.8 --strip-components=1 && \
ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3.8 && \
ln -sf /usr/local/pypy/pypy3.8/bin/pypy /usr/local/bin/pypy3

RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3

Expand Down
12 changes: 2 additions & 10 deletions python/pyspark/cloudpickle/compat.py
@@ -1,13 +1,5 @@
import sys


if sys.version_info < (3, 8):
try:
import pickle5 as pickle # noqa: F401
from pickle5 import Pickler # noqa: F401
except ImportError:
import pickle # noqa: F401
from pickle import _Pickler as Pickler # noqa: F401
else:
import pickle # noqa: F401
from _pickle import Pickler # noqa: F401
import pickle # noqa: F401
from pickle import Pickler # noqa: F401
3 changes: 2 additions & 1 deletion python/pyspark/sql/tests/test_dataframe.py
Expand Up @@ -1454,7 +1454,8 @@ def test_create_dataframe_from_pandas_with_dst(self):
os.environ["TZ"] = orig_env_tz
time.tzset()

@unittest.skipIf(not have_pandas, pandas_requirement_message) # type: ignore
# TODO(SPARK-43354): Re-enable test_create_dataframe_from_pandas_with_day_time_interval
@unittest.skip("Fails in PyPy Python 3.8, should enable.")
def test_create_dataframe_from_pandas_with_day_time_interval(self):
# SPARK-37277: Test DayTimeIntervalType in createDataFrame without Arrow.
import pandas as pd
Expand Down

0 comments on commit ca910a3

Please sign in to comment.