
[POC][SPARK-39522][INFRA] Uses Docker image cache over a custom image #36980

Closed

Conversation

Yikun
Member

@Yikun Yikun commented Jun 24, 2022

What changes were proposed in this pull request?

(image attached in the original PR description)

  • .github/workflows/build_infra_images_cache.yml: builds the Docker image cache; this cache is used by the infra-image job.
  • .github/workflows/build_and_test.yml: the infra-image job builds the image in each developer's repo from the latest Dockerfile, using the infra image cache to speed the build up (a workflow sketch follows below).
  • dev/infra/Dockerfile: moves the frequently changing Dockerfile lines to the end to make the most of the cache (see the layer-ordering sketch right after this list).
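To illustrate the layer-ordering idea, here is a minimal Dockerfile sketch (the base image and packages are illustrative, not the actual contents of dev/infra/Dockerfile):

# Rarely changing, expensive layers come first so they stay cached across PRs.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y build-essential r-base

# Frequently changing lines (e.g. Python package pins) come last, so a change here
# only invalidates the tail of the cache instead of forcing a full rebuild.
RUN python3 -m pip install numpy pandas pyarrow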

See more in: https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
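As a rough illustration of how the cache-building workflow could look (the trigger, job name, and image tags below are assumptions, not the exact contents of build_infra_images_cache.yml):

name: Build infra image cache
on:
  push:
    branches: [master]
    paths: ['dev/infra/Dockerfile']
jobs:
  build-cache:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Push both the image and a registry-backed layer cache so downstream repos can reuse it.
      - uses: docker/build-push-action@v3
        with:
          context: ./dev/infra
          push: true
          tags: ghcr.io/apache/spark-ci-infra-cache:master
          cache-from: type=registry,ref=ghcr.io/apache/spark-ci-infra-cache:master
          cache-to: type=registry,ref=ghcr.io/apache/spark-ci-infra-cache:master,mode=max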

Why are the changes needed?

The build jobs run in each user's downstream repo, so we have to use a "registry cache" as a bridge.
The complete flow would be:

  • (apache repo) Build the image cache in the apache repo; this cache is refreshed whenever Dockerfile changes are merged.
  • (user repo) In each PR, build the latest infra image from the image cache plus the PR's Dockerfile changes, and upload it to the user's ghcr.io.
  • (user repo) Use the latest infra image from Step 2 to run the pyspark, sparkr, and lint jobs (see the CLI sketch below).
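For intuition, a rough docker buildx CLI equivalent of this flow (the image names are illustrative):

# Step 1 (apache repo): build the image and publish its layer cache to the registry.
docker buildx build dev/infra \
  --cache-to type=registry,ref=ghcr.io/apache/spark-infra-cache:master,mode=max \
  --tag ghcr.io/apache/spark-infra-cache:master --push

# Step 2 (user repo): rebuild from the PR's Dockerfile, reusing cached layers from the
# apache cache, and push the result to the user's ghcr.io so the test jobs can pull it.
docker buildx build dev/infra \
  --cache-from type=registry,ref=ghcr.io/apache/spark-infra-cache:master \
  --tag ghcr.io/<user>/apache-spark-github-action-image:latest --push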

Does this PR introduce any user-facing change?

No, infra-only change.

How was this patch tested?

CI passed

Co-authored-by: @dongjoon-hyun @HyukjinKwon (copied the original Dockerfile as the initial Dockerfile :))

@Yikun Yikun force-pushed the SPARK-39522 branch 4 times, most recently from 78f48de to 26640b6 Compare June 24, 2022 16:48
@HyukjinKwon
Member

This is amazing! Thanks Yikun!

@HyukjinKwon
Member

Cc @dongjoon-hyun !

@dongjoon-hyun
Member

Thank you for pinging me, @HyukjinKwon .

@Yikun
Member Author

Yikun commented Jun 25, 2022

@HyukjinKwon @dongjoon-hyun Thanks! This is still WIP; I will mark it as ready for review once it's done!

@Yikun Yikun force-pushed the SPARK-39522 branch 3 times, most recently from 49410e2 to 0039fec Compare June 25, 2022 01:52
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 25, 2022

FYI, the master branch linter job is broken and I'm fixing it here.

@Yikun
Member Author

Yikun commented Jun 25, 2022

@dongjoon-hyun Thanks!

@Yikun
Member Author

Yikun commented Jun 25, 2022

Status sync: per the latest build result (https://github.com/Yikun/spark/runs/7051222363?check_suite_focus=true#step:7:127), the cache works.
However, CI failed because the Docker image was updated (Ubuntu 20.04.3 ==> 20.04.4, pypy3.7 ==> pypy3.8, and several Python package upgrades such as numpy):

SPARK-39609 1. ModuleNotFoundError: No module named '_pickle'
Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: /tmp/pypy3__pyspark.sql.tests.test_arrow__jx96qdzs.log)
Traceback (most recent call last):
  File "/usr/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/pypy3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in <module>
    from pyspark.java_gateway import local_connect_and_auth
  File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in <module>
    from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
  File "/__w/spark/spark/python/pyspark/serializers.py", line 68, in <module>
    from pyspark import cloudpickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 4, in <module>
    from pyspark.cloudpickle.cloudpickle import *  # noqa
  File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 57, in <module>
    from .compat import pickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in <module>
    from _pickle import Pickler  # noqa: F401
ModuleNotFoundError: No module named '_pickle'
Had test failures in pyspark.sql.tests.test_arrow with pypy3; see logs.

Building the latest Dockerfile upgrades pypy3 to 3.8 (originally 3.7), but cloudpickle seems to have a bug there. This may be related: cloudpipe/cloudpickle@8bbea3e , but I tried to apply it and it still failed. It needs a deeper look; if you know the reason, please let me know.

SPARK-39610 2. fatal: unsafe repository
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
	git config --global --add safe.directory /__w/spark/spark
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
	git config --global --add safe.directory /__w/spark/spark
Error: Process completed with exit code 128.

https://github.blog/2022-04-12-git-security-vulnerability-announced/
actions/checkout#760

I did a quick fix; a separate PR is needed to address it properly.

    - name: GitHub Actions permissions workaround
      run: |
        git config --global --add safe.directory ${GITHUB_WORKSPACE}
SPARK-39611 3. lint python
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/pandas/frame.py:9970: error: Need type annotation for "raveled_column_labels"  [var-annotated]
Found 1 error in 1 file (checked 337 source files)

This is due to the numpy upgrade; we could pin numpy<=1.22.2 first (a Dockerfile sketch follows).
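A possible pin, sketched as a Dockerfile line (the Python version and install command are assumptions, not the exact dev/infra/Dockerfile content):

# Pin numpy until the mypy annotation and pandas-on-Spark ufunc failures are resolved.
RUN python3.9 -m pip install 'numpy<=1.22.2'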

SPARK-39611 4. NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>
======================================================================
ERROR [2.102s]: test_arithmetic_op_exceptions (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in test_arithmetic_op_exceptions
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex
    return context.handle('assertRaisesRegex', args, kwargs)
  File "/usr/lib/python3.9/unittest/case.py", line 201, in handle
    callable_obj(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in <lambda>
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in __array_ufunc__
    raise NotImplementedError(
NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>.
----------------------------------------------------------------------

This is due to the numpy upgrade as well; we could pin numpy<=1.22.2 first (see the sketch under item 3).

5. R lint error
Loading required namespace: SparkR
Loading required namespace: lintr
Failed with error:  ‘there is no package called ‘lintr’’
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
Error in contrib.url(repos, type) : 
  trying to use CRAN without setting a mirror
Calls: install.packages -> startsWith -> contrib.url
Execution halted

Install lintr? (A Dockerfile sketch follows the log link below.)
https://github.com/Yikun/spark/runs/7052215049?check_suite_focus=true
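A possible fix sketch: pre-install lintr in the image with an explicit CRAN mirror (the mirror URL is an assumption):

# Install lintr at image build time so the lint job does not hit the "no CRAN mirror" error.
RUN Rscript -e "install.packages('lintr', repos='https://cloud.r-project.org/')"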

6. sparkr
Loading required namespace: SparkR
Loading required namespace: lintr
Failed with error:  ‘there is no package called ‘lintr’’
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
Error in contrib.url(repos, type) : 
  trying to use CRAN without setting a mirror
Calls: install.packages -> startsWith -> contrib.url
Execution halted

Install lintr? (Same as item 5.)
https://github.com/Yikun/spark/runs/7052215214?check_suite_focus=true#step:9:10200

  7. sparkr arrow-related case failed:
    https://github.com/Yikun/spark/runs/7043826939?check_suite_focus=true#step:9:10904
    No idea yet.
8. pypy3 v7.3.9 core dump
root@yikun-x86:~/spark# python/run-tests.py --testnames 'pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst'
Running PySpark tests. Output is in /root/spark/python/unit-tests.log
Will test against the following Python executables: ['pypy3']
Will test the following Python tests: ['pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst']
pypy3 python_implementation is PyPy
pypy3 version is: Python 3.7.13 (7e0ae751533460d5f89f3ac48ce366d8642d1db5, Mar 29 2022, 06:03:31)
[PyPy 7.3.9 with GCC 10.2.1 20210130 (Red Hat 10.2.1-11)]
Starting test(pypy3): pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst (temp output: /tmp/pypy3__pyspark.sql.tests.test_dataframe_DataFrameTests.test_create_dataframe_from_pandas_with_dst__5lb00c_c.log)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
test_create_dataframe_from_pandas_with_dst (pyspark.sql.tests.test_dataframe.DataFrameTests) ...
Had test failures in pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst with pypy3; see logs.

(pypy3) root@yikun-x86:~/spark# SPARK_TESTING=1 /root/spark/bin/pyspark pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
test_create_dataframe_from_pandas_with_dst (pyspark.sql.tests.test_dataframe.DataFrameTests) ... Segmentation fault (core dumped)

@bjornjorgensen
Contributor

bjornjorgensen commented Jun 25, 2022

  1. ModuleNotFoundError: No module named '_pickle'

See the Stack Overflow question "ImportError: No module named _pickle"; cloudpickle should import the Pickler class directly from the pickle module, instead of:

from _pickle import Pickler # noqa: F401
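For reference, the upstream fix (cloudpipe/cloudpickle@8bbea3e) switches to the pure-Python pickle module, roughly:

# PyPy has no C-accelerated _pickle extension module, so import Pickler from pickle instead.
from pickle import Pickler  # noqa: F401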

@Yikun
Member Author

Yikun commented Jun 26, 2022

@bjornjorgensen Thanks! Yep, as I mentioned:

This may be related: cloudpipe/cloudpickle@8bbea3e , but I tried to apply it and it still failed.

Let's pin to pypy3.7 first.

@Yikun Yikun force-pushed the SPARK-39522 branch 2 times, most recently from 8133043 to 75320ff Compare June 26, 2022 10:39
@Yikun Yikun changed the title [WIP][SPARK-39522][INFRA] Uses Docker image cache over a custom image [POC][SPARK-39522][INFRA] Uses Docker image cache over a custom image Jun 27, 2022
.github/workflows/build_and_test.yml (reviewed hunk):

     if: fromJson(needs.precondition.outputs.required).pyspark == 'true'
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20220207
+      image: ghcr.io/${{ needs.precondition.outputs.user }}/apache-spark-github-action-image:latest
Member


I think you could use options: --user ${{ needs.preconditions.outputs.os_user }} to avoid the "GitHub Actions permissions workaround" step later,

where os_user is defined earlier as:
echo ::set-output name=os_user::$(id -u)
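A rough sketch of that suggestion (the job/output wiring is illustrative, not the exact build_and_test.yml layout):

container:
  image: ghcr.io/${{ needs.precondition.outputs.user }}/apache-spark-github-action-image:latest
  # Run the container as the host runner's UID so checked-out files are owned by the same user.
  options: --user ${{ needs.precondition.outputs.os_user }}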

Member Author


Looks like the original job didn't use --user, so maybe keep it the same for now.

@Yikun
Member Author

Yikun commented Jun 27, 2022

To make it easier to review and land step by step, I split the PR into:
Step 1: #37003
Step 2: #37005
Step 3: #37006
Step 4: TBD

@Yikun
Member Author

Yikun commented Jul 21, 2022

@Yikun Yikun closed this Jul 21, 2022