
[POC][SPARK-39522][INFRA] Uses Docker image cache over a custom image #36980

Closed

Conversation

Yikun
Member

@Yikun Yikun commented Jun 24, 2022

What changes were proposed in this pull request?

(image attached in the original PR description)

  • .github/workflows/build_infra_images_cache.yml: builds the Docker image cache; this cache is used by the infra-image job.
  • .github/workflows/build_and_test.yml: the infra-image job builds the image in each developer's repo from the latest Dockerfile, using the infra image cache to speed the build up (a workflow sketch follows below).
  • dev/infra/Dockerfile: moves the frequently changing Dockerfile lines to the end to make the most of the cache (see the layer-ordering sketch right after this list).
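To illustrate the layer-ordering idea, here is a minimal Dockerfile sketch (the base image and packages are illustrative, not the actual contents of dev/infra/Dockerfile):

# Rarely changing, expensive layers come first so they stay cached across PRs.
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y build-essential r-base

# Frequently changing lines (e.g. Python package pins) come last, so a change here
# only invalidates the tail of the cache instead of forcing a full rebuild.
RUN python3 -m pip install numpy pandas pyarrow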

See more in: https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
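As a rough illustration of how the cache-building workflow could look (the trigger, job name, and image tags below are assumptions, not the exact contents of build_infra_images_cache.yml):

name: Build infra image cache
on:
  push:
    branches: [master]
    paths: ['dev/infra/Dockerfile']
jobs:
  build-cache:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v3
      - uses: docker/setup-buildx-action@v2
      - uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Push both the image and a registry-backed layer cache so downstream repos can reuse it.
      - uses: docker/build-push-action@v3
        with:
          context: ./dev/infra
          push: true
          tags: ghcr.io/apache/spark-ci-infra-cache:master
          cache-from: type=registry,ref=ghcr.io/apache/spark-ci-infra-cache:master
          cache-to: type=registry,ref=ghcr.io/apache/spark-ci-infra-cache:master,mode=max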

Why are the changes needed?

The build jobs run in each user's downstream repo, so we have to use a "registry cache" as a bridge.
The complete flow would be:

  • (apache repo) Build the image cache in the apache repo; this cache is refreshed whenever Dockerfile changes are merged.
  • (user repo) In each PR, build the latest infra image from the image cache plus the PR's Dockerfile changes, and upload it to the user's ghcr.io.
  • (user repo) Use the latest infra image from Step 2 to run the pyspark, sparkr, and lint jobs (see the CLI sketch below).
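For intuition, a rough docker buildx CLI equivalent of this flow (the image names are illustrative):

# Step 1 (apache repo): build the image and publish its layer cache to the registry.
docker buildx build dev/infra \
  --cache-to type=registry,ref=ghcr.io/apache/spark-infra-cache:master,mode=max \
  --tag ghcr.io/apache/spark-infra-cache:master --push

# Step 2 (user repo): rebuild from the PR's Dockerfile, reusing cached layers from the
# apache cache, and push the result to the user's ghcr.io so the test jobs can pull it.
docker buildx build dev/infra \
  --cache-from type=registry,ref=ghcr.io/apache/spark-infra-cache:master \
  --tag ghcr.io/<user>/apache-spark-github-action-image:latest --push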

Does this PR introduce any user-facing change?

No, infra-only change.

How was this patch tested?

CI passed

Co-authored-by: @dongjoon-hyun @HyukjinKwon (copied the original Dockerfile as the initial Dockerfile :))

@Yikun Yikun force-pushed the SPARK-39522 branch 4 times, most recently from 78f48de to 26640b6 Compare June 24, 2022 16:48
@HyukjinKwon
Member

This is amazing! Thanks Yikun!

@HyukjinKwon
Member

Cc @dongjoon-hyun !

@dongjoon-hyun
Member

Thank you for pinging me, @HyukjinKwon .

@Yikun
Member Author

Yikun commented Jun 25, 2022

@HyukjinKwon @dongjoon-hyun Thanks! This is still WIP; I will mark it as ready for review once it's done!

@Yikun Yikun force-pushed the SPARK-39522 branch 3 times, most recently from 49410e2 to 0039fec Compare June 25, 2022 01:52
@dongjoon-hyun
Member

dongjoon-hyun commented Jun 25, 2022

FYI, the master branch linter job is broken and I'm fixing it here.

@Yikun
Member Author

Yikun commented Jun 25, 2022

@dongjoon-hyun Thanks!

@Yikun
Member Author

Yikun commented Jun 25, 2022

Status sync: per the latest build result (https://github.com/Yikun/spark/runs/7051222363?check_suite_focus=true#step:7:127), the cache works.
However, CI failed because the Docker image was updated (Ubuntu 20.04.3 ==> 20.04.4, pypy3.7 ==> pypy3.8, and several Python package upgrades such as numpy):

SPARK-39609 1. ModuleNotFoundError: No module named '_pickle'
Starting test(pypy3): pyspark.sql.tests.test_arrow (temp output: /tmp/pypy3__pyspark.sql.tests.test_arrow__jx96qdzs.log)
Traceback (most recent call last):
  File "/usr/lib/pypy3.8/runpy.py", line 188, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/pypy3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/__w/spark/spark/python/pyspark/__init__.py", line 59, in <module>
    from pyspark.rdd import RDD, RDDBarrier
  File "/__w/spark/spark/python/pyspark/rdd.py", line 54, in <module>
    from pyspark.java_gateway import local_connect_and_auth
  File "/__w/spark/spark/python/pyspark/java_gateway.py", line 32, in <module>
    from pyspark.serializers import read_int, write_with_length, UTF8Deserializer
  File "/__w/spark/spark/python/pyspark/serializers.py", line 68, in <module>
    from pyspark import cloudpickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/__init__.py", line 4, in <module>
    from pyspark.cloudpickle.cloudpickle import *  # noqa
  File "/__w/spark/spark/python/pyspark/cloudpickle/cloudpickle.py", line 57, in <module>
    from .compat import pickle
  File "/__w/spark/spark/python/pyspark/cloudpickle/compat.py", line 13, in <module>
    from _pickle import Pickler  # noqa: F401
ModuleNotFoundError: No module named '_pickle'
Had test failures in pyspark.sql.tests.test_arrow with pypy3; see logs.

Building the latest Dockerfile upgrades pypy3 to 3.8 (originally 3.7), but cloudpickle seems to have a bug there. This may be related: cloudpipe/cloudpickle@8bbea3e , but I tried to apply it and it still failed. It needs a deeper look; if you know the reason, please let me know.

SPARK-39610 2. fatal: unsafe repository
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
	git config --global --add safe.directory /__w/spark/spark
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
	git config --global --add safe.directory /__w/spark/spark
Error: Process completed with exit code 128.

https://github.blog/2022-04-12-git-security-vulnerability-announced/
actions/checkout#760

I did a quick fix; a separate PR is needed to address it properly.

    - name: GitHub Actions permissions workaround
      run: |
        git config --global --add safe.directory ${GITHUB_WORKSPACE}
SPARK-39611 3. lint python
starting mypy annotations test...
annotations failed mypy checks:
python/pyspark/pandas/frame.py:9970: error: Need type annotation for "raveled_column_labels"  [var-annotated]
Found 1 error in 1 file (checked 337 source files)

This is due to the numpy upgrade; we could pin numpy<=1.22.2 first (a Dockerfile sketch follows).
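A possible pin, sketched as a Dockerfile line (the Python version and install command are assumptions, not the exact dev/infra/Dockerfile content):

# Pin numpy until the mypy annotation and pandas-on-Spark ufunc failures are resolved.
RUN python3.9 -m pip install 'numpy<=1.22.2'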

SPARK-39611 4. NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>
======================================================================
ERROR [2.102s]: test_arithmetic_op_exceptions (pyspark.pandas.tests.test_series_datetime.SeriesDateTimeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in test_arithmetic_op_exceptions
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/usr/lib/python3.9/unittest/case.py", line 1276, in assertRaisesRegex
    return context.handle('assertRaisesRegex', args, kwargs)
  File "/usr/lib/python3.9/unittest/case.py", line 201, in handle
    callable_obj(*args, **kwargs)
  File "/__w/spark/spark/python/pyspark/pandas/tests/test_series_datetime.py", line 99, in <lambda>
    self.assertRaisesRegex(TypeError, expected_err_msg, lambda: other / psser)
  File "/__w/spark/spark/python/pyspark/pandas/base.py", line 465, in __array_ufunc__
    raise NotImplementedError(
NotImplementedError: pandas-on-Spark objects currently do not support <ufunc 'divide'>.
----------------------------------------------------------------------

This is due to the numpy upgrade as well; we could pin numpy<=1.22.2 first (see the sketch under item 3).

5. R lint error
Loading required namespace: SparkR
Loading required namespace: lintr
Failed with error:  ‘there is no package called ‘lintr’’
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
Error in contrib.url(repos, type) : 
  trying to use CRAN without setting a mirror
Calls: install.packages -> startsWith -> contrib.url
Execution halted

Install lintr? (A Dockerfile sketch follows the log link below.)
https://github.com/Yikun/spark/runs/7052215049?check_suite_focus=true
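A possible fix sketch: pre-install lintr in the image with an explicit CRAN mirror (the mirror URL is an assumption):

# Install lintr at image build time so the lint job does not hit the "no CRAN mirror" error.
RUN Rscript -e "install.packages('lintr', repos='https://cloud.r-project.org/')"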

6. sparkr
Loading required namespace: SparkR
Loading required namespace: lintr
Failed with error:  ‘there is no package called ‘lintr’’
Installing package into ‘/usr/lib/R/site-library’
(as ‘lib’ is unspecified)
Error in contrib.url(repos, type) : 
  trying to use CRAN without setting a mirror
Calls: install.packages -> startsWith -> contrib.url
Execution halted

Install lintr? (Same as item 5.)
https://github.com/Yikun/spark/runs/7052215214?check_suite_focus=true#step:9:10200

  7. sparkr arrow-related case failed:
    https://github.com/Yikun/spark/runs/7043826939?check_suite_focus=true#step:9:10904
    No idea yet.
8. pypy3 v7.3.9 core dump
root@yikun-x86:~/spark# python/run-tests.py --testnames 'pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst'
Running PySpark tests. Output is in /root/spark/python/unit-tests.log
Will test against the following Python executables: ['pypy3']
Will test the following Python tests: ['pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst']
pypy3 python_implementation is PyPy
pypy3 version is: Python 3.7.13 (7e0ae751533460d5f89f3ac48ce366d8642d1db5, Mar 29 2022, 06:03:31)
[PyPy 7.3.9 with GCC 10.2.1 20210130 (Red Hat 10.2.1-11)]
Starting test(pypy3): pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst (temp output: /tmp/pypy3__pyspark.sql.tests.test_dataframe_DataFrameTests.test_create_dataframe_from_pandas_with_dst__5lb00c_c.log)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
test_create_dataframe_from_pandas_with_dst (pyspark.sql.tests.test_dataframe.DataFrameTests) ...
Had test failures in pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst with pypy3; see logs.

(pypy3) root@yikun-x86:~/spark# SPARK_TESTING=1 /root/spark/bin/pyspark pyspark.sql.tests.test_dataframe DataFrameTests.test_create_dataframe_from_pandas_with_dst
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
test_create_dataframe_from_pandas_with_dst (pyspark.sql.tests.test_dataframe.DataFrameTests) ... Segmentation fault (core dumped)

@bjornjorgensen
Contributor

bjornjorgensen commented Jun 25, 2022

  1. ModuleNotFoundError: No module named '_pickle'

See the Stack Overflow question "ImportError: No module named _pickle"; cloudpickle should import the Pickler class directly from the pickle module, instead of:

from _pickle import Pickler # noqa: F401
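For reference, the upstream fix (cloudpipe/cloudpickle@8bbea3e) switches to the pure-Python pickle module, roughly:

# PyPy has no C-accelerated _pickle extension module, so import Pickler from pickle instead.
from pickle import Pickler  # noqa: F401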

@Yikun
Member Author

Yikun commented Jun 26, 2022

@bjornjorgensen Thanks! Yep, as I mentioned:

This may be related: cloudpipe/cloudpickle@8bbea3e , but I tried to apply it and it still failed.

Let's pin to pypy3.7 first.

@Yikun Yikun force-pushed the SPARK-39522 branch 2 times, most recently from 8133043 to 75320ff Compare June 26, 2022 10:39
@Yikun Yikun changed the title [WIP][SPARK-39522][INFRA] Uses Docker image cache over a custom image [POC][SPARK-39522][INFRA] Uses Docker image cache over a custom image Jun 27, 2022
.github/workflows/build_and_test.yml (reviewed hunk):

     if: fromJson(needs.precondition.outputs.required).pyspark == 'true'
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20220207
+      image: ghcr.io/${{ needs.precondition.outputs.user }}/apache-spark-github-action-image:latest
Member


I think you could use options: --user ${{ needs.preconditions.outputs.os_user }} to avoid the "GitHub Actions permissions workaround" step later,

where os_user is defined earlier as:
echo ::set-output name=os_user::$(id -u)
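A rough sketch of that suggestion (the job/output wiring is illustrative, not the exact build_and_test.yml layout):

container:
  image: ghcr.io/${{ needs.precondition.outputs.user }}/apache-spark-github-action-image:latest
  # Run the container as the host runner's UID so checked-out files are owned by the same user.
  options: --user ${{ needs.precondition.outputs.os_user }}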

Member Author


Looks like the original job didn't use --user, so maybe keep it the same for now.

@Yikun
Member Author

Yikun commented Jun 27, 2022

To make it easier to review and land step by step, I split the PR into:
Step 1: #37003
Step 2: #37005
Step 3: #37006
Step 4: TBD

@Yikun
Member Author

Yikun commented Jul 21, 2022

@Yikun Yikun closed this Jul 21, 2022