[POC][SPARK-39522][INFRA] Uses Docker image cache over a custom image #36980
Conversation
This is amazing! Thanks, Yikun!

cc @dongjoon-hyun

Thank you for pinging me, @HyukjinKwon.

@HyukjinKwon @dongjoon-hyun Thanks! This is still WIP; I will mark it ready for review once it's ready.
FYI, the master branch linter job is broken and I'm fixing it here.

@dongjoon-hyun Thanks!
Information sync: from the latest build result (https://github.com/Yikun/spark/runs/7051222363?check_suite_focus=true#step:7:127), the cache works. Remaining issues:

1. SPARK-39609: `ModuleNotFoundError: No module named '_pickle'`. Building the latest dockerfile upgrades pypy3 to 3.8 (originally 3.7), but it seems cloudpickle has a bug. This may be related: cloudpipe/cloudpickle@8bbea3e, but I tried to apply it and that also failed. This needs a deeper look; if you know the reason, please let me know.
2. SPARK-39610: `fatal: unsafe repository` (see https://github.blog/2022-04-12-git-security-vulnerability-announced/). I did a quick fix; a separate PR is needed to address it properly:
   ```yaml
   - name: Github Actions permissions workaround
     run: |
       git config --global --add safe.directory ${GITHUB_WORKSPACE}
   ```
3. SPARK-39611: lint python, due to SPARK-39611
4. `NotImplementedError: pandas-on-Spark objects currently do not support`, due to
5. R lint error, install
6. sparkr, install
8. pypy3 v7.3.9 core dump
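The `safe.directory` workaround in item 2 can be exercised locally. A minimal sketch, where `/tmp/demo-workspace` is a hypothetical stand-in for `${GITHUB_WORKSPACE}`:

```shell
# Stand-in for ${GITHUB_WORKSPACE}; a repository owned by a different
# UID than the one running git triggers "fatal: unsafe repository".
REPO_DIR="/tmp/demo-workspace"
mkdir -p "${REPO_DIR}"

# Mark the directory as safe so git commands run by the container user
# do not refuse to operate on it.
git config --global --add safe.directory "${REPO_DIR}"

# Confirm the entry was recorded in the global config.
git config --global --get-all safe.directory
```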
SO: `ImportError: No module named _pickle`, and cloudpickle should import the `Pickler` class directly from the `pickle` module.
@bjornjorgensen Thanks, yep, as I mentioned: let's first pin to pypy3.7.
.github/workflows/build_and_test.yml (outdated)
```diff
     if: fromJson(needs.precondition.outputs.required).pyspark == 'true'
     name: "Build modules: ${{ matrix.modules }}"
     runs-on: ubuntu-20.04
     container:
-      image: dongjoon/apache-spark-github-action-image:20220207
+      image: ghcr.io/${{ needs.precondition.outputs.user }}/apache-spark-github-action-image:latest
```
I think you could use `options: --user ${{ needs.preconditions.outputs.os_user }}` to avoid the "Github Actions permissions workaround" steps later, where `os_user` is defined earlier as:
```shell
echo ::set-output name=os_user::$(id -u)
```
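A standalone sketch of what that suggestion captures; the variable name `os_user` follows the comment above, and nothing here is taken from the actual workflow:

```shell
# Capture the UID of the user running the job, as suggested.
os_user=$(id -u)

# In a workflow step this value would be exported for later jobs, e.g.
#   echo "::set-output name=os_user::${os_user}"
# and then consumed by the container job as:
#   options: --user ${{ needs.precondition.outputs.os_user }}
echo "os_user=${os_user}"
```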
Looks like the original job didn't use a user, so maybe just keep it the same for now.

Done, see more in: https://lists.apache.org/thread/3c02qg9p057ombz0vlohrgckfxlsqm8n
What changes were proposed in this pull request?
See more in: https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0
Why are the changes needed?
The build jobs run in each user's downstream repo, so we have to use a "registry cache" as a bridge.
The complete flow would be:
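As one hedged illustration of a registry-cache-based flow (the action and its `cache-from`/`cache-to` parameters are real, but the image name and registry ref below are placeholders, not the values this PR uses):

```yaml
# Hypothetical workflow step: build the CI image in the user's fork,
# reading and writing layer cache through a registry so later builds
# (and other forks' builds of the same layers) can reuse it.
- name: Build image with registry cache
  uses: docker/build-push-action@v3
  with:
    context: .
    push: false
    cache-from: type=registry,ref=ghcr.io/${{ github.actor }}/spark-ci-cache
    cache-to: type=registry,ref=ghcr.io/${{ github.actor }}/spark-ci-cache,mode=max
```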
Does this PR introduce any user-facing change?
No, infra only
How was this patch tested?
CI passed
Co-authored-by: @dongjoon-hyun @HyukjinKwon (copied the original dockerfile as the initial dockerfile :))