Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39522][INFRA]Uses Docker image cache over a custom image in pyspark job #37005

Closed
wants to merge 4 commits into from

Conversation

Yikun
Copy link
Member

@Yikun Yikun commented Jun 27, 2022

What changes were proposed in this pull request?

image

Change pyspark container from original static image to just-in-time build image from cache.

See also: https://docs.google.com/document/d/1_uiId-U1DODYyYZejAZeyz2OAjxcnA-xfwjynDF6vd0

This patch has 2 changes:

  1. Add a infra-image job to build ci image from cache images
  2. Use the ci image as pyspark job container image, only master branch enable the new workflow now.

Why are the changes needed?

Help to speed up docker infra image build in each PR.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI passed

dev/infra/Dockerfile Outdated Show resolved Hide resolved
HyukjinKwon pushed a commit that referenced this pull request Jul 5, 2022
… for container based job

### What changes were proposed in this pull request?
This patch add GITHUB_WORKSPACE to git trust safe.directory for container based job.

There are 3 container based job in Spark Infra:
- sparkr
- lint
- pyspark

### Why are the changes needed?

When upgrade `git >= 2.35.2` (such as latest dev docker image in my case), fix a [CVE-2022-24765](https://github.blog/2022-04-12-git-security-vulnerability-announced/#cve-2022-24765) , and it has been backported [latest ubuntu git](actions/checkout#760 (comment))

The `GITHUB_WORKSPACE` is belong to action user (in host), but the container user is `root` (in container) (root is required in our spark job), so cause the error like:
```
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
    git config --global --add safe.directory /__w/spark/spark
fatal: unsafe repository ('/__w/spark/spark' is owned by someone else)
To add an exception for this directory, call:
    git config --global --add safe.directory /__w/spark/spark
```

The solution is from `action/checkout` actions/checkout#760 , just add `GITHUB_WORKSPACE` to git trust safe.directory to make container user can checkout successfully.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed.
CI passed with latest ubuntu hosted runner, here is a simple e2e test for pyspark job #37005

Closes #37079 from Yikun/SPARK-39610.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@Yikun Yikun force-pushed the SPARK-39522-pyspark branch 3 times, most recently from 509e126 to d63d979 Compare July 6, 2022 01:37
@Yikun Yikun force-pushed the SPARK-39522-pyspark branch 7 times, most recently from 96db6b3 to 764923e Compare July 6, 2022 09:54
@Yikun
Copy link
Member Author

Yikun commented Jul 6, 2022

  • recompute is not blocked by the recovery *** FAILED *** (1 minute, 6 seconds)

Failure due to unrelated error, I will do a retrigger after #37103 merged.

@Yikun
Copy link
Member Author

Yikun commented Jul 7, 2022

Retriggered: https://github.com/Yikun/spark/runs/7225401037?check_suite_focus=true , infra image works well, let's see pyspark job. then will ready for review.

-X DELETE
-H "Accept: application/vnd.github+json"
-H "Authorization: token ${{ secrets.GHCR_DEL }}"
https://api.github.com/user/packages/container/${{ needs.infra-image.outputs.image_name }}
Copy link
Member Author

@Yikun Yikun Jul 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's:

So, I think apache-spark-ci-image:${{ inputs.branch }}-${{ github.run_id }} (diff tag for each pr) without any cleanup better than apache-spark-ci-image-${{ inputs.branch }}-${{ github.run_id }} (diff packages for each pr) by auto deletion.

If developers want to delete package for some reason, they can also easy delete complete image in github page.

@HyukjinKwon So, I decided to remove infra-image-post and back to use apache-spark-ci-image:tag without deletion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explanation.

@Yikun
Copy link
Member Author

Yikun commented Jul 7, 2022

cc @HyukjinKwon @dongjoon-hyun

Ready for review, Thanks!

@Yikun Yikun marked this pull request as ready for review July 7, 2022 15:07
uses: docker/login-action@v2
with:
registry: ghcr.io
username: ${{ github.actor }}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: need a test for schedule job before ready to go

Copy link
Member Author

@Yikun Yikun Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case 1: The developer push a commit: ==> OK (CI passed!)

  • Login: github.actor: yikun
  • CI Image URL: yikun/apache-spark-ci-image:master-xxx

Case 2: Merge a commit in apache/spark master: ==> should OK
(committer need has right to push apache package) According to link it's inherited from repo right.

  • Login: ${{ github.actor }}: such as HyukjinKwon (who merge the commit)
  • CI Image URL: apache/apache-spark-ci-image:master-xxx
  • local test: login , image

Case 3: Scheduler job (master): ==> should OK
(The latest user who create the schedule trigger need has right to push apache package) same rights wtih case2

Case 4: Scheduler job (branch-xxx) ==> OK

  • Use branches image cache directly (dongjoon/apache-spark-github-action-image:20220207, future is spark/apache-spark-github-action-image-cache:branches-*)

Case 5: The developer push a commit in branches: ==> OK
Case 6: The apache/spark merge a commit in branches: ==> OK

  • use branches workflow, no change by this pr
  • When cut branches, will same as case 1 & case 2

@Yikun Yikun marked this pull request as draft July 8, 2022 05:01
@Yikun Yikun marked this pull request as ready for review July 8, 2022 06:53
@Yikun
Copy link
Member Author

Yikun commented Jul 8, 2022

Does all spark committers has right to push apache ghcr package (case 2, case 3)? According to cache image, I can confirm @HyukjinKwon has rights, but not sure any different with other commiters.

[1] https://docs.github.com/en/packages/learn-github-packages/configuring-a-packages-access-control-and-visibility#inheriting-access-for-a-container-image-from-a-repository

@HyukjinKwon
Copy link
Member

Yes, all other committers have the same permission with me.

@HyukjinKwon
Copy link
Member

Let's get this in and see how it gose.

@HyukjinKwon
Copy link
Member

Merged to master.

@Yikun
Copy link
Member Author

Yikun commented Jul 8, 2022

Let's get this in and see how it gose.

Sure, I will monitor in recent days. and update results here:

fetch-depth: 0
repository: apache/spark
ref: ${{ inputs.branch }}
- name: Sync the current branch with the latest in Apache Spark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this step.

Copy link
Member Author

@Yikun Yikun Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a PR changes the dockerfile and merged, this will help to update the dockerfile in follow PR...so, I think it's necessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh okay you're correct!

@@ -251,13 +251,73 @@ jobs:
name: unit-tests-log-${{ matrix.modules }}-${{ matrix.comment }}-${{ matrix.java }}-${{ matrix.hadoop }}-${{ matrix.hive }}
path: "**/target/unit-tests.log"

pyspark:
infra-image:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, let's add a name here to make it look pretty (e.g., Base image build)

@HyukjinKwon
Copy link
Member

Feel free to make a followup or a separate PR with a separate JIRA 👍

HyukjinKwon pushed a commit that referenced this pull request Jul 8, 2022
### What changes were proposed in this pull request?
Add name for infra-image and cleanup nits and fix the inline format

### Why are the changes needed?
Address comments in #37005 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #37131 from Yikun/SPARK-39718.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants