Separate arm64 and amd64 docker builds #125617

Closed
wants to merge 8 commits into from

Conversation

atalman

@atalman atalman commented May 6, 2024

Fixes #125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the Docker image not being available on GitLab:

docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found

https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811
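
To make the failing image reference easier to reason about, here is a minimal Python sketch of how such a name breaks down into its parts. The helper name `cuda_base_image` and the naming pattern are assumptions inferred only from the single image name in the error message above; this is not taken from the PyTorch build scripts.

```python
def cuda_base_image(cuda_full_version: str, cudnn_version: str,
                    image_type: str, distro: str = "ubuntu22.04") -> str:
    """Compose an NVIDIA CUDA base image reference (hypothetical helper).

    The pattern is inferred from the one image name quoted in the error
    message; other CUDA/cuDNN combinations may be named differently.
    """
    return (f"docker.io/nvidia/cuda:{cuda_full_version}"
            f"-cudnn{cudnn_version}-{image_type}-{distro}")


# Rebuild the reference that the failing job tried to pull.
print(cuda_base_image("12.4.0", "8", "devel"))
```

Spelling the reference out this way shows that a missing upstream tag for one CUDA/cuDNN/image-type combination fails only the matching build.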

@atalman atalman requested a review from a team as a code owner May 6, 2024 18:46
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 6, 2024

pytorch-bot bot commented May 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125617

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 4 Unrelated Failures

As of commit fd3c207 with merge base bb668c6 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"

# Please note, here we need to pin a specific version of CUDA, as with the latest label
if [[ ${CUDA_VERSION_SHORT} == "12.1" ]]; then

@atalman atalman May 6, 2024

Do we need the latest label at all here? Maybe we can simply remove it. I see the following stats:
https://github.com/orgs/pytorch/packages/container/pytorch-nightly/212425390?tag=latest

Download activity
Total downloads: 0
Last 30 days: 0
Last week: 0
Today: 0
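
The tagging rule under discussion in the diff above can be sketched in Python as follows. This is only an illustration: the function name `nightly_tags` is hypothetical, and attaching `latest` to the same repository when the short CUDA version is "12.1" is an assumption based on the quoted `if` condition, not a copy of the workflow.

```python
REGISTRY_IMAGE = "ghcr.io/pytorch/pytorch-nightly"  # repository named in the diff
LATEST_CUDA_VERSION = "12.1"  # version pinned by the quoted condition


def nightly_tags(commit: str, cuda_suffix: str, cuda_version_short: str) -> list[str]:
    """Return the image references to push for one build (hypothetical helper)."""
    # Every build gets a tag made of the nightly commit plus the CUDA suffix.
    tags = [f"{REGISTRY_IMAGE}:{commit}{cuda_suffix}"]
    # Assumption: only the pinned CUDA version also receives the latest label,
    # so that latest always points at exactly one image.
    if cuda_version_short == LATEST_CUDA_VERSION:
        tags.append(f"{REGISTRY_IMAGE}:latest")
    return tags


print(nightly_tags("2.4.0.dev20240507", "-cuda12.1-cudnn8-runtime", "12.1"))
print(nightly_tags("2.4.0.dev20240507", "-cuda12.4-cudnn8-runtime", "12.4"))
```

Without the pin, every CUDA variant would push `latest` in turn and the label would end up on whichever build finished last.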


@huydhn huydhn left a comment

Please fix lint before landing

@atalman

atalman commented May 7, 2024

@pytorchmergebot merge -f "Failures are unrelated"

@pytorchmergebot

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

atalman added a commit to pytorch/test-infra that referenced this pull request May 7, 2024
As a followup after: pytorch/pytorch#125617
Add validation_runner output param to know what validation runner to use
Test:
```
python tools/scripts/generate_docker_release_matrix.py
{"include": [{"cuda": "11.8", "cuda_full_version": "11.8.0", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda11.8-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "11.8", "cuda_full_version": "11.8.0", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda11.8-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "12.1", "cuda_full_version": "12.1.1", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.1-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "12.1", "cuda_full_version": "12.1.1", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.1-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "12.4", "cuda_full_version": "12.4.0", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.4-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "12.4", "cuda_full_version": "12.4.0", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.4-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"}, {"cuda": "cpu", "cuda_full_version": "", "cudnn_version": "", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-runtime", "platform": "linux/arm64", "validation_runner": "linux.arm64.2xlarge"}]}
```
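
The relationship between `platform` and `validation_runner` in the output above can be sketched in Python as a simple lookup. The names `VALIDATION_RUNNERS` and `matrix_entry` are hypothetical, and the mapping is read off the sample output; this is not the actual code of `generate_docker_release_matrix.py`.

```python
import json

# Platform -> runner label, as observed in the sample matrix output.
VALIDATION_RUNNERS = {
    "linux/amd64": "linux.g5.4xlarge.nvidia.gpu",
    "linux/arm64": "linux.arm64.2xlarge",
}


def matrix_entry(docker: str, platform: str) -> dict:
    """Build one matrix entry with its validation runner (hypothetical helper)."""
    return {
        "docker": docker,
        "platform": platform,
        # A KeyError here means a platform was added without choosing a runner.
        "validation_runner": VALIDATION_RUNNERS[platform],
    }


matrix = {
    "include": [
        matrix_entry(
            "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.1-cudnn8-runtime",
            "linux/amd64",
        ),
        matrix_entry(
            "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-runtime",
            "linux/arm64",
        ),
    ]
}
print(json.dumps(matrix))
```

Emitting the runner with each entry lets the validation job select its machine from the matrix, so an arm64 image is not scheduled on an amd64 GPU runner.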
@atalman

atalman commented May 13, 2024

@pytorchbot cherry-pick --onto release/2.3 -c critical

pytorchbot pushed a commit that referenced this pull request May 13, 2024
Fixes #125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the Docker image not being available on GitLab:
```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```
 https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811
Pull Request resolved: #125617
Approved by: https://github.com/huydhn, https://github.com/malfet

(cherry picked from commit b29d77b)
@pytorchbot

Cherry picking #125617

The cherry pick PR is at #126099 and it is recommended to link a critical cherry pick PR with an issue

Details for Dev Infra team Raised by workflow job

huydhn pushed a commit that referenced this pull request May 13, 2024
Separate arm64 and amd64 docker builds (#125617)

Fixes #125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the Docker image not being available on GitLab:
```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```
 https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811
Pull Request resolved: #125617
Approved by: https://github.com/huydhn, https://github.com/malfet

(cherry picked from commit b29d77b)

Co-authored-by: atalman <atalman@fb.com>
Development

Successfully merging this pull request may close these issues.

Broken Docker Image on dockerhub
5 participants