Separate arm64 and amd64 docker builds #125617

atalman · 2024-05-06T18:46:17Z

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the docker image not being available on gitlab:

docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found

https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811
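
For readers following along, here is a minimal sketch of how the missing base image reference lines up with the build parameters in the failing job name (12.4, 12.4.0, 8, devel). The function name `cuda_base_image` and the `ubuntu_version` default are hypothetical, and the naming pattern is inferred only from the error message above, not taken from the actual workflow.

```python
def cuda_base_image(
    cuda_full_version: str,
    cudnn_version: str,
    image_type: str,
    ubuntu_version: str = "22.04",  # assumption: taken from the tag in the error message
) -> str:
    """Compose an nvidia/cuda image reference (pattern assumed from the error above)."""
    return (
        f"docker.io/nvidia/cuda:{cuda_full_version}"
        f"-cudnn{cudnn_version}-{image_type}-ubuntu{ubuntu_version}"
    )


# The combination reported as "not found" in the failing job.
print(cuda_base_image("12.4.0", "8", "devel"))
```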

pytorch-bot · 2024-05-06T18:46:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125617

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 4 Unrelated Failures

As of commit fd3c207 with merge base bb668c6:

NEW FAILURES - The following jobs have failed:

Build Official Docker Images / build (12.4, 12.4.0, 8, devel, linux/amd64) (gh)
Process completed with exit code 2.
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu) (gh)
test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_complex128
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh)
inductor/test_pad_mm.py::PadMMTest::test_cat_pad_mm_dyn_m

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_batch_norm_2d_2_dynamic_shapes_cuda
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_adaptive_avg_pool2d1_dynamic_shapes_cuda
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sdpa_unaligned_mask_dynamic_shapes_cpu

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d6_dynamic_shapes_cuda

This comment was automatically generated by Dr. CI and updates every 15 minutes.

atalman · 2024-05-06T19:59:25Z

.github/workflows/docker-release.yml

+          docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
+
+          # Please note, here we need to pin specific version of CUDA as with latest label
+          if [[ ${CUDA_VERSION_SHORT} == "12.1" ]]; then


Do we need the latest label here at all? Maybe we can simply remove it. I see the following stats:
https://github.com/orgs/pytorch/packages/container/pytorch-nightly/212425390?tag=latest

Download activity: Total downloads 0, Last 30 days 0, Last week 0, Today 0
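
To make the pinning question concrete, here is a rough sketch of the tag selection the diff excerpt appears to implement. The body of the `if` is not shown in the excerpt, so pushing a `latest` tag inside it is an assumption; `tags_to_push`, `pinned_cuda`, and the sample arguments are hypothetical names and values used only for illustration.

```python
from typing import List

REPO = "ghcr.io/pytorch/pytorch-nightly"


def tags_to_push(
    nightly_commit: str,
    cuda_suffix: str,
    cuda_version_short: str,
    pinned_cuda: str = "12.1",  # version compared against in the diff excerpt
) -> List[str]:
    """Return the image tags one build would push."""
    # Every build pushes its commit-specific tag, as in the `docker push` line.
    tags = [f"{REPO}:{nightly_commit}{cuda_suffix}"]
    # Assumption: only the pinned CUDA build also moves the `latest` tag.
    if cuda_version_short == pinned_cuda:
        tags.append(f"{REPO}:latest")
    return tags


# Hypothetical inputs: only the 12.1 build yields a second tag.
print(tags_to_push("abc1234", "-cu121", "12.1"))
print(tags_to_push("abc1234", "-cu124", "12.4"))
```

Dropping the `latest` label, as suggested above, would amount to removing the conditional branch so that each build pushes only its commit-specific tag.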

huydhn

Plz fix lint before landing

atalman · 2024-05-07T11:47:55Z

@pytorchmergebot merge -f "Failures are unrelated"

pytorchmergebot · 2024-05-07T11:50:36Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

As a followup after: pytorch/pytorch#125617

Add validation_runner output param to know what validation runner to use

Test:

```
python tools/scripts/generate_docker_release_matrix.py
{"include": [
{"cuda": "11.8", "cuda_full_version": "11.8.0", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda11.8-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "11.8", "cuda_full_version": "11.8.0", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda11.8-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "12.1", "cuda_full_version": "12.1.1", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.1-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "12.1", "cuda_full_version": "12.1.1", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.1-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "12.4", "cuda_full_version": "12.4.0", "cudnn_version": "8", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.4-cudnn8-runtime", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "12.4", "cuda_full_version": "12.4.0", "cudnn_version": "8", "image_type": "devel", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-cuda12.4-cudnn8-devel", "platform": "linux/amd64", "validation_runner": "linux.g5.4xlarge.nvidia.gpu"},
{"cuda": "cpu", "cuda_full_version": "", "cudnn_version": "", "image_type": "runtime", "docker": "ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240507-runtime", "platform": "linux/arm64", "validation_runner": "linux.arm64.2xlarge"}
]}
```
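
As an illustration of the output shape above, here is a minimal sketch of a matrix generator that attaches a `validation_runner` per platform. This is not the actual `generate_docker_release_matrix.py`; the constants, the `generate_matrix` function, and its `version` parameter are hypothetical, with values copied from the test output above.

```python
import json
from typing import Dict, List

# Values copied from the test output above; the real script's layout is not
# shown in this thread, so this structure is an assumption.
REPO = "ghcr.io/pytorch/pytorch-nightly"
CUDA_VERSIONS = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDNN_VERSION = "8"
IMAGE_TYPES = ["runtime", "devel"]
VALIDATION_RUNNERS = {
    "linux/amd64": "linux.g5.4xlarge.nvidia.gpu",
    "linux/arm64": "linux.arm64.2xlarge",
}


def generate_matrix(version: str) -> Dict[str, List[Dict[str, str]]]:
    """Build one amd64 entry per CUDA version and image type, plus one arm64 CPU entry."""
    include: List[Dict[str, str]] = []
    for cuda, cuda_full in CUDA_VERSIONS.items():
        for image_type in IMAGE_TYPES:
            include.append({
                "cuda": cuda,
                "cuda_full_version": cuda_full,
                "cudnn_version": CUDNN_VERSION,
                "image_type": image_type,
                "docker": f"{REPO}:{version}-cuda{cuda}-cudnn{CUDNN_VERSION}-{image_type}",
                "platform": "linux/amd64",
                "validation_runner": VALIDATION_RUNNERS["linux/amd64"],
            })
    # arm64 is built separately and carries no CUDA or cuDNN version.
    include.append({
        "cuda": "cpu",
        "cuda_full_version": "",
        "cudnn_version": "",
        "image_type": "runtime",
        "docker": f"{REPO}:{version}-runtime",
        "platform": "linux/arm64",
        "validation_runner": VALIDATION_RUNNERS["linux/arm64"],
    })
    return {"include": include}


print(json.dumps(generate_matrix("2.4.0.dev20240507")))
```

Keeping the runner lookup keyed by platform means a validation workflow can read `validation_runner` straight from each matrix entry instead of hard-coding a GPU runner for every image.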

atalman · 2024-05-13T20:18:34Z

@pytorchbot cherry-pick --onto release/2.3 -c critical

pytorchbot · 2024-05-13T20:22:19Z

Cherry picking #125617

The cherry pick PR is at #126099 and it is recommended to link a critical cherry pick PR with an issue

Details for Dev Infra team

Raised by workflow job

Separate arm64 and amd64 docker builds (#125617)

Fixes #125094

Please note: the Docker CUDA 12.4 failure is an existing issue, related to the docker image not being available on gitlab:

```
docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: docker.io/nvidia/cuda:12.4.0-cudnn8-devel-ubuntu22.04: not found
```

https://github.com/pytorch/pytorch/actions/runs/8974959068/job/24648540236?pr=125617

Here is the reference issue: https://gitlab.com/nvidia/container-images/cuda/-/issues/225

Tracked on our side: pytorch/builder#1811

Pull Request resolved: #125617
Approved by: https://github.com/huydhn, https://github.com/malfet

(cherry picked from commit b29d77b)

Co-authored-by: atalman <atalman@fb.com>

Separate arm64 and amd64 docker builds

845e3e2

atalman requested a review from a team as a code owner May 6, 2024 18:46

pytorch-bot bot added the topic: not user facing topic category label May 6, 2024

atalman added 5 commits May 6, 2024 11:48

test

04629a0

fix

cf4948e

fix

362006b

fix

8b19d5e

test

7591b5c

atalman commented May 6, 2024

lint

74b2f24

huydhn approved these changes May 6, 2024

lint

fd3c207

malfet approved these changes May 6, 2024

polarathene mentioned this pull request May 7, 2024

Dockerfile should set the syntax directive to v1 #125632

Closed

pytorchmergebot added the merging label May 7, 2024

pytorchmergebot added the Merged label May 7, 2024

pytorchmergebot closed this in b29d77b May 7, 2024

pytorchmergebot removed the merging label May 7, 2024

This was referenced May 7, 2024

Docker Images Validate. Fix arm64 docker builds to not contain cuda versions pytorch/builder#1806

Open

Separate arm64 docker builds from amd64 pytorch/test-infra#5184

Merged

atalman mentioned this pull request May 13, 2024

[v2.3.1] Release Tracker #125425

Open
