Broken Docker Image on dockerhub #125094

Closed
lharri73 opened this issue Apr 27, 2024 · 6 comments
Labels: high priority, module: docker, module: regression, triaged

Comments


lharri73 commented Apr 27, 2024

🐛 Describe the bug

It appears that the Docker images on Docker Hub for 2.3.0 (cuda11.8 and cuda12.1, both runtime and devel) are all malformed.

  1. They target arm64 instead of amd64, unlike all previous images.
  2. The copy from /opt/conda only copied ~300 MB where it normally copies several GB.

Versions

N/A. Image will not run.

cc @ezyang @gchanan @zou3519 @kadeng

@malfet added the module: docker, needs reproduction, high priority, and module: regression labels and removed the needs reproduction label on Apr 27, 2024
Contributor

malfet commented Apr 27, 2024

Hmm, indeed it is the case:

$ docker run --rm -it pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime bash
Unable to find image 'pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime' locally
2.3.0-cuda12.1-cudnn8-runtime: Pulling from pytorch/pytorch
Digest: sha256:cc14d1be87739710ca4e14c344e5d336b4dafde40df1a02cc5ac5c265301868c
Status: Downloaded newer image for pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64/v4) and no specific platform was requested

exec /usr/bin/bash: no such file or directory

and

$ docker run --rm -it pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel bash
Unable to find image 'pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel' locally
2.3.0-cuda12.1-cudnn8-devel: Pulling from pytorch/pytorch
Digest: sha256:0822df0b146549df1f487e30613e4aacf2976185587028866aa98701ea2e5ca8
Status: Downloaded newer image for pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64/v4) and no specific platform was requested
exec /opt/nvidia/nvidia_entrypoint.sh: no such file or directory

@atalman can you please fix it ASAP and let's try to figure out later how it happened?

[Edit] Most likely culprit is this guy: #115949
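
The warning in the logs above is Docker reporting that the image platform and the host platform differ. A minimal Python sketch of that comparison follows. The helper names (`normalize_arch`, `platform_matches`, `host_platform`) and the alias table are assumptions made for illustration, not part of Docker's API.

```python
import platform

# Common aliases for CPU architectures. This table is an illustrative
# assumption and is not exhaustive.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def normalize_arch(name: str) -> str:
    """Map a machine name such as 'x86_64' to the Docker-style 'amd64'."""
    return _ARCH_ALIASES.get(name.lower(), name.lower())


def platform_matches(image_platform: str, host_platform_str: str) -> bool:
    """Compare 'os/arch[/variant]' strings, ignoring any variant suffix."""
    image_os, image_arch = image_platform.split("/")[:2]
    host_os, host_arch = host_platform_str.split("/")[:2]
    return (
        image_os == host_os
        and normalize_arch(image_arch) == normalize_arch(host_arch)
    )


def host_platform() -> str:
    """Best-effort 'os/arch' string for the machine running this script."""
    return f"{platform.system().lower()}/{normalize_arch(platform.machine())}"


# Values taken from the warning in the logs above.
print(platform_matches("linux/arm64", "linux/amd64/v4"))  # False
```

A check like this could run before `docker run` to fail early with a clearer message than `exec ...: no such file or directory`.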


janvdp commented Apr 28, 2024

Hi @lharri73, as a temporary workaround I'm using: "ghcr.io/pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime"

Maybe it helps...

Contributor

atalman commented Apr 29, 2024

@janvdp, @malfet the images in ghcr.io and pytorch/pytorch should be exactly the same. Here is the log:

pytorch/pytorch           2.3.0-cuda11.8-cudnn8-runtime   3578e171db9e   4 days ago     1.18GB
ghcr.io/pytorch/pytorch   2.3.0-cuda11.8-cudnn8-runtime   3578e171db9e   4 days ago     1.18GB
ghcr.io/pytorch/pytorch   2.3.0-cuda11.8-cudnn8-devel     d0edb1392485   4 days ago     9.05GB
pytorch/pytorch           2.3.0-cuda11.8-cudnn8-devel     d0edb1392485   4 days ago     9.05GB
ghcr.io/pytorch/pytorch   2.3.0-cuda12.1-cudnn8-runtime   994d45086c44   4 days ago     1.18GB
pytorch/pytorch           2.3.0-cuda12.1-cudnn8-runtime   994d45086c44   4 days ago     1.18GB
pytorch/pytorch           2.3.0-cuda12.1-cudnn8-devel     c270f91fbe3e   4 days ago     9.24GB
ghcr.io/pytorch/pytorch   2.3.0-cuda12.1-cudnn8-devel     c270f91fbe3e   4 days ago     9.24GB
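
The comparison in the listing above can be automated. The following Python sketch parses rows in the same layout and reports any tag whose image ID differs between repositories. The function names are hypothetical, and the parser assumes the first three whitespace-separated columns are repository, tag, and image ID.

```python
from collections import defaultdict
from typing import Dict, List

# Rows copied from the listing above.
LISTING = """\
pytorch/pytorch           2.3.0-cuda11.8-cudnn8-runtime   3578e171db9e   4 days ago     1.18GB
ghcr.io/pytorch/pytorch   2.3.0-cuda11.8-cudnn8-runtime   3578e171db9e   4 days ago     1.18GB
ghcr.io/pytorch/pytorch   2.3.0-cuda11.8-cudnn8-devel     d0edb1392485   4 days ago     9.05GB
pytorch/pytorch           2.3.0-cuda11.8-cudnn8-devel     d0edb1392485   4 days ago     9.05GB
ghcr.io/pytorch/pytorch   2.3.0-cuda12.1-cudnn8-runtime   994d45086c44   4 days ago     1.18GB
pytorch/pytorch           2.3.0-cuda12.1-cudnn8-runtime   994d45086c44   4 days ago     1.18GB
pytorch/pytorch           2.3.0-cuda12.1-cudnn8-devel     c270f91fbe3e   4 days ago     9.24GB
ghcr.io/pytorch/pytorch   2.3.0-cuda12.1-cudnn8-devel     c270f91fbe3e   4 days ago     9.24GB
"""


def ids_by_tag(listing: str) -> Dict[str, Dict[str, str]]:
    """Return {tag: {repository: image_id}} from `docker images`-style rows."""
    result: Dict[str, Dict[str, str]] = defaultdict(dict)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        repo, tag, image_id = fields[:3]
        result[tag][repo] = image_id
    return dict(result)


def mismatched_tags(listing: str) -> List[str]:
    """Tags whose image ID is not identical across all repositories."""
    return sorted(
        tag
        for tag, repos in ids_by_tag(listing).items()
        if len(set(repos.values())) > 1
    )


print(mismatched_tags(LISTING))  # []
```

Matching local IDs only show that the images pulled on one host are identical. They say nothing about which platforms each registry tag offers.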

Here is the validation workflow for these images:
https://github.com/pytorch/builder/actions/runs/8821461020/job/24217375234#step:11:57

Please note that the failure you see in the validation workflow is caused by this issue, which is still open:
#116696

The issue arises because ghcr.io/pytorch/pytorch contains both arm64 and amd64 images:
Screenshot 2024-04-29 at 10 23 12 AM

For release 2.3 I uploaded arm64 images. Will upload amd64 image now to fix this issue.
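
One way to catch this kind of gap is to compare the platforms listed under the same tag in each registry. The Python sketch below assumes JSON in the shape that `docker manifest inspect` prints for a multi-platform tag: a `manifests` list whose entries carry a `platform` object. The two sample documents are made up for illustration and are not the real manifests.

```python
import json
from typing import Set


def platforms(index_json: str) -> Set[str]:
    """Collect 'os/architecture' strings from a manifest-list document."""
    index = json.loads(index_json)
    found: Set[str] = set()
    # A single-platform manifest has no "manifests" key, so this yields an empty set.
    for entry in index.get("manifests", []):
        plat = entry.get("platform", {})
        os_name = plat.get("os")
        arch = plat.get("architecture")
        if os_name and arch:
            found.add(f"{os_name}/{arch}")
    return found


# Hypothetical sample documents mirroring the situation described above.
GHCR = json.dumps({"manifests": [
    {"platform": {"os": "linux", "architecture": "amd64"}},
    {"platform": {"os": "linux", "architecture": "arm64"}},
]})
DOCKERHUB = json.dumps({"manifests": [
    {"platform": {"os": "linux", "architecture": "arm64"}},
]})

missing = platforms(GHCR) - platforms(DOCKERHUB)
print(sorted(missing))  # ['linux/amd64']
```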

Contributor

atalman commented Apr 29, 2024

Amd64 images uploaded:

Screenshot 2024-04-29 at 11 21 32 AM

Contributor

malfet commented Apr 29, 2024

@atalman just curious, what's inside the 2.3.0-cuda11.8-cudnn8-runtime arm64 image? I assume no CUDA components are bundled with it, are there? In that case, why is there cuda-11.8 in its tag name?

Contributor

atalman commented Apr 29, 2024

@malfet Looks like this is an error. arm64 images should only be tagged ghcr.io/pytorch/pytorch:2.3.0-runtime, since they are built without CUDA support:

docker run --rm -it ghcr.io/pytorch/pytorch:2.3.0-cuda11.8-cudnn8-runtime bash
root@a46ecf7eefa2:/workspace# python
Python 3.10.14 (main, Mar 21 2024, 16:18:23) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
False
>>> 

Created an issue to fix this: pytorch/builder#1806
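
The mismatch between the tag name and the build can be expressed as a small check. In this Python sketch the regular expression and function names are assumptions for illustration, and the caller supplies `built_with_cuda`.

```python
import re
from typing import Optional

# Matches a CUDA version embedded in a tag, e.g. "cuda11.8" -> "11.8".
_CUDA_IN_TAG = re.compile(r"cuda(\d+(?:\.\d+)*)")


def cuda_version_in_tag(tag: str) -> Optional[str]:
    """Return the CUDA version named in an image tag, or None if absent."""
    match = _CUDA_IN_TAG.search(tag)
    return match.group(1) if match else None


def tag_is_consistent(tag: str, built_with_cuda: bool) -> bool:
    """A tag that names a CUDA version is expected to ship a CUDA build."""
    return cuda_version_in_tag(tag) is None or built_with_cuda


print(cuda_version_in_tag("2.3.0-cuda11.8-cudnn8-runtime"))       # 11.8
print(tag_is_consistent("2.3.0-cuda11.8-cudnn8-runtime", False))  # False
print(tag_is_consistent("2.3.0-runtime", False))                  # True
```

Inside a container, `built_with_cuda` could come from `torch.version.cuda is not None`. `torch.cuda.is_available()` also returns False on a host without a usable GPU, so on its own it does not prove the build lacks CUDA.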

@atalman added this to the 2.3.1 milestone on Apr 29, 2024
@cpuhrsch added the triaged label and removed the triage review label on Apr 29, 2024
5 participants