HTTP2 INTERNAL_ERROR when pulling large image from NGC with CE 4.1.x #2854
It appears that this is due to an issue with the google/go-containerregistry dependency that 4.x is using to obtain images. There is a problem related to HTTP/2 flow control that manifests with newer golang.org/x/net - the same issue has been seen elsewhere: googleapis/google-cloud-go#7440 (comment). It is probably only triggered by the very large NGC images, as smaller images don't involve a lot of streams or hit timeouts for stuck streams. As a workaround, please could you try disabling HTTP2?
This should force it to use HTTP/1.1, which hopefully works. We will have to follow up on the issue with the developers at Google who maintain the go-containerregistry module.
Need to check our code too... is there anywhere we are holding open a reader that we receive from ggcr?
Thank you very much for the workaround!
FATAL: While performing build: packer failed to pack: while unpacking tmpfs: error unpacking rootfs: unpack entry: opt/conda/pkgs/black-22.1.0-pyhd8ed1ab_0/site-packages/black-22.1.0.dist-info/AUTHORS.md: link: no such file or directory
It appears that the latest singularity version does not give as much build detail in its output as older versions did. Below are the full outputs for both use cases.
@ashwinidr23 - to confirm... does disabling HTTP2 stop the error during download? If so, I'm glad, and thank you for trying that. On RHEL 8.8 it looks like you are using the 3.11 version of Singularity, but on 8.9 you are using the 4.1 version? If so, it's not a huge surprise that the behaviour is different. The dependency used for extracting images was changed between the 3.x and 4.x versions of SingularityCE. There was a fix released in 4.1.2 which is quite likely related to the error that you see on 4.1.1: https://github.com/sylabs/singularity/releases/tag/v4.1.2
If you can try v4.1.2 you might find it fixes the issue you are seeing.
@dtrudg - building the container on RHEL 8.9 with singularity version 4.1.2 did work.
@tri-adam has done some experiments outside of Singularity and cannot replicate the error. It is quite likely something related to our progress bar implementation, which provides a custom http.RoundTripper. We will need to look into this and see whether response bodies are not being closed through the progress proxy reader... and if so, why. It could also be somewhere else... more digging required.
Spent a lot of time trying to replicate this error today, pulling the container from NGC.
I note that the same issue when fetching from NGC has been reported against another tool, zarf. Also note that someone saw a similar error message from crane: google/go-containerregistry#1392 ... and the response noted an open Go issue.
Having looked through our code, we are doing a straight Write into a ggcr v1.Layout from a v1.Image in a registry. We don't do anything with the layers, and we aren't responsible for closing any response bodies, as we don't use them ourselves. Our progress bar code supplies ggcr with a custom http.RoundTripper, but this just sets ... I am stumped. Could this be an NGC + HTTP2 specific issue? Is it possible NGC is served from multiple locations, and that could explain my inability to replicate?
Other observations...
Sorry for the late reply! I was able to reproduce this issue on an on-prem RHEL 8.8 server with singularity version 4.1.1. But, like you mentioned, I have seen the same def file work on cloud instances with the same configuration. Let me know if I can help in any way with data to debug the issue until you are able to reproduce the error.
Does your on-prem environment happen to be in North America? ... East coast, non-cloud is the only place we've been able to reproduce so far. At this point it doesn't seem likely that there is a bug in Singularity itself. It is most likely an interaction between the NGC host and the Go HTTP2 client libraries causing issues, as far as I can tell. If true, it's not something we can necessarily fix.
Yes, it is in North America. I can also reproduce this from Asia. From what you have tested, it does sound like this is mostly due to the Go HTTP2 client libraries. As for the other issue of "file not found" - as you pointed out, this has been addressed in v4.1.2. Hopefully Go will fix the underlying issue; until then I will rely on the workaround you have provided!
I use the singularity RPM that comes from RHEL EPEL. We recently updated the OS image from RHEL 8.8 to 8.9, and since then I am unable to build singularity containers. I can build small containers like lolcow, but large images like the ones built using NGC are failing with an internal server error.
Sharing the details of my observations and the error below.
on RHEL 8.8
host1:~ $ singularity --version
singularity-ce version 3.11.5-1.el8
host1:~ $ uname -r
4.18.0-477.27.1.el8_8.x86_64
Container build is successful here. Output file attached.
on RHEL 8.9
host2:~ $ singularity --version
singularity-ce version 4.1.1-1.el8
host2:/scratch $ uname -r
4.18.0-513.9.1.el8_9.x86_64
Container build fails here. Output file attached.
I also tried upgrading the singularity version on a RHEL 7 system to 4.1.1 and building the same container, but I get the same error as I get on RHEL 8.9.
host3:/usr/local/bin $ singularity --version
singularity-ce version 4.1.0-rc.1+213-ge0050f0
host3 $ uname -r
3.10.0-1160.92.1.el7.x86_64
I have tried disabling all the iptables rules and disabling SELinux, with no luck.
I have attached the def file and the outputs of both the failed and successful builds for your reference. Can you please help me with this issue? Thank you!
modulus-def.txt
rhel88-singV3-host1-output.txt
rhel89-singV4-host2-output.txt