HTTP2 INTERNAL_ERROR when pulling large image from NGC with CE 4.1.x #2854

Closed

ashwinidr23 opened this issue Apr 29, 2024 · 11 comments

@ashwinidr23

I use the Singularity RPM that comes with RHEL EPEL. We recently updated the OS image from RHEL 8.8 to 8.9, and since then I am unable to build Singularity containers. I can build small containers like lolcow, but large images, like the ones built from NGC, fail with an internal server error.
Sharing the details of my observations and the error below.

On RHEL 8.8:

host1 ~$ singularity --version
singularity-ce version 3.11.5-1.el8
host1 ~$ uname -r
4.18.0-477.27.1.el8_8.x86_64

The container build is successful here; output file attached.

On RHEL 8.9:

host2 ~$ singularity --version
singularity-ce version 4.1.1-1.el8
host2 ~$ uname -r
4.18.0-513.9.1.el8_9.x86_64

The container build fails here; output file attached.

I also tried upgrading Singularity to 4.1.1 on a RHEL 7 system and building the same container, but I get the same error as on RHEL 8.9.

host3 $ /usr/local/bin/singularity --version
singularity-ce version 4.1.0-rc.1+213-ge0050f0
host3 $ uname -r
3.10.0-1160.92.1.el7.x86_64

I have tried disabling all the iptables rules and disabling SELinux, with no luck.

I have attached the def file and the outputs of the failed and successful builds for your reference. Can you please help me with this issue? Thank you!
modulus-def.txt
rhel88-singV3-host1-output.txt
rhel89-singV4-host2-output.txt

@dtrudg
Member

dtrudg commented Apr 30, 2024

It appears that this is due to an issue with the google/go-containerregistry dependency that 4.x uses to obtain images.

There is a problem related to HTTP/2 flow control that manifests with newer golang.org/x/net versions; the same issue has been seen elsewhere:

googleapis/google-cloud-go#7440 (comment)

It is probably only triggered by the very large NGC images, as smaller images don't involve many concurrent streams or hit the timeouts for stuck streams.

As a workaround, could you please try disabling HTTP/2 using the GODEBUG=http2client=0 environment variable:

# If running build without sudo
export GODEBUG=http2client=0
singularity build .....

# If running build with sudo
sudo GODEBUG=http2client=0 singularity build ...

This should force it to use HTTP/1.1, which hopefully works. We will have to follow up on the issue with the developers at Google who produce the go-containerregistry module.
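(For reference, GODEBUG=http2client=0 is a process-wide switch. Per the net/http documentation, a Go program can also opt out of HTTP/2 for a single client by setting Transport.TLSNextProto to a non-nil, empty map. A minimal sketch of that approach follows; the nvcr.io URL is purely illustrative.)

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// A non-nil, empty TLSNextProto map tells net/http not to attempt
	// an HTTP/2 upgrade, so the client speaks HTTP/1.1 over TLS.
	transport := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	client := &http.Client{Transport: transport}

	// Illustrative request only; any HTTPS endpoint will do.
	resp, err := client.Get("https://nvcr.io/v2/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Proto) // "HTTP/1.1" rather than "HTTP/2.0"
}
```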

@dtrudg
Member

dtrudg commented Apr 30, 2024

Need to check our code too... is there anywhere we are holding open a reader that we receive from ggcr?
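(For illustration, this is the kind of pattern such an audit looks for; extractLayer below is a hypothetical helper, not SingularityCE code. Any ReadCloser obtained from a ggcr layer is backed by a registry HTTP response body, so leaking one can hold an HTTP/2 stream open.)

```go
package sketch

import (
	"io"

	v1 "github.com/google/go-containerregistry/pkg/v1"
)

// extractLayer is a hypothetical helper showing the correct pattern.
func extractLayer(layer v1.Layer, dst io.Writer) error {
	rc, err := layer.Compressed() // backed by the blob HTTP response
	if err != nil {
		return err
	}
	// Returning early without closing rc would keep the underlying
	// stream open until the server times it out.
	defer rc.Close()

	_, err = io.Copy(dst, rc)
	return err
}
```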

@ashwinidr23
Author

Thank you very much for the workaround!
This seems to be working for a few of the containers that I tried, except for the def file that I shared here. This may or may not be related to the latest Singularity version, but it's quite odd that I am able to build the container from the same def file successfully on RHEL 8.8, while on RHEL 8.9 it gives me the error shared below.

FATAL: While performing build: packer failed to pack: while unpacking tmpfs: error unpacking rootfs: unpack entry: opt/conda/pkgs/black-22.1.0-pyhd8ed1ab_0/site-packages/black-22.1.0.dist-info/AUTHORS.md: link: no such file or directory

It also appears that the latest Singularity version does not include as much build detail in its output as older versions did. Below are the full outputs for both cases.

host1-rhel88-output.txt
host2-rhel89-output.txt

@dtrudg
Member

dtrudg commented Apr 30, 2024

@ashwinidr23 - to confirm... disabling HTTP/2 stops the error during download? If so, I'm glad, and thank you for trying that.

On RHEL 8.8 it looks like you are using the 3.11 version of Singularity, but on 8.9 you are using the 4.1 version?

If so, it's not a huge surprise that the behaviour is different. The dependency used for extracting images was changed between the 3.x and 4.x versions of SingularityCE. There was a fix released in 4.1.2 which is quite likely related to the error that you see on 4.1.1.

https://github.com/sylabs/singularity/releases/tag/v4.1.2

Fix target: no such file or directory error in native mode when extracting layers from certain OCI images that manipulate hard links across layers.

If you can try v4.1.2 you might find it fixes the issue you are seeing.

@dtrudg changed the title from "singularity build error with version 4.1.1-1.el8" to "HTTP2 INTERNAL_ERROR when pulling large image from NGC with CE 4.1.x" on Apr 30, 2024
@ashwinidr23
Author

@dtrudg - building the container on RHEL 8.9 with Singularity version 4.1.2 did work.
Thanks a lot for the help! It's been a blocker for me for a few days :)

@dtrudg
Member

dtrudg commented May 1, 2024

@tri-adam has done some experiments outside of Singularity and cannot replicate this. It's quite likely something related to our progress bar implementation, which provides a custom http.RoundTripper. We will need to look into this and see if response bodies are not being closed through the progress proxy reader... and if so, why.

It could also be somewhere else... more digging required.

@dtrudg
Member

dtrudg commented May 2, 2024

Spent a lot of time trying to replicate this error today, pulling the container from NGC.

  • I was unable to replicate it from my slow-ish UK VDSL connection.
  • I was unable to replicate it from a US-EAST-1 AWS EC2 instance... even when adding artificial packet loss, and trying out interface throughput restrictions.
  • @wobito was able to reproduce multiple times from his 1Gbps connection in Canada.
  • @tri-adam was able to reproduce previously from his >1Gbps connection in Canada.

I note that the same issue when fetching from NGC has been reported against another tool, zarf:

defenseunicorns/zarf#2408

Also note that someone saw a similar error message from crane:

google/go-containerregistry#1392

... and the response noted an open Go issue:

golang/go#36759

Having looked through our code, we are doing a straight Write into a ggcr v1.Layout from a v1.Image in a registry. We don't do anything with the layers. We aren't responsible for closing any response bodies as we don't use them ourselves.
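(Roughly the shape of that data flow, as a sketch rather than the actual SingularityCE code; the image reference is hypothetical.)

```go
package main

import (
	"log"

	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/empty"
	"github.com/google/go-containerregistry/pkg/v1/layout"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

func main() {
	// Hypothetical reference; any large NGC image would do.
	ref, err := name.ParseReference("nvcr.io/nvidia/modulus/modulus:23.05")
	if err != nil {
		log.Fatal(err)
	}

	// remote.Image is lazy: blob bodies are streamed over HTTP(/2)
	// only when the layers are actually read.
	img, err := remote.Image(ref)
	if err != nil {
		log.Fatal(err)
	}

	// Create an OCI layout on disk and append the image to it; ggcr
	// manages the per-blob response bodies internally.
	p, err := layout.Write("./oci-layout", empty.Index)
	if err != nil {
		log.Fatal(err)
	}
	if err := p.AppendImage(img); err != nil {
		log.Fatal(err)
	}
}
```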

Our progress bar code is supplying ggcr with a custom http.RoundTripper, but this just sets response.Body to a mpb.ProxyReader(response.Body)... we don't consume, modify, or handle closing of the response.Body ourselves.
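(A hypothetical reconstruction of that RoundTripper pattern, assuming the vbauerster/mpb progress bar library; not the actual SingularityCE code.)

```go
package sketch

import (
	"net/http"

	"github.com/vbauerster/mpb/v8"
)

// progressRoundTripper delegates the request and wraps the response
// body in a progress-bar proxy reader, without otherwise touching it.
type progressRoundTripper struct {
	inner http.RoundTripper
	p     *mpb.Progress
}

func (t *progressRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.inner.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	bar := t.p.AddBar(resp.ContentLength)
	// The ReadCloser returned by ProxyReader closes the wrapped body
	// when it is closed, so the caller's resp.Body.Close() still
	// reaches the real response body; nothing here should leak streams.
	resp.Body = bar.ProxyReader(resp.Body)
	return resp, nil
}
```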

I am stumped... could this be an NGC + HTTP/2-specific issue? Is it possible NGC is served from multiple locations, and that could explain my inability to replicate?

@dtrudg
Member

dtrudg commented May 2, 2024

Other observations...

  • NGC is using Akamai. @tri-adam and @wobito are hitting a different endpoint than I was from us-east-1 or the UK.
  • @wobito seems to be able to get different behaviour through throttling: at 15Mbps and above, the error was thrown.
  • Always successful for me from us-east-1, including with similar throttling.
  • Similar errors don't happen for comparably large images in Docker Hub.

@ashwinidr23
Author

Sorry for the late reply! I was able to reproduce this issue on an on-prem RHEL 8.8 server with Singularity version 4.1.1, but like you mentioned, I have seen the same def file work on cloud instances with the same configuration. Let me know if I can help in any way with data to debug the issue until you are able to reproduce the error.

@dtrudg
Member

dtrudg commented May 7, 2024

> Sorry for the late reply! I was able to reproduce this issue on an on-prem RHEL 8.8 server with Singularity version 4.1.1, but like you mentioned, I have seen the same def file work on cloud instances with the same configuration. Let me know if I can help in any way with data to debug the issue until you are able to reproduce the error.

Does your on-prem environment happen to be in North America? ... East coast non-cloud is the only place we've been able to reproduce so far.

At this point it doesn't seem likely that there is a bug in Singularity itself. As far as I can tell, it is most likely an interaction between the NGC host and the Go HTTP/2 client libraries that is causing issues. If true, it's not something we can necessarily fix.

@ashwinidr23
Author

ashwinidr23 commented May 7, 2024

Yes, it is in North America. I can reproduce this from Asia as well. From what you have tested, it does sound like this is mostly due to the Go HTTP/2 client libraries. As for the other issue of "file not found", as you pointed out, this has been addressed in v4.1.2.

Hopefully Go will fix the issue. Until then I will rely on the workaround you have provided!
