Intermittent "connection reset by peer" while pushing image #785
Comments
@pliljenberg can you get nginx error / access log, and registry logs as well? |
@dmp42 I'll see what I can dig out. The log level is INFO, I think, so it might not say much. I looked at the logs and there were no errors in them, just |
nginx logs for build, it starts by getting a base image and then ends on a POST with no errors:
Corresponding registry logs, same thing here...
|
No errors at all in your server logs rules out the registry and nginx. Do you use any sort of proxy? Any chance you can run mtr there and figure out if you have packet loss? |
Had to open for
Not sure if it is related but I upgraded the docker-client (on the CI host) to Instead we got a Networking seems to be a bit unstable for docker in general? Not sure if that's environment related, Go related or docker related. |
These issues are likely unrelated. i/o timeout most of the time really is just an indication that you are having connectivity issues. The MTR output doesn't look right. Any chance you can run it again? |
@dmp42 I ran
64 bytes from ec2-52-17-235-32.eu-west-1.compute.amazonaws.com (52.17.235.32): icmp_seq=1 ttl=63 time=0.495 ms
Running against the raw IP gives the same result, am I doing it wrong?
Start: Wed Aug 5 08:57:35 2015
Could be a problem with
From my local machine I get:
|
I seem to be having the same issue. Also on Amazon EC2 with docker 1.7.1 (I am getting the connection reset by peer message). I just enabled S3 logs and am waiting for that to take effect so I can see the log output. |
I can't see any s3 logs about connection reset or timeouts.. |
@jasonf20 for @pliljenberg there are no logs on the server side either. This further points to network issues in between. Running MTR and spotting packet loss is the only way to get there, I think. |
I am having the same issue. I've got a registry-v2 on an ec2 instance behind nginx. I've got another instance running Jenkins building images. When I push an image I get the |
If that happens only on jenkins, then the issue is likely there... is there any proxy there? virtual machine involved? running in a container? |
@dmp42 In my case I have a server running teamcity (amazon ec2 server - with no inner vm or docker) which sends directly to the docker-registry (also directly on ec2 instance). No nginx, proxy or the likes. I do have TLS enabled on the registry. I think that even if this is a network related issue, (I'll also try to run mtr soon - but suspect no packet loss) then docker should be more resilient towards it. For example perform a couple of retries, or resend of failed data. EDIT: Here is the MTR (which is fine as I suspected - I also let it run for longer, with no issues):
|
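The retries suggested above can also be approximated on the client side, outside Docker. A minimal sketch in Python, assuming that the push command exits non-zero on failure and that re-running it is safe; the function name `run_with_retries` and its parameters are hypothetical and not part of Docker or the registry:

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, base_delay=1.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff while it exits non-zero.

    Returns the 1-based number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait 1x, 2x, 4x ... base_delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


# Hypothetical usage (image name taken from the examples in this thread):
# run_with_retries(["docker", "push", "registry.example.com/someimage"])
```

This only papers over the resets; it does not address whatever is causing them.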
Additionally, I also ran |
I did some more testing. I've got 2 EC2 instances. One is running a registry-v2 container, the other has an image which I am trying to push to the registry. I also have my own machine which is not in AWS, it is my laptop.
1: Push to local address from other ec2 instance
I tagged the image I am trying to push with the local IP address and port of the registry: docker tag someplace/someimage 172.x.x.x:5000/someimage
I added the registry as an insecure registry to the docker opts and restarted the daemon. I can push the image successfully (it was not present in the registry). I removed all data from the registry and pushed the image again successfully (I did this 3 times to be sure it wasn't a one-off).
2: Push to external address from other ec2 instance
I tagged the image I am trying to push with the external IP address and port of the registry: docker tag someplace/someimage 52.x.x.x:5000/someimage
I added the registry as an insecure registry to the docker opts and restarted the daemon. When I push the image I get an
I removed all data from the registry and pushed the image again without success (I did this 3 times to be sure it wasn't a one-off). Some layers are pushed successfully, but then one will fail. When I try again some more will be pushed when I am lucky. There is a layer of 200MB which I have not been able to push.
3: Push to external address from a machine outside of AWS
I tagged the image I am trying to push with the external IP address and port of the registry: docker tag someplace/someimage 52.x.x.x:5001/someimage
I added the registry as an insecure registry to the docker opts and restarted the daemon. I can push the image successfully (it was not present in the registry). I removed all data from the registry and pushed the image again successfully (I did this 3 times to be sure it wasn't a one-off).
4: Push to hostname + nginx from other ec2 instance
I tagged the image I am trying to push with the hostname of the registry: docker tag someplace/someimage registry.example.com/someimage
I added the registry as an insecure registry to the docker opts and restarted the daemon. When I push the image I get an
I removed all data from the registry and pushed the image again without success (I did this 3 times to be sure it wasn't a one-off). Some layers are pushed successfully, but then one will fail. When I try again some more will be pushed when I am lucky.
5: Push to hostname + nginx from machine outside of AWS
I tagged the image I am trying to push with the hostname of the registry: docker tag someplace/someimage registry.example.com/someimage
I added the registry as an insecure registry to the docker opts and restarted the daemon. I can push the image successfully (it was not present in the registry). I removed all data from the registry and pushed the image again successfully (I did this 3 times to be sure it wasn't a one-off).
Final notes:
|
So it appears to be an issue with Amazon resetting connections when using external addresses between instances. And not a docker or registry or nginx issue. Is somebody able to reproduce this? |
@VogonogoV smells like ELB timeout? |
@dmp42 I don't use ELB. Any other suggestions? I posted it on the Amazon forums as well, maybe they know what's up. |
I have the same issue exactly, but do not use elb, so I don't think it's
|
Ok. Keep us posted if you get any infos from Amazon / forums. |
I haven't had time to read through the full issue but I suspect that the |
@stevvooe Thanks for the suggestion, but nginx is not the problem. I'll summarize the issue as I am seeing it.
When it doesn't work I get My conclusion: it is not a docker, registry, nor nginx problem. I think it's an AWS problem. @dmp42 I'll try tcpdumping when I have the time. I'll post updates here. |
@VogonogoV That table makes it pretty clear that the issue has something to do with ec2. I wonder if you need to use a different dns entry for inside ec2. The issue may be that routing using the externally available 52.x.x.x IP is not optimal for host to host ec2 transfers. |
@stevvooe As a workaround I am now building images on the ec2 machine and tagging them with the local address: |
@VogonogoV Did you try tweeting? Their support tends to work better with motivation. ;) @RichardScothern @dmp42 @aaronlehmann Should we add a blurb in documentation about this when using s3? |
I hit the same problem. Any progress on the AWS side? I would like to avoid the workaround from @VogonogoV |
@robertfirek Unfortunately, this is an AWS issue and we don't really have the means to investigate. I'd recommend you open an issue with their support team. @dmp42 Any ideas here? |
I haven't heard anything from AWS. I didn't tweet because I don't have Twitter. So anyone who wants to, feel free. Opening an issue might give some more insights. I've been using the workaround for 3 weeks now and I haven't looked back at it again. |
@stevvooe @VogonogoV Thank you for your update. |
@VogonogoV @robertfirek I have flexed the power of the tweet: https://twitter.com/stevvooe/status/644448877978845184. We'll see if they help. @VogonogoV Do you have a link to the forum post? |
Any news on this issue? |
I was having this issue and thought it might be related to timeouts based on an earlier comment. I thought this might have something to do with the block write size being too large and occasionally timing out so I changed (shortened? it's in KB, right?) it and explicitly set to the below and the problems seemed to resolve afterwards. storage: |
@13josh Users experiencing this issue are having connection resets while pushing and pulling over some VPC configuration. If you are experiencing this issue, changing the buffer size may help, for certain configurations, as it can change the connection envelope for a single given chunk request. That said, we are not sure that a change in buffer size having an effect is indicative of this problem. In fact, if that worked, it may be a different issue. @EliasGabrielsson @13josh I'd recommend relating your experience to https://forums.aws.amazon.com/thread.jspa?threadID=215090. |
I get this issue as well. I'm not using EC2 but a private network instead. Registry version 2 behind an nginx proxy. |
@dolphyvn This issue covers "connection reset" with AWS. If you're having this while not using AWS, you're likely having another, unrelated problem. |
FYI, after spending a few hours on this with a sysadmin, we figured out that when you use After a bunch of tcpdump and stuff, it seems that RST packets are being sent by the host (and not the docker container), resulting in "connection reset by peer". When you Hope it helps, I'm copying this on the AWS issue too. Bye |
Well this is unfortunate because this means you can't reliably host distribution on AWS using ECS since ECS doesn't support |
@kepeket That is an interesting data point. It might be worthwhile relating your experience on https://forums.aws.amazon.com/thread.jspa?threadID=215090. |
Hi. Yes, I tried, but the Amazon login page was looping infinitely and I don't know why. I'll retry a bit later.
|
I hit the issue when I did a I collected a tcpdump to see what was happening. At the moment the upload failed, the EC2 instance was receiving a packet that was very far out of sequence. Its sequence number and timestamp were several seconds behind the actual TCP stream. Interestingly, this did not seem to be a retransmit of a previously-sent packet. Presumably this packet is generated within AWS' infrastructure. Normally a packet like this should be treated as a spurious retransmit and ignored, but for some reason it was causing the local host to generate a RST packet and kill the connection. Given the anecdote in #785 (comment) that running the registry container in host networking mode works around the issue, I suspected this had something to do with how Docker bridge networking works. When operating in bridged mode, Docker creates some iptables rules to perform NAT between the exposed address/port and the container's internal address/port. I had a look at Linux's NAT implementation, which builds on top of
...I haven't been able to trigger the issue anymore. If this is indeed a successful workaround, should we consider having Docker Engine switch on that flag by default? |
@aaronlehmann Are you sure this workaround fixes the problem? Because |
@mrjana: I'll do more testing tomorrow to confirm the workaround. I'm not sure what causes the RST packet to get generated. Perhaps there's an FWIW, I found this post, which describes a similar problem that was resolved by switching on the "be liberal" flag: https://www.pitt-pladdy.com/blog/_20091125-185551_0000_Linux_Netfilter_and_Window_Scaling/ |
@mrjana: I can confirm that the setting makes a difference. I set |
Thanks for confirming that @aaronlehmann. I took a look at all the places in the kernel where an RST could be generated, and they are mainly generated when a disconnect or close happens with unread data or if there is memory pressure in the host. So it looks like the RST is happening due to some indirect sequence of events after receiving the packet with the unexpected sequence number, and I think we need to connect some missing dots there. I will try to take a look at the packet capture that you sent me and see if I can tell when exactly the RST is getting generated in relation to the packet with the unexpected sequence number. |
@mrjana: I think I get it now. When conntrack treats a packet like this as "invalid", it doesn't associate it with the flow that its tuple corresponds to. Thus, the packet doesn't get rewritten by the NAT rule, and ends up being handled as if it was part of a connection to the host's actual IP address. The host sees that it doesn't have a matching flow, and (correctly) sends a RST packet. I found I can also work around this by adding a rule to the
This prevents the packet from being interpreted as destined to the pre-NAT IP address, and prevents the RST from being generated. |
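The explanation above can be illustrated with a toy model. This is only a sketch of the decision as described in this thread, not kernel or Docker code: the names `Packet` and `handle`, the string results, and the simplification that the host itself has no listener on the published port are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dst_port: int
    in_window: bool  # False models the far-out-of-sequence packet from the capture


def handle(packet, published_ports, drop_invalid=False, be_liberal=False):
    """Return what happens to an inbound segment of an established, NATed flow."""
    # conntrack only associates the packet with its flow if it considers it
    # valid; the "be liberal" flag makes it accept out-of-window packets too.
    tracked = packet.in_window or be_liberal
    if tracked and packet.dst_port in published_ports:
        # The NAT rule rewrites the destination to the container.
        return "forwarded to container"
    if not tracked and drop_invalid:
        # The workaround rule: drop packets that conntrack marks as invalid.
        return "dropped"
    # No NAT rewrite happened, so the host's own stack gets the packet,
    # finds no matching connection, and answers with a reset.
    return "rst from host"
```

In this model an out-of-window packet to a published port yields "rst from host" by default, "dropped" with `drop_invalid=True`, and "forwarded to container" with `be_liberal=True`, which matches the two workarounds discussed in this thread.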
@aaronlehmann Yes, I agree with your diagnosis. That is indeed what is happening here. And dropping the packet if ctstate is invalid is a much better fix than enabling the be_liberal flag, although the rule should probably be inserted in the DOCKER chain in the PREROUTING hook. Can you please file an issue in docker/libnetwork to get it fixed, although it has a very limited chance of making it into 1.11? |
@aaronlehmann @mrjana Bravo! |
Filed moby/libnetwork#1090. Will also reach out to AWS with our conclusions. |
I reported the findings to AWS. Some testing showed this affected instances in Ireland, but not Northern Virginia. They were able to reproduce the problem, and seem to have fixed it in the Ireland region. I'm going to close this ticket since we now know it's not a distribution-specific problem, and the AWS issue appears to be resolved. |
I recently hit the same issue. It was caused by a problem affecting my network card. On my host (Windows Server 2016), with Hyper-V enabled for docker, I had created an external virtual switch using the network interface card, and since then I hit the issue. It was resolved when I removed the virtual switch. Long story short, check your network interfaces (drivers, firmware...) |
My 3G internet network disconnects when I try to |
@ikr0m then it's likely an issue with your 3G provider, right? |
@dubo-dubon-duponey I'm not sure why it should be an issue with the provider. After executing |
It's hard to tell without more debugging information, but what you describe here looks like your 3G connection is not stable - which is why I suggested it might be an issue with your provider. If that's the case, then it's unlikely there is anything docker can do here. "docker push" does not make you lose your uplink (3G or otherwise) by itself... Suggestion: try to HTTP POST random content using curl, and see if it triggers the same problem with your connection. |
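The suggestion above uses curl; the same check can be sketched with Python's standard library instead. This is a minimal sketch under stated assumptions: the helper name `post_random_bytes` is hypothetical, and the target URL must be an endpoint you control that accepts POST requests.

```python
import os
import urllib.request


def post_random_bytes(url, size=1024 * 1024, timeout=30):
    """POST `size` random bytes to `url`; return (HTTP status, bytes sent)."""
    payload = os.urandom(size)
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    # urlopen raises URLError / HTTPError on connection or HTTP failures.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, len(payload)
```

If a large upload like this also drops the connection, the problem is more likely in the network path than in docker push.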
I am also getting the same error, "connection reset by peer", while pushing an image to a Docker registry via Jenkins. |
We're running docker-registry-v2 on an AWS EC2 instance - backed by an EBS volume (switched from S3 since we thought that might be the underlying issue).
From another AWS EC2 instance we're running a Bamboo CI agent which builds docker images and pushes them to our docker-registry.
Several times each day we get failed builds caused by docker push getting connection reset by peer.
Configuration and host information below, how can I debug this issue?
The docker registry is fronted by nginx configured as below:
Docker registry configuration:
docker compose is used to start it all:
Docker registry host:
Bamboo CI host: