Add workaround for spurious retransmits leading to connection resets #1090

aaronlehmann · 2016-04-08T18:56:29Z

There is a longstanding issue over at distribution/distribution#785 where users reported connection resets trying to push to an AWS-hosted registry from inside the AWS network. After months, we've finally narrowed this down to a bad interaction between spurious TCP retransmits and the NAT rules that Docker sets up for bridge networking.

Here is a summary of what happens:

For some reason, when an AWS EC2 machine connects to itself using its external-facing IP address, there are occasional packets with sequence numbers and timestamps that are far behind the rest.
Normally these packets would be ignored as spurious retransmits. However, because the packets fall outside the TCP window, Linux's conntrack module marks them invalid, and their destination addresses do not get rewritten by DNAT.
The packets are eventually interpreted as packets destined to the actual address/port in the IP/TCP headers. Since there is no flow matching these, the host sends a RST.
The RST terminates the actual NAT'd connection, since its source address and port match the NAT'd connection.
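
One rough way to check whether a host is seeing packets that conntrack marks invalid is to watch conntrack's own counters. The following is a minimal sketch, assuming the conntrack-tools CLI is installed and that conntrack -S prints per-CPU key=value statistics including an invalid= field (the exact output layout is an assumption). sum_invalid is a hypothetical helper name, not part of any tool.

```shell
# Sum the per-CPU "invalid=" counters from `conntrack -S`-style text on stdin.
# Assumes each line is a series of key=value fields (format is an assumption).
sum_invalid() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^invalid=[0-9]+$/) {
                split($i, kv, "=")
                total += kv[2]
            }
    }
    END { print total + 0 }'
}
```

Run as root, conntrack -S | sum_invalid before and after a failing transfer gives two numbers to compare. A growing count is only a hint: the counter covers every packet conntrack considers invalid, not just the out-of-window ones described above.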

I think it would be hugely helpful for libnetwork to include a workaround for this. It has affected a lot of users trying to use the registry in AWS, and it presumably affects other Dockerized applications as well. While I'll reach out to AWS to point out the spurious retransmits, I don't know if they'll be able to fix them, and there may also be other environments with similar issues.

I've found two possible workarounds:

Turn on conntrack's "be liberal" flag: echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal. This causes conntrack/NAT to treat packets outside the TCP window as part of the flow being tracked, instead of marking them invalid and causing them to be handled by the host.
Add a rule to drop invalid packets instead of allowing them to trigger RSTs: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP
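
If the second workaround is applied from a boot script or configuration management, it may help to make it idempotent so that repeated runs do not stack duplicate rules. This is a minimal sketch, assuming an iptables build that supports the -C (check) option. IPTABLES and ensure_invalid_drop are hypothetical names introduced only for illustration.

```shell
# Insert the DROP rule only if an identical rule is not already present.
# IPTABLES is a hypothetical override so the sketch can be tried without root.
IPTABLES="${IPTABLES:-iptables}"

ensure_invalid_drop() {
    # -C exits non-zero when no matching rule exists in the chain.
    if ! "$IPTABLES" -C INPUT -m conntrack --ctstate INVALID -j DROP 2>/dev/null; then
        "$IPTABLES" -I INPUT -m conntrack --ctstate INVALID -j DROP
    fi
}
```

Calling ensure_invalid_drop as root applies the rule once. As noted in this issue, the rule is system-wide and not limited to Docker flows.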

Both of these can potentially affect non-Docker traffic. The former causes NAT to forward packets that it would otherwise err on the side of not forwarding, which seems relatively harmless, but it's a system-level setting, so it's not limited to Docker flows. The latter would drop any packets that conntrack deems invalid, system-wide, unless we added specific destination filters for the addresses/ports that Docker set up NAT rules for, which could add overhead.

It may be too late to hope for a workaround to be included in Docker 1.11, but anything we can do on this front will really improve the lives of Docker users on AWS.


thaJeztah · 2016-05-02T06:00:50Z

@aaronlehmann I saw that the linked issue turned out to be an issue with AWS. Is there still something that needs to be done in libnetwork?

aaronlehmann · 2016-05-02T17:18:12Z

@thaJeztah: This issue is a suggestion to work around problems like this in libnetwork. The problem came from a combination of invalid packets generated somewhere in AWS' infrastructure, and the NAT setup used by libnetwork reacting to those invalid packets by tearing down the connection. This means the invalid packets cause problems for Dockerized applications but they are harmless for most other setups. moby/moby#19532 revealed that this problem was also seen on a residential internet connection. I think there is value in finding a workaround.

jrabbit · 2016-06-04T00:19:40Z

I'm being bitten by this in production. What more information could I provide?

middleagedman · 2016-06-26T04:57:21Z

Same here. Simple Docker container build on an Arch Linux system on a residential connection. Just trying to do a git clone from an HTTPS git site (Bitbucket).

GnuTLS recv error (-54): Error in the pull function.
Closing connection 1
error: RPC failed; result=56, HTTP code = 200
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

BenSjoberg · 2016-07-28T01:33:14Z

Just ran into this on my office's internal network. Thankfully I found this page or all my hair would be ripped out by morning.

The iptables workaround did the trick for me, thanks very much for providing that. If it helps, I'm running Docker 1.11.2 on Ubuntu 16.04. Let me know if there's any more information I can give that would be useful.

GordonTheTurtle · 2017-08-30T00:27:26Z

@aaronlehmann It has been detected that this issue has not received any activity in over 6 months. Can you please let us know if it is still relevant:

For a bug: do you still experience the issue with the latest version?
For a feature request: was your request appropriately answered in a later version?

Thank you!
This issue will be automatically closed in 1 week unless it is commented on.
For more information please refer to #1926

aaronlehmann · 2017-08-30T03:17:42Z

A fix was implemented in AWS. I don't think a workaround is necessary anymore.

mitchcapper · 2017-09-07T01:05:46Z

I will comment that this does happen on networks outside of AWS. The iptables fix does fix it; however, you first have to find this issue to learn that. The errors are very generic, so if implementing the fix in Docker is not a big deal, it would probably save some people many hours of research :)

vduglued · 2017-10-14T17:17:56Z

Any solution to this problem on a macOS host?

p53 · 2017-11-03T11:46:12Z

We have a similar problem: downloading a file into our Docker image from Nexus throws "connection reset by peer". Adding the iptables rule fixes it.

guillon · 2018-09-28T12:34:24Z

As has been reported multiple times (@middleagedman, @BenSjoberg, @mitchcapper, @p53), the iptables fix resolves the issue ('connection reset by peer', or a RST packet sent at TCP level).
Quick fix (ref @aaronlehmann): iptables -I INPUT -m conntrack --ctstate INVALID -j DROP

The issue actually occurs in any container running in the default bridge network. How frequently it occurs depends on a lot of factors (bandwidth, latency, host load), but it will occur at some point. It is probably misunderstood most of the time and incorrectly attributed to a transient network partition, which it is not. It is a bug in the NAT setup installed by Docker.

We face this issue with a perfectly valid TCP client-server transfer (for instance a curl from a container downloading a large file through HTTP from an external server at high throughput). Do the very same download from the host directly and all is fine. Do it from a container on the same host and it breaks.

The problem, as already mentioned by @aaronlehmann, is that benign "invalid" packets to the SNAT'ed container (caused for instance by TCP window overflow due to high throughput but a slow client) are assigned to the host interface and incorrectly considered martians, which causes a connection reset.
This is a limitation of conntrack, which does not differentiate perfectly legal packets causing TCP window overflow from actually malformed packets (all get treated as INVALID). Hence the need to drop any conntrack INVALID packet seen when installing SNAT'ed virtual networks.

This problem is referenced in several places, due to this netfilter/conntrack limitation:
https://serverfault.com/a/312687
https://www.spinics.net/lists/netfilter/msg51409.html
Quoting the last link from netfilter mailing list:

If NAT is enabled, never ever let packets with INVALID state pass through, because NAT will skip them.
Best regards,
Jozsef

The source NAT rules in iptables are installed by Docker for its bridge network support, and they are incomplete.
It should be the responsibility of Docker to set this up correctly.
Apparently this was never fixed, hence my request to re-open this issue.

I can attempt to make a pull request if that helps, or I can open a new issue if needed; just tell me.

Note that the abandoned pull request attempt #1129 does not fix the issue because the inserted rule does not drop the packets. There should be no filter on the destination because at that time the destination is not yet NAT'ed. Any conntrack invalid packets in the filter INPUT chain have to be dropped, as in: iptables -I INPUT -m conntrack --ctstate INVALID -j DROP.

Add drop of conntrack INVALID packets in input such that invalid packets due to TCP window overflow do not cause a connection reset. Due to some netfilter/conntrack limitations, invalid packets are never treated as NAT'ed but reassigned to the host and considered martians. This causes a RST response from the host and resets the connection. As soon as NAT is setup, for bridge networks for instance, invalid packets have to be dropped in input. The implementation adds a generic DOCKER-INPUT chain prefilled with a rule for dropping invalid packets and a return rule. As soon as some bridge network is setup, the DOCKER-INPUT chain call is inserted in the filter table INPUT chain. Fixes moby#1090. Signed-off-by: Christophe Guillon <christophe.guillon@st.com>
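
Read literally, the commit message above corresponds roughly to the rules below. This is a sketch derived from the description only, not the pull request's actual code. The rule order and the position of the jump in INPUT are assumptions, and docker_input_rules is a hypothetical helper that only prints the commands so they can be reviewed first.

```shell
# Print iptables commands approximating the DOCKER-INPUT chain described above.
# Nothing is executed; the output is meant to be reviewed, then run as root.
docker_input_rules() {
    cat <<'EOF'
iptables -N DOCKER-INPUT
iptables -A DOCKER-INPUT -m conntrack --ctstate INVALID -j DROP
iptables -A DOCKER-INPUT -j RETURN
iptables -I INPUT -j DOCKER-INPUT
EOF
}
```

Keeping the DROP rule in a dedicated chain means the jump from INPUT only needs to be inserted once a bridge network exists, which matches the behavior the commit message describes.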

dcui · 2019-04-30T23:49:36Z

FYI:
"/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal" has gone since 2016-08-13 (see "netfilter: remove ip_conntrack* sysctl compat code" https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adf0516845bcd0e626323c858ece28ee58c74455)

Now I think we should use "/proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal" instead.
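
Since the path depends on the kernel version, a script that sets the flag can probe for whichever file exists. This is a minimal sketch using only the two paths named in this thread. be_liberal_path and PROC_ROOT are hypothetical names, with PROC_ROOT there only so the lookup can be tried against a scratch directory.

```shell
# Print the first "be liberal" sysctl file that exists, newest name first.
# PROC_ROOT is a hypothetical override for testing; the real tree is /proc.
be_liberal_path() {
    root="${PROC_ROOT:-/proc}"
    for p in \
        "$root/sys/net/netfilter/nf_conntrack_tcp_be_liberal" \
        "$root/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal"
    do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    # Neither file exists (for example, conntrack is not loaded).
    return 1
}
```

As root, path=$(be_liberal_path) && echo 1 > "$path" then enables the flag on either kind of kernel. The setting remains system-wide, as discussed earlier in this issue.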

johannesboon · 2019-06-08T21:37:13Z

FYI: This is also an issue for kubernetes that they are trying to solve with similar strategies:

kubernetes/kubernetes#74840

guillon · 2019-06-10T13:33:51Z

Hi @aaronlehmann,
I think that the issue was closed but never fixed. Could you consider re-opening it?
Note that PR #2275 solves the issue.

unilynx · 2019-11-28T22:59:20Z

I'm using neither AWS nor Kubernetes, and I see the issue too between our office network (which our CI runners use) and external resources at DigitalOcean or maxmind.com. It generally manifests itself as

curl: (56) OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104

With tcpdump I see lost but then reappearing packets (they reappeared after about 90ms or 200KB of data) triggering a RST. I'm not sure where the actual problem is; I'm assuming our ISP is doing something funky or a link aggregation is messing up packets. It happens mostly during quiet hours and the actual network issue is something we probably have to live with, but a 90ms packet delay shouldn't terminate connections.

The liberal sysctl fixes our issue (and firewalling RST probably would too), but as the issue is not AWS-specific (or even K8s-specific), I too think this issue should be reopened.
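
For anyone wanting to look for the same RSTs in a capture, a pcap filter on the TCP RST flag narrows things down. This is a minimal sketch. rst_filter is a hypothetical helper name, and the interface and address in the usage note are placeholders.

```shell
# Build a pcap filter matching TCP segments with the RST flag set,
# to or from a single host given as the first argument.
rst_filter() {
    printf 'host %s and tcp[tcpflags] & tcp-rst != 0\n' "$1"
}
```

For example, tcpdump -ni eth0 "$(rst_filter 203.0.113.10)" run on the Docker host shows only resets exchanged with that remote address.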

Drop invalid packets to deal with moby/libnetwork#1090

rwkarg · 2020-06-22T21:17:48Z

This is impacting us as well just using docker. Should this issue be reopened?

leakingtapan · 2020-10-26T06:06:14Z

Had the same issue on GCP when downloading a large file from inside a container using curl. The iptables rule solves the problem for me. Another workaround was to use wget instead of curl; note that this workaround might not be generally applicable to all cases.

ssup2 · 2020-11-19T14:13:02Z

Hello. To solve this problem, I developed a Kubernetes controller called network-node-manager. By simply deploying and configuring network-node-manager, you can apply the iptables -I INPUT -m conntrack --ctstate INVALID -j DROP rule to all nodes of the cluster. Please try it and give me feedback. Thanks.

https://github.com/kakao/network-node-manager

karunchennuri · 2020-12-17T18:20:48Z

This issue very much exists in the non-AWS, non-GCP world as well. We run our clusters on-prem and were able to reproduce this issue, especially with outbound requests carrying larger payloads. Getting into details...

Problem: An app team reported an issue with their app's behavior. This app calls an external service with payloads of various sizes. In cURL terms, it's nothing but passing JSON payloads in --data-raw. What was weird was that the requests went through fine with smaller payloads, but when the payload size reached a certain number of KB, the request went outbound through the firewall and got executed on the external service, but the response never reached the container. We thought it was an intermittent issue, but no: we could reproduce it 100% of the time with a certain request payload size.

Steps we took to narrow it down:

To rule out any bad behavior of the app itself due to a coding issue, we used the simplest possible client, i.e. running cURL directly from within an SSH'd container instance.
We ran cURL with a smaller payload from the worker node where the container is hosted; this worked.
We ran cURL with a larger payload from the worker node; this worked.
We then ran cURL with a smaller payload from within the container (app instance); this worked.
We ran cURL with a larger payload from the container; this failed (intermittently at times).
We took packet captures on the container's virtual interface (overlay networking) and on the default interface, eth0.
The captures on the virtual interface showed no abnormal behavior, but the pcap on eth0 showed RSTs from the worker node to the external service within a second or two of the request being initiated.
We took captures on the external endpoint as well as on the firewall. All of them showed symptoms of the problem but not the root cause.
We tried running the same cURL in other clustered environments based on Kubernetes. We could reproduce this issue on every Docker runtime. Although Cloud Foundry uses Garden, it still delegates the job to runC, the container runtime based on Docker code.
For us, running echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal on the worker node did the trick! Thanks @aaronlehmann for taking the time to draft this issue. Had we not stumbled on it, I'm not sure how many man-hours we would have spent troubleshooting.

Since this is a system-level setting that impacts more than just Docker traffic, we are still looking for the best action that meets our environment's needs. I am not inclined to recommend a resolution step; I just wanted to share my experience of how it took several man-hours of effort to identify the root cause. Reading through the responses above, I'm curious to know how this was fixed in AWS and whether a fix for this exists in any of the Docker releases (considering this issue showed up 4 years ago). If this is not yet fixed, what's the best way forward to reopen this issue?

akerouanton · 2023-08-07T13:47:48Z

The fact that AWS implemented a fix doesn't mean this issue disappeared. As mentioned by users above, this can still happen in some cases. I'll reopen it and I'll backport the PR submitted by @guillon into github.com/moby/moby in the upcoming weeks.

aaronlehmann mentioned this issue Apr 8, 2016

Intermittent "connection reset by peer" while pushing image distribution/distribution#785

Closed

aboch mentioned this issue Apr 20, 2016

Drop invalid packets destined to bridge networks #1129

Closed

mevatron mentioned this issue May 1, 2016

hypriot/docker network stability issues. moby/moby#19532

Closed

StefanScherer mentioned this issue May 2, 2016

Improve hypriot/docker network stability hypriot/image-builder-rpi#57

Closed

aaronlehmann closed this as completed Aug 30, 2017

p53 mentioned this issue Nov 3, 2017

please add firewall rule for docker NAT #2009

Closed

guillon linked a pull request Oct 2, 2018 that will close this issue

Fix bridge connection reset due to invalid packets #2275

Open

anfernee mentioned this issue Mar 2, 2019

"Connection reset by peer" due to invalid conntrack packets kubernetes/kubernetes#74839

Closed

DP19 mentioned this issue Oct 16, 2019

Random / Sporadic 502 gateway timeouts kubernetes/ingress-nginx#4433

Closed

danking mentioned this issue Mar 18, 2020

[batch] fix retry of clone hail-is/hail#8190

Merged

Hexcles mentioned this issue Apr 24, 2020

TaskCluster sometimes fails due to network issues web-platform-tests/wpt#21529

Closed

imbstack added a commit to mozilla-platform-ops/monopacker that referenced this issue Apr 24, 2020

Drop invalid packets to deal with moby/libnetwork#1090

7b41c8c

imbstack mentioned this issue Apr 24, 2020

Drop invalid packets to deal with moby/libnetwork#1090 mozilla-platform-ops/monopacker#53

Merged

imbstack added a commit to mozilla-platform-ops/monopacker that referenced this issue Apr 24, 2020

Merge pull request #53 from taskcluster/wpt-21529

4ef6da8

Drop invalid packets to deal with moby/libnetwork#1090

uablrek mentioned this issue May 12, 2023

"Connection reset by peer" due to invalid conntrack packets kubernetes/kubernetes#117924

Closed

sam-thibault mentioned this issue Aug 7, 2023

Ubuntu/Debian - connection resets inside container moby/moby#20220

Closed

akerouanton reopened this Aug 7, 2023

uablrek mentioned this issue Sep 6, 2023

only drop invalid cstate packets if non liberal kubernetes/kubernetes#120412

Merged
