
Ongoing connection reset by peer #3175

Open
AbhiramDwivedi opened this issue Apr 17, 2024 · 5 comments
Labels: for/user-attention, status/need-feedback, type/bug

Comments


AbhiramDwivedi commented Apr 17, 2024

Two of our microservices running Spring Boot, deployed on AWS EKS, keep running into intermittent "connection reset by peer" errors.

We have already applied #1774 (comment) and actually used shorter timeouts and evictions, but it does not help.

The problem does not happen when invocations are made from React applications to the Spring Boot server, or from Spring Boot clients to non-Reactor-based microservices. It is possible that the problem is in the infrastructure, but AWS does not accept that. In essence, this is hard to replicate outside of "our" environment, or outside the individual environments in which others have faced it.

Expected Behavior

The subscriber should validate a connection before it uses it. If this is not the default, it should at least be an option. Stack Overflow and the reactor-netty tracker are flooded with issues like this, going back years, so it only makes sense to provide a code-level option that works across scenarios.

Actual Behavior

Intermittent error:
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Request to GET http://application-URL [DefaultWebClient]
Original Stack Trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)

Independent of this, the server has the following logs, which may or may not be related:

  • Last HTTP packet was sent, terminating the channel
  • Channel inbound receiver cancelled (subscription disposed).

Steps to Reproduce

Unable to replicate outside of our environment. Even within our environment, this happens only when calls are made between Spring Boot applications running in two different EKS clusters; it does not happen when the applications run in the same EKS cluster.

Possible Solution

  • Validate a connection before using it, or
  • Provide an option to disable the connection pool (a sketch of this option follows below)
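
A minimal sketch of the second option as it could be approximated today, assuming Reactor Netty's ConnectionProvider.newConnection(), which hands out a fresh connection per request instead of reusing pooled ones; the class and method names below are illustrative:

```java
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class NoPoolWebClientFactory {

    // Every request opens a fresh connection, so a stale pooled connection
    // can never be reused; this trades connection-setup latency for robustness.
    public static WebClient noPoolWebClient() {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```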

Your Environment

Spring Boot applications running in two different EKS clusters.

  • Reactor version(s) used: projectreactor:reactor-core:jar:3.4.33
  • Other relevant libraries versions (eg. netty, ...): reactor-netty-core:jar:1.0.38
  • JVM version (java -version): openjdk version "11.0.22" 2024-01-16 LTS, OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
  • OS and version (eg. uname -a): rhel 8
  • Spring Boot : spring-boot-starter-webflux:jar:2.7.17
@AbhiramDwivedi AbhiramDwivedi added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Apr 17, 2024
@violetagg violetagg self-assigned this Apr 17, 2024
@violetagg violetagg added for/user-attention This issue needs user attention (feedback, rework, etc...) and removed status/need-triage A new issue that still need to be evaluated as a whole labels Apr 17, 2024
violetagg (Member) commented

@AbhiramDwivedi Have you checked https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed, especially the part about a network component dropping a connection silently?
If you have checked that, please provide the TCP dump.
You might be interested in checking this https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda (in case you use AWS NLB) and this https://youtu.be/O4oZS-SAq14?t=526
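
For reference, a hedged sketch of the client-side pool settings the FAQ discusses (maxIdleTime, maxLifeTime, background eviction), assuming Reactor Netty 1.0.x; the duration values are placeholders that should be tuned below the idle timeout of any intermediate network component:

```java
import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class TunedWebClientFactory {

    // Builds a WebClient whose pool evicts connections before a silent
    // network component (load balancer, NAT, firewall) is likely to drop them.
    public static WebClient tunedWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("tuned-pool")
                .maxConnections(50)                         // placeholder pool size
                .maxIdleTime(Duration.ofSeconds(20))        // drop idle connections before the peer does
                .maxLifeTime(Duration.ofSeconds(60))        // cap total connection lifetime
                .evictInBackground(Duration.ofSeconds(30))  // evict proactively, not only on acquire
                .build();

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(HttpClient.create(provider)))
                .build();
    }
}
```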

github-actions bot commented

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

AbhiramDwivedi (Author) commented

Hi @violetagg: Those links were quite useful in understanding the TCP settings. We tried pretty aggressive settings and that did not help. This is almost solved with TCP changes on the target cluster, namely:

  • keep-alive reduced from 3000 to 300
  • keep-alive-requests increased from 100 to 1000
  • upstream-keepalive-timeout increased from 60 to 300

However, we still run into it sometimes, and there is no consistent way of reproducing or solving this.

A project worth hundreds of millions was delayed because of this, and is now live with known intermittent issues. All network and dev teams have exhausted their capacity. Sometimes it is OK to move on rather than stay stuck trying to solve it.

For a case like this, and other future cases, I would expect the project developers to add an option to disable the pool so the client behaves like RestTemplate. We will probably make that change on our end anyway, and use two different ways of invoking endpoints.
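
One possible shape for "two different ways of invoking endpoints" in Spring configuration, assuming an auto-configured, pooled WebClient for same-cluster calls and an unpooled one for the cross-cluster calls that hit the reset; bean and class names are illustrative, not an official API:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Configuration
public class DualWebClientConfig {

    // Default pooled client for calls inside the same EKS cluster.
    @Bean
    public WebClient pooledWebClient(WebClient.Builder builder) {
        return builder.build();
    }

    // Unpooled client for cross-cluster calls that intermittently hit
    // "connection reset by peer"; each request opens a fresh connection.
    // Inject with @Qualifier("unpooledWebClient") where needed.
    @Bean
    public WebClient unpooledWebClient(WebClient.Builder builder) {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());
        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```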

This bug is not about "my" issue, but rather about a permanent solution.

violetagg (Member) commented

@AbhiramDwivedi You changed the timeouts on the target, but did you add any configuration on your client, e.g. maxIdleTime, as suggested in our FAQ?


github-actions bot commented May 7, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
