
Ongoing connection reset by peer #3175

Open
AbhiramDwivedi opened this issue Apr 17, 2024 · 5 comments
Labels: for/user-attention, status/need-feedback, type/bug

Comments


AbhiramDwivedi commented Apr 17, 2024

Two of our microservices running Spring Boot, deployed on AWS EKS, keep running into intermittent "connection reset by peer" errors.

We have already applied #1774 (comment) and actually used shorter timeouts and evictions, but it does not help.

The problem does not happen when invocations are made from React applications to the Spring Boot server, or from Spring Boot clients to non-Reactor-based microservices. It is possible that the problem is in the infrastructure, but AWS does not accept that. In essence, this is hard to replicate outside of "our" environment, or outside the individual environments in which others have faced it.

Expected Behavior

The subscriber should validate a connection before it uses it. If this is not the default, it should at least be an option. Stack Overflow and the reactor-netty tracker are flooded with issues like this, going back years, so it only makes sense to provide a code-level option that works across scenarios.

Actual Behavior

Intermittent error:
Caused by: org.springframework.web.reactive.function.client.WebClientRequestException: recvAddress(..) failed: Connection reset by peer; nested exception is io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Request to GET http://application-URL [DefaultWebClient]
Original Stack Trace:
at org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:141)
at reactor.core.publisher.MonoErrorSupplied.subscribe(MonoErrorSupplied.java:55)

Independent of this, the server has the following logs, which may or may not be related:

  • Last HTTP packet was sent, terminating the channel
  • Channel inbound receiver cancelled (subscription disposed).

Steps to Reproduce

Unable to replicate outside of our environment. Even within our environment, this happens only when calls are made between Spring Boot applications running in two different EKS clusters; it does not happen when the applications run in the same EKS cluster.

Possible Solution

  • Validate a connection before using it, or
  • Provide an option to disable the connection pool (a sketch of this option follows below)
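
A minimal sketch of the second option as it could be approximated today, assuming Reactor Netty's ConnectionProvider.newConnection(), which hands out a fresh connection per request instead of reusing pooled ones; the class and method names below are illustrative:

```java
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class NoPoolWebClientFactory {

    // Every request opens a fresh connection, so a stale pooled connection
    // can never be reused; this trades connection-setup latency for robustness.
    public static WebClient noPoolWebClient() {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());
        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```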

Your Environment

Spring Boot applications running in two different EKS clusters.

  • Reactor version(s) used: projectreactor:reactor-core:jar:3.4.33
  • Other relevant libraries versions (eg. netty, ...): reactor-netty-core:jar:1.0.38
  • JVM version (java -version): openjdk version "11.0.22" 2024-01-16 LTS, OpenJDK Runtime Environment (Red_Hat-11.0.22.0.7-1) (build 11.0.22+7-LTS)
  • OS and version (eg. uname -a): rhel 8
  • Spring Boot : spring-boot-starter-webflux:jar:2.7.17
@AbhiramDwivedi AbhiramDwivedi added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Apr 17, 2024
@violetagg violetagg self-assigned this Apr 17, 2024
@violetagg violetagg added for/user-attention This issue needs user attention (feedback, rework, etc...) and removed status/need-triage A new issue that still need to be evaluated as a whole labels Apr 17, 2024
violetagg (Member) commented

@AbhiramDwivedi Have you checked https://projectreactor.io/docs/netty/release/reference/index.html#faq.connection-closed, especially the part about a network component dropping a connection silently?
If you have checked that, please provide the TCP dump.
You might be interested in checking this https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda (in case you use AWS NLB) and this https://youtu.be/O4oZS-SAq14?t=526
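
For reference, a hedged sketch of the client-side pool settings the FAQ discusses (maxIdleTime, maxLifeTime, background eviction), assuming Reactor Netty 1.0.x; the duration values are placeholders that should be tuned below the idle timeout of any intermediate network component:

```java
import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class TunedWebClientFactory {

    // Builds a WebClient whose pool evicts connections before a silent
    // network component (load balancer, NAT, firewall) is likely to drop them.
    public static WebClient tunedWebClient() {
        ConnectionProvider provider = ConnectionProvider.builder("tuned-pool")
                .maxConnections(50)                         // placeholder pool size
                .maxIdleTime(Duration.ofSeconds(20))        // drop idle connections before the peer does
                .maxLifeTime(Duration.ofSeconds(60))        // cap total connection lifetime
                .evictInBackground(Duration.ofSeconds(30))  // evict proactively, not only on acquire
                .build();

        return WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(HttpClient.create(provider)))
                .build();
    }
}
```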

github-actions bot commented

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

AbhiramDwivedi (Author) commented

Hi @violetagg: Those links were quite useful in understanding the TCP settings. We tried pretty aggressive settings and that did not help. This is almost solved with TCP changes on the target cluster, namely:

  • keep-alive reduced from 3000 to 300
  • keep-alive-requests increased from 100 to 1000
  • upstream-keepalive-timeout increased from 60 to 300

However, we still run into it sometimes, and there is no consistent way of reproducing or solving this.

A project worth hundreds of millions was delayed because of this, and is now live with known intermittent issues. All network and dev teams have exhausted their capacity. Sometimes it is OK to move on rather than stay stuck trying to solve it.

For a case like this, and other future cases, I would expect the project developers to add an option to disable the pool so the client behaves like RestTemplate. We will probably make that change on our end anyway, and use two different ways of invoking endpoints.
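
One possible shape for "two different ways of invoking endpoints" in Spring configuration, assuming an auto-configured, pooled WebClient for same-cluster calls and an unpooled one for the cross-cluster calls that hit the reset; bean and class names are illustrative, not an official API:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Configuration
public class DualWebClientConfig {

    // Default pooled client for calls inside the same EKS cluster.
    @Bean
    public WebClient pooledWebClient(WebClient.Builder builder) {
        return builder.build();
    }

    // Unpooled client for cross-cluster calls that intermittently hit
    // "connection reset by peer"; each request opens a fresh connection.
    // Inject with @Qualifier("unpooledWebClient") where needed.
    @Bean
    public WebClient unpooledWebClient(WebClient.Builder builder) {
        HttpClient httpClient = HttpClient.create(ConnectionProvider.newConnection());
        return builder
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```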

This bug is not about "my" issue, but rather about a permanent solution.

violetagg (Member) commented

@AbhiramDwivedi You changed the timeouts on the target, but did you add any configuration on your client, e.g. maxIdleTime, as suggested in our FAQ?


github-actions bot commented May 7, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
