socket hang up #1426

Closed
aaronovz1 opened this issue Jul 23, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@aaronovz1

Overview

We keep getting these occasional socket hang up errors, which seem to cause our backend job to hang.

Steps to reproduce

This is difficult to reproduce, but the job that most often triggers it is heavy on async calls, with many RPC requests firing off in parallel using the same Connection object. A rough sketch of that call pattern is below (hypothetical; the accounts are placeholders).
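
```typescript
import { Connection, PublicKey } from "@solana/web3.js";

// Hypothetical reproduction sketch: many concurrent RPC calls sharing one
// Connection, so responses ride on pooled keep-alive sockets that may have
// gone stale between bursts of work.
const connection = new Connection("https://rpcpool.com/[REDACTED]");

async function fetchAccounts(addresses: PublicKey[]) {
  // Fire off all requests in parallel over the same Connection.
  return Promise.all(
    addresses.map((address) => connection.getAccountInfo(address)),
  );
}
```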

Description of bug

```
FetchError: request to https://rpcpool.com/[REDACTED] failed, reason: socket hang up
    at ClientRequest.<anonymous> (/usr/app/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:513:28)
    at TLSSocket.socketOnEnd (node:_http_client:526:9)
    at TLSSocket.emit (node:events:525:35)
```
@aaronovz1 aaronovz1 added the bug Something isn't working label Jul 23, 2023
@aaronovz1
Author

Possibly an issue with outdated node-fetch, as I'm seeing these errors as well after updating to 1.78:
node-fetch/node-fetch#1219

@steveluscher
Collaborator

Try configuring your own keep-alive timeout (with advice from your RPC provider) as mentioned here, and report back!
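
For reference, a minimal sketch of what that configuration might look like, assuming a @solana/web3.js version whose ConnectionConfig accepts an httpAgent; the timeout value is a placeholder, not a recommendation:

```typescript
import https from "https";
import { Connection } from "@solana/web3.js";

// Keep-alive agent shared by every request this Connection makes. The number
// below is a placeholder; your RPC provider should tell you what idle timeout
// their load balancer uses so you can stay under it.
const httpAgent = new https.Agent({
  keepAlive: true,
  timeout: 5_000, // placeholder socket timeout in ms
});

const connection = new Connection("https://rpcpool.com/[REDACTED]", {
  commitment: "confirmed",
  httpAgent,
});
```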

@aaronovz1
Author

Thanks @steveluscher, chatting with Triton now.

What do you think about these issues causing code to hang, though? It smells like a secondary issue with error handling somewhere, either in this package or in node-fetch.

@steveluscher
Collaborator

steveluscher commented Jul 24, 2023

I drew diagrams over at solana-labs/solana#29130!

@steveluscher
Collaborator

Unless your question is ‘why isn't the infra resilient, even in the face of this connection tracking bug?’, to which my answer is: if you can't trust the connection tracking code, all bets are basically off.

@aaronovz1
Author

aaronovz1 commented Jul 25, 2023

> Unless your question is ‘why isn't the infra resilient, even in the face of this connection tracking bug?’, to which my answer is: if you can't trust the connection tracking code, all bets are basically off.

No, I'm asking why the Solana web3 Connection seemingly causes the code path using it to hang indefinitely. The issue we're having is that we first see the socket hang up error, and then shortly after the process/job hangs forever. I don't see how your diagram explains that; it looks more like it should time out and throw some kind of error, not hang.

I guess my concern is that there is another issue at play, triggered by problems such as socket hang up, which Connection doesn't handle well in an async environment. i.e.: we fix the socket hang up issue with the keep-alive settings etc., but some other connection problem occurs and we're left hanging again.

@steveluscher
Collaborator

steveluscher commented Jul 26, 2023

It's been 6 months since I've had all the context loaded into my head, but here's what I remember:

  • The client can only rely on what the proxy tells it. The proxy (load balancer in Triton's case) says the socket is open.
  • What the proxy doesn't know is that the RPC has since gone away. The socket is, in fact, not healthy.
  • The client can't tell the difference between a network error caused by a transient issue and one caused by the proxy and the RPC breaking up, so it just keeps retrying according to the retry logic. To you, this manifests as an infinite retry loop (i.e. a hang).

If we start guessing why the network is emitting errors, we will be wrong some amount of the time. The solution is either:

  1. for the HTTP Agent on the client, the load balancer, and the RPC server to all agree on a keep-alive timeout, or
  2. to set httpAgent to false on the client and give up on keep-alive altogether (see the sketch below).
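
As a rough sketch of option 2, reusing the redacted endpoint from above:

```typescript
import { Connection } from "@solana/web3.js";

// Option 2 from the list above: opt out of HTTP keep-alive entirely. Every
// request opens a fresh socket, trading some latency for never reusing a
// socket the load balancer may have silently closed.
const connection = new Connection("https://rpcpool.com/[REDACTED]", {
  commitment: "confirmed",
  httpAgent: false,
});
```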

@github-actions
Contributor

github-actions bot commented Aug 2, 2023

Because there has been no activity on this issue for 7 days since it was closed, it has been automatically locked. Please open a new issue if it requires a follow up.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 2, 2023