socket hang up (ECONNRESET) - Web3js #27859

Closed
dancamarg0 opened this issue Sep 17, 2022 · 8 comments · Fixed by #29130
Assignees: steveluscher
Labels: community (Community contribution), javascript (Pull requests that update Javascript code), web3.js (Related to the JavaScript client)

Comments


dancamarg0 commented Sep 17, 2022

Problem

We at Triton One have seen many developers hit an error like this when using web3.js:

FetchError: request to https://client.rpcpool.com/ failed, reason: socket hang up
    at ClientRequest.<anonymous> (/home/ec2-user/processes/dex-webserver-mainnet-multi/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (node:events:527:28)
    at ClientRequest.emit (node:domain:475:12)
    at TLSSocket.socketOnEnd (node:_http_client:478:9)
    at TLSSocket.emit (node:events:539:35)
    at TLSSocket.emit (node:domain:475:12)
    at endReadableNT (node:internal/streams/readable:1345:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21) {
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}

I've been collecting tcpdump captures from our servers, and in the vast majority of cases this is caused by an RST packet sent by our HAProxy load balancer, which abruptly closes the connection on the client side. See this screenshot as an example.
[screenshot: packet capture showing the FIN/PSH/RST exchange between the load balancer and the client]

IP: 204.16.246.170 (Load Balancer managed by Triton)
IP: 18.237.101.162 (Client)

  1. Notice the Load Balancer first sends a FIN flag indicating to the client the socket will close.
  2. Shortly afterwards the client attempts to PUSH data to a read/write-closed socket.
  3. The server responds with a TCP RST flag.
  4. Node.js surfaces this abrupt disconnection as the error shown above.

This seems to be a common issue across many Node.js applications when I search stackoverflow.com. While the client can simply swallow the error and retry (sketched below), this draws complaints from our customers, who expect to extract maximum read performance from our servers.
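
Here's a rough sketch of that retry-on-reset workaround. The endpoint is the one from the log above; the retry count, backoff, and error matching are illustrative assumptions, not anything web3.js does for you.

```ts
import { Connection } from "@solana/web3.js";

const connection = new Connection("https://client.rpcpool.com");

// Retry a call when the socket is reset mid-request ("socket hang up" /
// ECONNRESET). Retry count and backoff are arbitrary example values.
async function getSlotWithRetry(retries = 2): Promise<number> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await connection.getSlot();
    } catch (err: any) {
      const isReset =
        err?.code === "ECONNRESET" || /socket hang up/i.test(String(err?.message));
      if (!isReset || attempt >= retries) throw err;
      // Back off briefly; the next attempt will open a fresh socket.
      await new Promise((resolve) => setTimeout(resolve, 250 * (attempt + 1)));
    }
  }
}
```

This "works", but it only papers over the underlying keep-alive mismatch, which is why we'd prefer a change in the library defaults.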

Proposed Solution

Here are a few proposed solutions:

  1. Remove HTTP keep-alive functionality completely from web3.js so it closes sockets as soon as the client gets a response.

  2. Enforce client-side timeouts in the HTTP keep-alive settings, e.g. https://github.com/solana-labs/solana-web3.js/blob/master/src/agent-manager.ts#L13 could be set to {keepAlive: true, maxSockets: 25, timeout: 30000} (30s), or shorter; see the sketch after this list.

Note: 2) likely won't solve the issue completely, but it should reduce the error rate. The errors don't happen at a fixed interval; it varies per application — some customers see it every X minutes while others see it pretty much every few seconds. So here I'm proposing that the client close the socket before the LB does, so the connection is never abruptly reset from the server side.

  3. Destroy the socket right after sending a new request; see this interesting discussion in the Node.js repo: Make it possible to forcibly RST a net.Socket nodejs/node#27428. At the bottom, folks created a PR in May attempting to fix this; it may be good for Node developers to have a look.
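
As a sketch of what 2) could look like, here's a keep-alive agent with the values proposed above. The numbers are the ones suggested in this issue, not the library's current defaults, and whether the agent is built here or inside agent-manager.ts is an implementation detail.

```ts
import https from "node:https";

// Keep sockets alive for reuse, but cap the pool and set a socket timeout so
// idle connections are torn down on the client side before the load balancer
// resets them.
const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 25,
  timeout: 30_000, // 30s; could be shorter, depending on the load balancer
});
```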
@dancamarg0 dancamarg0 added the community Community contribution label Sep 17, 2022
@steveluscher steveluscher self-assigned this Sep 17, 2022
@steveluscher steveluscher added the javascript Pull requests that update Javascript code label Sep 17, 2022
@0xCactus

Bump on this as Solend often sees this error

@steveluscher steveluscher added the web3.js Related to the JavaScript client label Dec 2, 2022

y2kappa commented Dec 5, 2022

Bump also, Hubble and Kamino have oracle staleness issues due to this.

@steveluscher (Contributor)

Love it. I'll dig into this this week.

@steveluscher (Contributor)

K, here's what I think I've learned from this excellent article on tuning keep-alive.

  • The underlying HTTP library that the Solana RPC uses (hyper) has a default keep-alive timeout of 20s.
  • Typical Node.js servers have a default keep-alive timeout of 5s.
  • When the RPC is behind a load balancer, a higher ‘free socket timeout’ in the load balancer can result in the RPC closing the socket, but the load balancer (ergo, the client) thinking that it's still open. The next request will fail.
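
To make the mismatch concrete, here's a sketch of the server side of that rule of thumb: whichever hop is closer to the client should give up on an idle socket first, so a Node.js server sitting behind a load balancer wants its keep-alive timeout above the load balancer's idle timeout. The 60s load-balancer idle timeout below is an assumed example, not a number from this thread.

```ts
import http from "node:http";

const LB_IDLE_TIMEOUT_MS = 60_000; // assumed load-balancer idle timeout

const server = http.createServer((req, res) => {
  res.end("ok");
});

// Node's default keepAliveTimeout is 5s; raise it past the load balancer's
// idle timeout so the load balancer always closes the idle connection first.
server.keepAliveTimeout = LB_IDLE_TIMEOUT_MS + 1_000;
// headersTimeout should stay above keepAliveTimeout to avoid premature resets
// between requests.
server.headersTimeout = server.keepAliveTimeout + 1_000;

server.listen(8080);
```

The client-side version of the same rule is what the fixes below do: the hop closest to the client (the web3.js agent) times out first.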


I believe the solutions to be as follows:

  1. Let people supply their own agents, or disable agents altogether, should they like to do some tuning (feat: you can now supply your own HTTP agent to a web3.js Connection #29125). See the sketch below this list.
  2. Reduce the timeout of our default agent to the Solana RPC's timeout minus one second (20s - 1s = 19s) (fix: reduce Connection keep-alive timeout to 1 second fewer than the Solana RPC's keep-alive timeout #29130).
  3. RPC providers should maybe do the same – setting their load balancer timeouts to 1 second less than the Solana RPC's timeout (20s - 1s = 19s).

Let's discuss over at #29130.
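
For anyone who wants to try 1), here's a rough sketch of supplying a tuned agent, assuming the `httpAgent` connection option described in #29125; the exact option name and the ability to pass `false` to disable agents entirely are taken from that PR's description, so double-check against the released API.

```ts
import https from "node:https";
import { Connection } from "@solana/web3.js";

// An agent whose idle-socket timeout is shorter than every upstream hop
// (load balancer and RPC), following the 20s - 1s = 19s rule of thumb above.
const httpAgent = new https.Agent({ keepAlive: true, timeout: 19_000 });

const connection = new Connection("https://client.rpcpool.com", { httpAgent });

// Or opt out of connection reuse entirely (trading latency for simplicity):
// const connection = new Connection("https://client.rpcpool.com", { httpAgent: false });
```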

@gallynaut

@steveluscher Switchboard is seeing better performance with this version of web3.js.

Thanks for getting this fixed. Will report back if anything changes.

@steveluscher (Contributor)

Rad. What exactly does “better” look like in your case, @gallynaut?

@gallynaut

We monitor event-loop health for our oracles. With the ECONNRESET issue, the oracles would be blocked for anywhere from 1s to 2min, which caused some feeds to go stale. With this patch we no longer see the event-loop-blocked warnings.
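
For context, the health check is essentially event-loop delay monitoring; a minimal sketch of that kind of check (threshold and interval here are illustrative, not our production values):

```ts
import { monitorEventLoopDelay } from "node:perf_hooks";

// Sample event-loop delay and warn when the loop has been blocked for too long.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const maxDelayMs = histogram.max / 1e6; // histogram records nanoseconds
  if (maxDelayMs > 1_000) {
    console.warn(`event loop blocked for up to ${maxDelayMs.toFixed(0)}ms`);
  }
  histogram.reset();
}, 10_000);
```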

@steveluscher (Contributor)

Yaaaas. This is great news. @gallynaut, can you check out this discussion from another team that's having some success with this patch? I'm curious to know how your setup is structured, and what the keep-alive timeouts are configured to at every step in the network (the client is now 19s, your load balancer is ???, and presumably your RPC endpoint is the Solana official RPC which is set to 20s).
