
fix: reduce Connection keep-alive timeout to 1 second fewer than the Solana RPC's keep-alive timeout #29130

Merged
merged 5 commits into solana-labs:master on Dec 19, 2022
Conversation

@steveluscher (Contributor) commented Dec 6, 2022

Problem

When contacting an RPC that's behind a load balancer, clients can often send an RPC request down a free socket, only to discover that socket has since been disposed of. These requests will fail.

Based on this excellent article on tuning keep-alive, this is what I think I've learned.

  • The underlying HTTP library that the Solana RPC uses (hyper) has a default keep-alive timeout of 20s.
  • Typical Node.js servers have a default keep-alive timeout of 5s.
  • When the RPC is behind a load balancer, a higher ‘free socket timeout’ in the load balancer can result in the RPC closing the socket, but the load balancer (ergo, the client) thinking that it's still open. The next request will fail.

I believe the solutions to be as follows:

  1. Let people supply their own agents, or disable agents altogether, should they like to do some tuning (#29125, "feat: you can now supply your own HTTP agent to a web3.js Connection"). A sketch of what that could look like follows this list.
  2. Reduce the timeout of our default agent to the Solana RPC's timeout minus one second (20s - 1s = 19s). That is this PR, #29130 ("fix: reduce Connection keep-alive timeout to 1 second fewer than the Solana RPC's keep-alive timeout").
  3. RPC providers should maybe do the same – setting their load balancer timeouts to 1 second less than the Solana RPC's timeout (20s - 1s = 19s).
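
Hypothetically, supplying your own agent (or disabling it) via #29125 might look something like the sketch below. The `httpAgent` option name and shape are assumptions based on that PR's description, not a guaranteed final API:

```ts
import { Agent as HttpsAgent } from 'https';
import { Connection } from '@solana/web3.js';

// Sketch only: assumes the `httpAgent` config option described in #29125.
// Tune the agent to whatever keep-alive behaviour your RPC setup needs.
const connection = new Connection('https://api.mainnet-beta.solana.com', {
  commitment: 'confirmed',
  httpAgent: new HttpsAgent({ keepAlive: true, timeout: 19_000 }),
  // ...or opt out of socket reuse entirely:
  // httpAgent: false,
});
```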

Summary of Changes

  • Replaced our custom agent implementation with the agentkeepalive module.
  • Set the free socket timeout to the Solana RPC's timeout minus one second (20s - 1s = 19s). A rough sketch of the resulting default agent follows below.
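
As a rough illustration only (constant names here are illustrative, not copied from the diff), the new default agent construction amounts to something like this:

```ts
import { HttpsAgent as HttpsKeepAliveAgent } from 'agentkeepalive';

// Illustrative constants; the real code may name these differently.
const SOLANA_RPC_KEEP_ALIVE_TIMEOUT_MS = 20_000; // hyper's default on the RPC side
const FREE_SOCKET_TIMEOUT_MS = SOLANA_RPC_KEEP_ALIVE_TIMEOUT_MS - 1_000; // 19s

// Close idle sockets on the client 1s before the RPC would close them, so we
// never try to reuse a socket that the server has already torn down.
const defaultAgent = new HttpsKeepAliveAgent({
  freeSocketTimeout: FREE_SOCKET_TIMEOUT_MS,
});
```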

Fixes #27859, hopefully.

github-actions bot added the web3.js (Related to the JavaScript client) label on Dec 6, 2022
steveluscher marked this pull request as ready for review on December 6, 2022 at 22:29
codecov bot commented Dec 6, 2022

Codecov Report

Merging #29130 (9904db5) into master (f1427dd) will decrease coverage by 0.1%.
The diff coverage is 33.3%.

@@            Coverage Diff            @@
##           master   #29130     +/-   ##
=========================================
- Coverage    76.7%    76.5%   -0.2%     
=========================================
  Files          55       54      -1     
  Lines        3140     3119     -21     
  Branches      472      468      -4     
=========================================
- Hits         2410     2388     -22     
- Misses        565      567      +2     
+ Partials      165      164      -1     

@steveluscher (Contributor, Author)

Thoughts, @dancamarg0, @linuskendall, @brianlong?

@steveluscher (Contributor, Author)

@0xCactus, @y2kappa, would you like a strategy to test this out before I land it? You can either patch in the changes yourself, or I can publish a release candidate package just for you, that you can add to your package.json.

@0xCactus commented Dec 7, 2022

@steveluscher Yup, I can help test it out. I'd prefer a release candidate package.

@steveluscher (Contributor, Author)

It's all yours, @0xCactus. You can yarn add @solana/web3.js@pr-29130.

@0xCactus commented Dec 7, 2022

Getting the following error, @steveluscher:

{"level":50,"time":1670443877071,"pid":28,"hostname":"4169baa11267","service":"monitor","source":"pino ","error":{},"type":"account","msg":"monitor crashed: Error: failed to get info about account H6ARHf6YXhGYeQfUzQNGk6rDNnLBQKrenN712K4AQJEG: TypeError [ERR_INVALID_PROTOCOL]: Protocol \"https:\" not supported. Expected \"http:\""}

@steveluscher (Contributor, Author)

Damnit. My bad. I'll publish a new package for you in a moment.

@steveluscher (Contributor, Author)

Sorry for the delay @0xCactus. Try yarn add @solana/web3.js@1.70.0-pr-29130-2

@steveluscher (Contributor, Author)

Alright, @0xCactus has this running in production at the moment. Let's give it a few days and see if there's improvement. Anyone else is free to pull down that version and give it a shot too!

@0xCactus

@steveluscher FYI, our service has been relatively more stable since switching to the change from this PR, though it didn't seem to completely fix the issue. Yesterday @dancamarg0 tweaked the keep-alive on their server to 19s, and we saw the socket hang-up issue occurring on the client side consistently, every minute.

@dancamarg0

Just to point out: the current keep-alive timeout is set to 60s. That's the setting we've been running for weeks for @0xCactus.

@steveluscher (Contributor, Author)

Wait, so here's what I think I got from the two messages above:

  • Triton One did not change the keep-alive timeout on your RPC; it's still 60s.
  • That notwithstanding, you were running smoothly until yesterday when the hangup issue re-appeared, worse than before.

Are those two statements accurate?

@dancamarg0

  1. Yes. Triton One did not change the keep-alive timeout on your RPC; it's still 60s.

  2. He was running smoothly until yesterday, when I changed the keep-alive timeout from 60s to 19s as per your recommendation.

So a higher timeout actually seems to fix the issue after your patch, apparently? I'll also collect some TCP data to see how often we see RST packets on 0xCactus's RPC now, and get back if I have any interesting data.

@steveluscher (Contributor, Author)

So a higher timeout actually seems to fix the issue after your patch apparently?

Interesting! When it comes right down to it, the goal is for the client's timeout to be just a bit less than any of the other timeouts in the chain, so that the client always gives up on the free connection before any of the middle or end pieces do.

Maybe there's a chance that, if both the client and the load balancer are set to 19s, there's a small window of time in which the load balancer's side of the connection is dead but the client thinks it's still alive. You probably don't want to keep doing my testing in production for me, but it would be really interesting if you set the load balancer to match the RPC server, @dancamarg0 (20s), and see if that performs just as well as 60s.

@steveluscher (Contributor, Author)

Alright. I'm shipping this. Thanks for everyone's contributions here.

  1. Customers saw success with this change and particular configurations of their RPC endpoints, success being the disconnection rate falling to zero.
  2. Those whose RPC endpoint configuration is still incompatible with these defaults now have the option to supply their own agent, or to turn off keep-alive altogether, using #29125 ("feat: you can now supply your own HTTP agent to a web3.js Connection").

@steveluscher steveluscher merged commit 456a819 into solana-labs:master Dec 19, 2022
@steveluscher steveluscher deleted the set-upper-limit-on-agent-keepalive branch December 19, 2022 18:22
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Jan 4, 2023
…Solana RPC's keep-alive timeout (solana-labs#29130)

* Delete `AgentManager`

* Replace custom `http.Agent` implementation with `agentkeepalive` package

* Set the default free socket timeout to 1s less than the Solana RPC's default timeout

* Add link to particular issue comment

* Create the correct flavor of default agent for http/https
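
For reference, the last commit above (creating the correct flavor of default agent for http vs. https) appears to be what resolved the ERR_INVALID_PROTOCOL error reported earlier in the thread. A minimal sketch of the idea, using an illustrative helper name rather than the code as merged:

```ts
import HttpKeepAliveAgent, { HttpsAgent as HttpsKeepAliveAgent } from 'agentkeepalive';

const FREE_SOCKET_TIMEOUT_MS = 19_000; // 1s less than the RPC's 20s keep-alive

// Illustrative helper: the agent flavor has to match the endpoint's protocol;
// a mismatch can surface as ERR_INVALID_PROTOCOL, as reported above.
function createDefaultAgent(endpointUrl: string) {
  const options = { freeSocketTimeout: FREE_SOCKET_TIMEOUT_MS };
  return new URL(endpointUrl).protocol === 'https:'
    ? new HttpsKeepAliveAgent(options)
    : new HttpKeepAliveAgent(options);
}
```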