Frequent OOMs and High CPU Usage While Serving Eth Calls #22567
Comments
We did some investigation of a similar issue here: #22529
@MariusVanDerWijden These are unfortunately public RPC requests; we do not have direct control over them. The linked issue is interesting, we do have frequent
Unfortunately there's no such flag. I'll take a look at your memdumps now
Yeah so what I think happens is that someone tries to use your node for frontrunning/backrunning. There are a lot of
Yeah nvm we can not use
Thanks for looking into it, that explains the constant OOMs that we do see. However, do you know if there was any PR in the update to v1.10.x that might have made this worse? We weren't having issues with v1.9.25, while this came up almost immediately when we updated to v1.10.1.
Not sure, it could also be something that changed within the standard library between Go versions. So flate allocates 2^16 bytes on every call to
This means we allocate (only for compression) roughly 1 MB of data.
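For illustration, here is a standalone Go sketch (not geth code) of the per-writer allocation cost being described; the exact numbers depend on the Go version:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"runtime"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	// The underlying flate compressor (its 64 KB window plus hash chains)
	// is only allocated on the first Write, so a fresh writer per response
	// adds tens of kilobytes of garbage per request.
	w.Write([]byte("hello gzip"))
	w.Close()

	runtime.ReadMemStats(&after)
	fmt.Printf("~%d KB allocated for one gzip writer\n",
		(after.TotalAlloc-before.TotalAlloc)/1024)
}
```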
That would make sense; as a short-term solution we can drop gzip requests. Although I looked at the implementation and we do use a sync pool to store the relevant gzip writers, so these writers are re-used for the most part, or at least should be for the next few RPC requests (with gzip). The fact that memory usage spikes very quickly even with the re-use of older writers seems odd. Is there any flag I can use to debug this further?
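For context, the pooling pattern being described looks roughly like the following minimal sketch (illustrative names, not geth's actual code):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

// Hypothetical pool, illustrating the pattern; geth's own code differs in detail.
var gzipPool = sync.Pool{
	New: func() interface{} { return gzip.NewWriter(nil) },
}

func compressResponse(payload []byte) []byte {
	w := gzipPool.Get().(*gzip.Writer)
	defer gzipPool.Put(w)

	var buf bytes.Buffer
	w.Reset(&buf) // reuse the flate state held by the pooled writer
	w.Write(payload)
	w.Close()
	return buf.Bytes()
}

func main() {
	fmt.Println(len(compressResponse([]byte(`{"jsonrpc":"2.0","id":1,"result":"0x0"}`))))
}
```

Note that a sync.Pool still grows to roughly one writer per concurrent request and is emptied on GC, so a burst of gzip-encoded requests can drive allocations up even with re-use.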
Not really, but you can always dump the current memory with
Edit: What would really help would be if we had more information about the calls that get executed.
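One possible way to capture such a dump, assuming the node is running with --pprof (which exposes the standard net/http/pprof handlers, by default on 127.0.0.1:6060), is to fetch the heap profile over HTTP; a minimal sketch:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Sketch only: assumes geth was started with --pprof, so the standard
// net/http/pprof handlers are reachable (default 127.0.0.1:6060).
func main() {
	resp, err := http.Get("http://127.0.0.1:6060/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap_profile.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote heap_profile.pb.gz")
}
```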
@nisdas I made a test fix here: https://github.com/MariusVanDerWijden/go-ethereum/tree/rpc-test maybe you can try running this and see if the memory blowup still persists |
Let me see what I can do; we can try to capture all the incoming requests and see which of them are responsible for the blowups.
Great, thanks! Will give this a shot.
Here's a brief log of requests made to our nodes: requests.log.gz. These requests were captured over a span of less than 5 minutes. It seems that someone is abusing our endpoint to run Swarm Bee nodes, or something similar to that.
@MariusVanDerWijden From the logs posted above, it does seem to be some users running a Swarm Bee node and requesting a lot of transactions by hash. Would the newly introduced change of transaction unindexing have an effect here? We initially ran our nodes without any special flags, but after having difficulties keeping it up we added in these flags:
So the transaction re-indexing was introduced later.
From that request-log, |
That makes sense and would explain the greatly increased CPU usage here. Each transaction request hits the db, which leads to a db lookup plus deserialization from RLP. I'm not sure why it only became an issue on v1.10.x for us, but from what we have captured it does look like the large volume of external RPC requests is the main issue. We should either be dropping or rate-limiting them. I will close the issue now unless something else pops up on our end. Thanks for helping us investigate this issue @MariusVanDerWijden @holiman.
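For illustration, each of the offending requests boils down to something like the following client call (placeholder endpoint and hash); on the node side every such call costs a transaction-index lookup plus an RLP decode of the stored transaction:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Placeholder endpoint; in the abuse case this was the public RPC URL.
	client, err := ethclient.Dial("http://127.0.0.1:8545")
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder hash; each lookup makes the node resolve the hash via its
	// transaction index and RLP-decode the stored transaction.
	hash := common.HexToHash("0x0000000000000000000000000000000000000000000000000000000000000000")
	tx, pending, err := client.TransactionByHash(context.Background(), hash)
	if err != nil {
		log.Fatal(err) // returns ethereum.NotFound for an unknown hash
	}
	fmt.Println(tx.Hash(), pending)
}
```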
hey guys, any update on the CPU consumption here? I am also facing the same issue, the CPU is very high, 80-90% of the server. I am using 8 cores, 32 GB RAM and a 700 GB SSD. @nisdas do you have a solution for this yet?
@binhgo Our solution was to terminate our public API offering. The root cause was abuse of our free service for the Goerli testnet.
System information
Geth version:
v1.10.1
OS & Version:
Linux
Expected behaviour
After the geth node is synced, it is able to serve any incoming eth_call requests without issues.
Actual behaviour
The geth node frequently OOMs the moment it starts serving eth_call requests, with execution taking a very long time and eventually timing out. The node is killed as it OOMs, with memory usage spiking from 1 GB to 10-12 GB in a matter of seconds. This behaviour wasn't observed in earlier releases (1.9.25 and earlier).
Steps to reproduce the behaviour
Run geth with the following flags
We have reverted all the big changes from v1.10 onwards; however, it hasn't made a difference.
Backtrace
These are the last logs before it gets killed due to an OOM. A restart doesn't help, as the node goes through this whole process again and gets killed within the next few minutes while serving an eth_call. While it is expected for our geth node to have higher than normal memory usage due to serving public RPC requests, this hasn't come up before for us, which is the reason this issue has been opened.
This is the heap profile of geth right before it gets killed.
This is the CPU profile right before it gets killed.
From both of the figures above, it appears that serving these RPC requests puts great stress on the node, with large increases in memory usage coming from encoding the responses to these requests.
These are the raw profiles in case they help debug this further:
cpu_profile.pb.gz
heap_profile.pb.gz