
Frequent OOMs and High CPU Usage While Serving Eth Calls #22567

Closed
nisdas opened this issue Mar 24, 2021 · 18 comments

@nisdas
Contributor

nisdas commented Mar 24, 2021

System information

Geth version: v1.10.1
OS & Version: Linux

Expected behaviour

After the geth node is synced, it is able to serve any incoming eth_calls without issues.

Actual behaviour

The geth node frequently OOMs the moment it starts serving eth_calls, with execution taking a very long time and eventually timing out. The node is immediately killed as it OOMs, with memory usage spiking from 1 GB to 10-12 GB in a matter of seconds. This behaviour wasn't observed in earlier releases (v1.9.25 and earlier).

Steps to reproduce the behaviour

Run geth with the following flags

      --v5disc
      --http
      --http.addr=0.0.0.0
      --http.corsdomain=*
      --http.vhosts=*
      --ws
      --ws.addr=0.0.0.0
      --ws.origins=*
      --datadir=/ethereum
      --keystore=/keystore
      --verbosity=3
      --light.serve=50
      --metrics
      --ethstats=/ethstats_secret
      --goerli
      --snapshot=false
      --txlookuplimit=0
      --cache.preimages
      --rpc.allow-unprotected-txs
      --pprof

We have reverted all the big changes from v1.10 onwards; however, it hasn't made a difference.

Backtrace

INFO [03-24|11:39:23.079] Loaded local transaction journal         transactions=3130 dropped=0
INFO [03-24|11:39:23.092] Regenerated local transaction journal    transactions=3130 accounts=536
WARN [03-24|11:39:23.092] Switch sync mode from fast sync to full sync 
INFO [03-24|11:39:23.140] Unprotected transactions allowed 
WARN [03-24|11:39:23.141] Old unclean shutdowns found              count=141
WARN [03-24|11:39:23.141] Unclean shutdown detected                booted=2021-03-24T10:59:08+0000 age=40m15s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:04:52+0000 age=34m31s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:10:32+0000 age=28m51s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:14:56+0000 age=24m27s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:18:12+0000 age=21m11s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:21:23+0000 age=18m
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:23:03+0000 age=16m20s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:24:49+0000 age=14m34s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:27:13+0000 age=12m10s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:31:47+0000 age=7m36s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:35:22+0000 age=4m1s
INFO [03-24|11:39:23.142] Allocated cache and file handles         database=/ethereum/geth/les.server        cache=16.00MiB  handles=16
INFO [03-24|11:39:23.174] Configured checkpoint oracle             address=0x18CA0E045F0D772a851BC7e48357Bcaab0a0795D signers=5 threshold=2
INFO [03-24|11:39:23.185] Loaded latest checkpoint                 section=136 head="cb1485…e62327" chtroot="4fb6c4…a2b521" bloomroot="443066…099a79"
INFO [03-24|11:39:23.185] Starting peer-to-peer node               instance=Geth/v1.10.1-stable-c2d2f4ed/linux-amd64/go1.16
INFO [03-24|11:39:23.276] New local node record                    seq=470 id=xxxxxxxxxxxx ip=127.0.0.1 udp=30303 tcp=30303
INFO [03-24|11:39:23.305] Started P2P networking                   self=enode://xxxxxxxx@127.0.0.1:30303
INFO [03-24|11:39:23.306] IPC endpoint opened                      url=/ethereum/geth.ipc
INFO [03-24|11:39:23.308] HTTP server started                      endpoint=[::]:8545 prefix= cors=* vhosts=*
INFO [03-24|11:39:23.308] WebSocket enabled                        url=ws://[::]:8546
INFO [03-24|11:39:23.308] Stats daemon started 
INFO [03-24|11:39:33.721] Looking for peers                        peercount=1 tried=35 static=0
INFO [03-24|11:39:37.922] Block synchronisation started 
INFO [03-24|11:39:38.849] New local node record                    seq=471 id=xxxxxxx ip=xxxxxxxxx udp=xxxxx tcp=30303
WARN [03-24|11:39:39.683] Served eth_call                          conn=127.0.0.1:49070 reqid=107580 t=6.568050388s err="execution aborted (timeout = 5s)"
WARN [03-24|11:39:43.317] Served eth_call                          conn=127.0.0.1:53254 reqid=1      t=10.147648933s err="execution aborted (timeout = 5s)"
INFO [03-24|11:39:43.987] Looking for peers                        peercount=1 tried=24 static=0
WARN [03-24|11:39:46.515] Served eth_call                          conn=127.0.0.1:35870 reqid=1      t="559.32µs"    err="execution reverted"
WARN [03-24|11:39:46.622] Served eth_call                          conn=127.0.0.1:60854 reqid=1      t="479.825µs"   err="execution reverted"
INFO [03-24|11:39:46.683] Submitted transaction                    hash=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from=xxxxxxxxxxxxxxxxxxxxx nonce=xxxx recipient=xxxxxxxxxxxxxxxxxx value=0
WARN [03-24|11:39:47.966] Served eth_call                          conn=127.0.0.1:42870 reqid=290669 t=10.325244195s err="execution aborted (timeout = 5s)"
WARN [03-24|11:39:48.460] Failed to retrieve stats server message  err="websocket: close 1006 (abnormal closure): unexpected EOF"
WARN [03-24|11:39:51.966] Served eth_estimateGas                   conn=127.0.0.1:60412 reqid=3971546 t=5.387361441s  err="execution aborted (timeout = 0s)"
WARN [03-24|11:39:54.181] Full stats report failed                 err="use of closed network connection"

These are the last logs before it gets killed due to an OOM. A restart doesn't help, as the node goes through this whole process again and gets killed within the next few minutes while serving an eth_call. While it is expected for our geth node to have higher-than-normal memory usage due to serving public RPC requests, this hasn't come up before for us, which is why this issue has been opened.

This is the heap profile of geth right before it gets killed (screenshot: geth_heap).

This is the CPU profile right before it gets killed (screenshot: geth_profile).

From both of the above figures, it appears that serving these RPC requests puts significant stress on the node, with large increases in memory usage coming from encoding the responses to the requests.

These are the raw profiles, in case they help you debug this further:

cpu_profile.pb.gz

heap_profile.pb.gz

@MariusVanDerWijden
Member

We did some investigation of a similar issue here: #22529
Can you provide some example eth_calls that you are running on your node?

@nisdas
Contributor Author

nisdas commented Mar 24, 2021

@MariusVanDerWijden These are unfortunately public RPC requests; we do not have direct control over them. The linked issue is interesting, as we do have frequent eth_getLogs requests also running through our endpoint. I tried running with trace logs, but they mostly showed eth_getTransactionByHash and eth_getTransactionReceipt in the debug output, which matches what is shown in the above profile. Is there a flag that can be used to capture all the incoming RPC calls?

@MariusVanDerWijden
Member

Unfortunately there's no such flag. I'll take a look at your memdumps now.

@MariusVanDerWijden
Member

Yeah, so what I think happens is that someone is trying to use your node for frontrunning/backrunning. There are a lot of eth_getTransaction and eth_getTransactionReceipt calls. The return data for these calls is pretty huge and is therefore compressed using gzip. Gzip internally uses flate compression, which is pretty memory intensive; see also golang/go#32371. Since the gzip writer is closed after every request, the memory for the flate compression has to be allocated anew on every request. I think we can use writer.Flush() instead of writer.Close(), which might improve memory consumption a bit, but I have to look into it further.

@MariusVanDerWijden
Member

Yeah, nvm, we cannot use writer.Flush() instead of writer.Close().

@nisdas
Contributor Author

nisdas commented Mar 24, 2021

Thanks for looking into it; that explains the constant OOMs that we see. However, do you know if there was any PR in the update to v1.10.x that might have made this worse? We weren't having issues with v1.9.25, while this came up almost immediately after we updated to v1.10.1.

@MariusVanDerWijden
Member

MariusVanDerWijden commented Mar 24, 2021

Not sure; it could also be something that changed within the standard library between Go versions.
As a super quick fix, you could drop all requests that have Accept-Encoding: gzip in their header.
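
For illustration, such a filter could live in a small reverse proxy in front of the node. This is only a sketch, not something shipped with geth, and the listen address and upstream URL are placeholders:

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder upstream: the node's local HTTP-RPC endpoint.
	upstream, _ := url.Parse("http://127.0.0.1:8545")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Strip the header so the node never gzips its responses.
		r.Header.Del("Accept-Encoding")
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}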

So flate allocates 2^16 bytes on every call to initDeflate and another 4*(2^15+1) bytes, resulting in 196612 bytes per init, or ~192 KB.
More memory is also retained by the writer object, whose embedded compressor struct has two uint32 slices with 2^17 and 2^15 elements, thus another 655360 bytes.

ROUTINE ======================== compress/flate.NewWriter in compress/flate/deflate.go
  610.83MB   752.18MB (flat, cum) 42.00% of Total
         .          .    665:// compression efficiency.
         .          .    666://
         .          .    667:// If level is in the range [-2, 9] then the error returned will be nil.
         .          .    668:// Otherwise the error returned will be non-nil.
         .          .    669:func NewWriter(w io.Writer, level int) (*Writer, error) {
  610.83MB   610.83MB    670:	var dw Writer
         .   141.35MB    671:	if err := dw.d.init(w, level); err != nil {
         .          .    672:		return nil, err
         .          .    673:	}
         .          .    674:	return &dw, nil
         .          .    675:}
ROUTINE ======================== compress/flate.(*compressor).initDeflate in compress/flate/deflate.go
  138.85MB   138.85MB (flat, cum)  7.75% of Total
         .          .    367:	d.err = d.w.err
         .          .    368:	d.windowEnd = 0
         .          .    369:}
         .          .    370:
         .          .    371:func (d *compressor) initDeflate() {
   65.96MB    65.96MB    372:	d.window = make([]byte, 2*windowSize)
         .          .    373:	d.hashOffset = 1
   72.89MB    72.89MB    374:	d.tokens = make([]token, 0, maxFlateBlockTokens+1)
         .          .    375:	d.length = minMatchLength - 1
         .          .    376:	d.offset = 0
         .          .    377:	d.byteAvailable = false
         .          .    378:	d.index = 0
         .          .    379:	d.hash = 0

This means we allocate (only for compression) roughly 1 MB of data.

b.Run("gzip2", func(b *testing.B) {
			b.ReportAllocs()
			bb := &bytes.Buffer{}
			for i := 0; i < b.N; i++ {
				w := gzip.NewWriter(nil)
				w.Reset(bb)
				w.Write([]byte{0x00})
				w.Close()
			}
		})

BenchmarkCompress/writer/gzip2-16   	    6040	    186304 ns/op	  813977 B/op	      18 allocs/op

@nisdas
Contributor Author

nisdas commented Mar 25, 2021

That would make sense; as a short-term solution we can drop gzip requests. Although, I looked at the implementation and we do use a sync pool to store the relevant gzip writers (roughly the pattern sketched below), so these writers are re-used for the most part, or at least should be for the next few RPC requests that accept gzip. The fact that memory usage spikes very quickly even with the re-use of older writers seems odd. Is there any flag I can use to debug this further?
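
To be concrete, this is the kind of pooling I mean, written as a standalone sketch rather than the actual go-ethereum code (the package and function names are just placeholders):

package rpcpool

import (
	"compress/gzip"
	"io"
	"sync"
)

// A pool of gzip writers so the expensive flate state is reused across
// responses instead of being reallocated for every request.
var gzPool = sync.Pool{
	New: func() interface{} { return gzip.NewWriter(nil) },
}

func writeGzipped(dst io.Writer, payload []byte) error {
	gz := gzPool.Get().(*gzip.Writer)
	defer gzPool.Put(gz)

	gz.Reset(dst)
	if _, err := gz.Write(payload); err != nil {
		return err
	}
	// Close flushes the gzip footer; the writer remains reusable after the
	// next Reset, which is what makes the pooling worthwhile.
	return gz.Close()
}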

@MariusVanDerWijden
Member

MariusVanDerWijden commented Mar 25, 2021

Not really, but you can always dump the current memory with debug.writeMemProfile("file") in the console.
Yeah, my hunch is that something prevents the GC from deleting the objects from the sync pool once they are no longer used.
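
For anyone following along, that looks roughly like this from an attached console (the file paths here are placeholders):

$ geth attach /ethereum/geth.ipc
> debug.writeMemProfile("/tmp/geth-heap.prof")
$ go tool pprof /tmp/geth-heap.prof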

Edit: What would really help would be if we had more information about the calls that get executed

@MariusVanDerWijden
Member

@nisdas I made a test fix here: https://github.com/MariusVanDerWijden/go-ethereum/tree/rpc-test. Maybe you can try running it and see if the memory blowup still persists.

@nisdas
Contributor Author

nisdas commented Mar 26, 2021

Edit: What would really help would be if we had more information about the calls that get executed

Let me see what I can do; we can try to capture all the incoming requests and see which of them are responsible for the blowups.

@nisdas I made a test fix here: https://github.com/MariusVanDerWijden/go-ethereum/tree/rpc-test. Maybe you can try running it and see if the memory blowup still persists.

Great, thanks! Will give this a shot.

@prestonvanloon
Contributor

Here's a brief log of requests made to our nodes: requests.log.gz. These requests were captured over a span of less than 5 minutes. It seems that someone is abusing our endpoint to run Swarm Bee nodes, or something similar.

@nisdas
Contributor Author

nisdas commented Mar 31, 2021

@MariusVanDerWijden From the logs posted above, it does seem to be some users running a Swarm Bee node and requesting a lot of transactions by hash. Would the newly introduced transaction unindexing have an effect here? We initially ran our nodes without any special flags, but after having difficulties keeping them up we added these flags:

--snapshot=false 
--txlookuplimit=0 
--cache.preimages 
--rpc.allow-unprotected-txs 

So the transaction re-indexing was introduced later.

@holiman
Contributor

holiman commented Mar 31, 2021

From that request log, there are 152108 calls to eth_getTransactionByHash and 52706 to eth_getTransactionReceipt. If the txlookup index does not cover the txhash, then it should exit pretty quickly -- but each one still causes a db lookup.

@nisdas
Contributor Author

nisdas commented Mar 31, 2021

That makes sense and would explain the greatly increased CPU usage here: each transaction request hits the db, leading to a db lookup plus deserialization from RLP. I'm still not sure why this only became an issue on v1.10.x for us, but from what we have captured it does look like the large number of external RPC requests is the main issue. We should be either dropping or rate limiting them (roughly along the lines of the sketch below). I will be closing the issue now unless something else pops up on our end. Thanks for helping us investigate this issue @MariusVanDerWijden @holiman.
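
For anyone hitting the same problem, the kind of per-IP rate limiting we have in mind would sit in front of the HTTP endpoint and look roughly like this. It is only a sketch using golang.org/x/time/rate; the upstream address and the limits are placeholders, not values we run in production:

package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{} // one token bucket per client IP
)

func limiterFor(ip string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	if l, ok := limiters[ip]; ok {
		return l
	}
	l := rate.NewLimiter(rate.Limit(20), 40) // 20 req/s with a burst of 40 (illustrative numbers)
	limiters[ip] = l
	return l
}

func main() {
	upstream, _ := url.Parse("http://127.0.0.1:8545") // placeholder: the node's HTTP-RPC endpoint
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, _ := net.SplitHostPort(r.RemoteAddr)
		if !limiterFor(ip).Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		proxy.ServeHTTP(w, r)
	})))
}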

@binhgo

binhgo commented Jun 30, 2021

Hey guys, any update on the CPU consumption here? I am also facing the same issue; the CPU is very high, at 80-90% of the server. I am using 8 cores, 32 GB of RAM and a 700 GB SSD now.

@nisdas do you have a solution for this yet?

@prestonvanloon
Contributor

@binhgo Our solution was to terminate our public API offering. The root cause was abuse of our free service for the goerli testnet.

@lurchinms

I am running into the same issue... (screenshot: ETH_Node)
