Timeout from geth #25809

Closed
schonex opened this issue Sep 18, 2022 · 9 comments

@schonex

schonex commented Sep 18, 2022

System information

Geth version: 1.10.25
OS & Version: Ubuntu 22.04

Expected behaviour

Prysm runs into sporadic timeouts, roughly 3 such events a day with about 3 requests per event. I should note that for the full 24 hours before geth was moved from one disk array to another, this didn't seem to happen. It's hard to pinpoint whether the disk array change is responsible or whether the network simply causes geth to take longer to process requests.

Geth is running with a cache size of 12288.
No errors or non-standard log entries were reported by geth around the time of this message.

Actual behaviour

*Time in UTC
time="2022-09-18 19:38:20" level=error msg="Could not reconstruct full bellatrix block batch from blinded bodies" error="could not fetch execution blocks with txs by hash [0x83eda8c5d447c958100c731fd38c6bca6ddf01e0cd620879050db3c3d5a97cf6 0x40131025b7b98c9959228abce59f9ae891aabcc91f9783575d6e4db6ac4dc8a6 0x392bada0a68332c4cd7c7a84d78190dbc2a5563ad2f94d62422ac1bc5187a8e1 0xcda32c2076e6d69e5f966a870607701c3d87bd2dc672c02bf8ab5e617b364f78 0xf881eece58168bdc213c380805ee9ca7566505e4dc1b729f735ced23a740cdc4 0x0001c7b415467fa8702b7653c315dc055cb4df36cade242bbe9aad0ff83ebe69 0xe2c1d566de2a8bcc214e0c6686f530f74974deec456e98240987b3ff969c4a08 0x82be204da0acd634148481d0a7e103170eb8ea3269bf51cc809d3aa346329252 0xebad689fcd611bdc5b4713b23a3beb54b07bab366c3501188a068142735ce765 0xfb3e47dd6ccac9ab8dbf14a9b7ab1ea12629660fe10fcd12bbfd8bde62dfcfba 0xf573ddde4dcafdfd8da260fa0c434842a30d5ede5281984372a74e5d209d5b51 0x8ba0b948dc48f6cdbbee0906bab5272f74cfe0f1632bb840f5be1de1eb29e589 0x660eb545caefd0c44bc61315372c8539b961cf915e5009868c55601cc724a82d 0xd8605318e88d475a2e5c8fc5833c43a8cc8ec322b76c7b57090fd8461fbd7caa 0xd2a45560f6d782db801e71781f5c701dab9ccf3d85b49a118310b4ea8373ab97 0x472dd188300c384860c5c896fe0b299904367766d6348d0c34597e34bb3cb98d 0x06510dd32153a466641a8c64013c2079a3e7525dd44da35e1d2d8b60fc40d0db 0x1c93e189c3eb57b73d46083c91cc7dcfd6272fb076746a5a21141749cdd682b4 0xc8b6ecd62c7d65c89e501f5116946c5f9d014fc33490e7498e6dd38467b7c744 0x03ebfa7eb1d3bd64dd71d9bad38c0238e7dfcc5994f402ddc9dc67bdb28fbce3 0x022bf80dafe9744f5071ddd816579b19eebab6e60958dd43559fd8370d9b9c46 0x946b0362a3fc2639d5d75d49cda0a487c85603d09f8969cd53328c2daeba162c 0x1fdb5f3e4f265e15f6e16e60e96f4c95bdaa9bf8715de90fc1ad60966da5a6df 0x050f4c3adcc38994df59280036f6090e4d93d4786e19b53183888f71004775ed 0xad656d7c857064e4a23eccb16d75db6058fb1e61e7e24c0abcae73e8a526ad60 0xd903de2ed6c0ad470b9d5940faf1a5273c7e35c4c4afa8df8d86fb35ca1c8e3c 0x12667d5bd31106914a6236479d51a910e94c1a74113bc7f8768671277a5aab4f 0x7892a2c2c363383ea578bc0f26e99ebfbec77dbef09936137af1280596e05ff9 0xbfe9fc9b4cc5ba2c64cd0e00694c89d3682a36ea970b364e1c7f4be72f1d9f12 0xccc1f2ee2e7b7075f37666b9b25de173e43dd179455fd1db8915a5ad7e7d1495 0x3fc609acb0ea63e46e434fb5b3396db9798d3492818c6c8b97a07a97d729c503 0x6d9567615abc0701ef617741ab19a2e7f7915717fbf527f342c6811e56ae4233 0x4110dcd31376025becae155e052270644c21f7d4f8d0bbc216f925dfc98df282 0x3e43aa5c77de34559eb6c2b5899071ec5fd7fc893436f64699cb5a05e87cc6af 0x3cd2bd8e3e6028c8e9cac662139438b15569b401ebdf6592bba75aa53f59797e 0xdb59521a9d95a117eedc2e5409a6c24bfca77bff7a6a2921eb478536a43ebbdb 0x8cb1f881535c583817a95d533bae0c5664453d6cd1d1e03d908b8a7524e8e58b 0x3b2dad862b4757d6e531d66e9195d3deacf7f5c985235e41da92e86c74cfdbfb 0x54556b6c5e97ff5de92803ecfcc52ce43d79c544930596f508c1d878cfe5224b 0x25d64d0fc99c654776161c5ce93f91518dc1c8ca0d1759d22390e07ecc87466e 0x4ca1167e83ebc2bd278c484b603df33279bfd94556dcd76857970f876f62703b 0xb2f15094244e3a8594a69f946788c32fdd5b9cf43c05a768a92bf4035a828a27 0x915659451be69b183ed85c2b7608ab10d85815984dbd5b39c38436f121cdd0e8 0x144ddd6f1da40c774022dd1076ad77d4b92303b52c043f6d57d081270c07f386 0xec57ca2186d392bf2f318221d74f657d063e47484f4129bed9956d93b0f14081 0xaee79a7fd50da818fec6903b226fb8ac053c179aa8ef7a95460fcc9927bfbabe 0xf649b91e69def914165854d8172ed36631ca49448461546de878ba5c88c28d0a 0x9167c9442f21c0f3a4d48e1f1bfa38f71895ea0418281dc66a15cd22f725ddff 0x090faa53c0b838434f8b52ace33ca74e013917e824c6a11fa68de87ae674e224 0x2405edbc791a3dd80a004e09f1981d90337acaba32d2918cb78c5e12fcbac4e3 
0x88f4047359ef54d1a8d6e3af9420e4ba5d26582d2cfe3db945dc4cf7fabe9755 0x552404e44a51c1ed190f75972490596741267cefa316a51d31843536d013feaa 0x018b5cbb78809ba7721dec0b370cc1b72a09859875ca894d43918160d0fe7312 0x0ea95a5a99f3f4a168c98a713b60e9da878db39e753720bd378c93e6dc964159 0xd6e72fad6ffeafe98368cdef253a68b26b0722f512a59c1a51362747026ed504 0xcc33c56a4cf67ad6ccbd010169d81dcaddcc3ba947297bec9cda7eac8a66c13c 0xfd9c02a74efa430fe3bd4ec6b3bd0919d33038375744af2cf94c18c64035c77c 0x83e2c5cad3b41c85807dd04e11d1cf7d8a7370c85e972a9c8277cdb54af0c77d 0x8ff00c273df9b4a709e7f80949d7895c4fc64aa3441eac3fbd5860327b7f98dc 0x615a0c31cbcac2e1e277f26788ce0d3f300c6408bfe7bc55ab095205e3922d6c 0x0b010da982d9fd2d87077e3d470a2c232b798e954353012869a44bb5c5525f11 0xdd555d629e91a82aeb31f9942be43d5ab3ac149a0cfe2baeca0b1d0f43c34abb 0xf8addbeeeca63550e2e1f8604adbf3bdeee41e83a3f0283a49126002319a14d0 0xb9ed534b32def74765761c512442b40148634a57be9479d6ad5ee42cc28d65aa]: Post "http://ip_of_geth_machine:8551": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" prefix=sync
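For context on where that error text comes from: "net/http: request canceled (Client.Timeout exceeded while awaiting headers)" is emitted by Go's standard http.Client when its own Timeout elapses before the server returns response headers, i.e. the deadline is set on the calling (Prysm) side, not by geth. A minimal sketch, assuming a hypothetical 2-second timeout and omitting the JWT authentication that the real 8551 engine port requires:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The timeout value here is a hypothetical stand-in for whatever
	// deadline Prysm configures; it is not taken from Prysm's source.
	client := &http.Client{Timeout: 2 * time.Second}

	// First hash taken from the log above; fetching a block by hash with
	// full transaction bodies is the shape of request the error describes.
	req := bytes.NewBufferString(`{"jsonrpc":"2.0","id":1,"method":"eth_getBlockByHash","params":["0x83eda8c5d447c958100c731fd38c6bca6ddf01e0cd620879050db3c3d5a97cf6",true]}`)

	resp, err := client.Post("http://ip_of_geth_machine:8551", "application/json", req)
	if err != nil {
		// If geth answers later than client.Timeout, err wraps exactly:
		// "net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```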

@rjl493456442
Member

I'm not sure what Prysm's time allowance is. It might be better to ask the Prysm team how they handle timeouts and whether there is a retry mechanism for this case.

@schonex
Author

schonex commented Sep 19, 2022

The point of this bug isn't that Prysm incurs the timeout; it's that these timeouts started occurring after a slight change of environment, so the question is where the bottlenecks are. As an aside, the exact same timeouts occur to a much higher degree with Besu. So question #2 would be: is there a connection between the two?

Reference hyperledger/besu#4400

@karalabe
Member

What operation are you running? What call times out? You haven't really given us any info/logs from Geth, so it's very hard to even start thinking about where things might be going wrong.

@lukaszsamson

I'm seeing the same issue. I'm running geth and prysm on separate instances. Both are fully synchronised and sufficiently provisioned. The error happens ~100 times a day. There are no relevant entries in the geth log. I can't rule out a network partition, but it's happening too often for that.

@schonex
Author

schonex commented Oct 1, 2022

What operation are you running? What call times out? You haven't really given us any info/logs from Geth, so it's very hard to even start thinking about where things might be going wrong.

I'd be happy to provide anything you need if you explain how to get it. My normal geth logs show no errors, and nothing beyond the usual block header fetching before, during, and after these errors appear on the Prysm beacon chain client.

I assume Prysm makes its normal requests and geth takes too long to answer because it's hung up on something else.

I am not running anything special when this happens. And as @lukaszsamson mentioned, this occurs with more regularity than I'd hope for: not yet enough to affect vote performance for my consensus validator, but enough to ask here about what's causing it.

@MariusVanDerWijden
Member

I think the issue is compaction. NewPayload will wait for a compaction cycle to finish and might time out. (We merged a fix into master that makes the waiting a bit better)
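If that theory is right, the mechanism is ordinary lock contention rather than anything network-related. A toy sketch (not geth's actual code; every name here is hypothetical) of how a handler that must wait for a lock held during a long compaction cycle turns into a caller-side timeout:

```go
package main

import (
	"net/http"
	"sync"
	"time"
)

var dbLock sync.Mutex

// compact stands in for a long-running database compaction cycle
// that holds the lock for longer than the caller's deadline.
func compact() {
	dbLock.Lock()
	defer dbLock.Unlock()
	time.Sleep(30 * time.Second)
}

// newPayloadHandler stands in for the engine-API handler: it blocks on the
// lock until compaction finishes, which may be after the client gave up.
func newPayloadHandler(w http.ResponseWriter, r *http.Request) {
	dbLock.Lock()
	defer dbLock.Unlock()
	w.Write([]byte(`{"status":"VALID"}`))
}

func main() {
	go compact()
	http.HandleFunc("/", newPayloadHandler)
	http.ListenAndServe(":8551", nil)
}
```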

@holiman
Contributor

holiman commented Oct 3, 2022

(We merged a fix into master that makes the waiting a bit better)

I don't think it will, if you mean the mutex? It just makes it so that queued-up NewPayload updates are not processed multiple times.

I'd be happy to provide anything you need if you explain how to get it. My normal geth logs show no errors

You are reporting issues using Prysm output. We don't really know Prysm in much depth, so we have no idea what Prysm was doing when it encountered this error -- we don't know what it expected geth to do or how geth failed. So perhaps file this ticket over at Prysm instead? I'll give summoning them here a quick try, though: @terencechain @potuz

@potuz

potuz commented Oct 3, 2022

You are reporting issues using Prysm output. We don't really know Prysm in much depth, so we have no idea what Prysm was doing when it encountered this error -- we don't know what it expected geth to do or how geth failed. So perhaps file this ticket over at Prysm instead? I'll give summoning them here a quick try, though: @terencechain @potuz

The particular timeout reported in this issue arises when Prysm runs with a feature flag that discards execution payloads from its database. When a peer requests a historical batch, we request the payloads by hash from geth to reassemble the full blocks and serve them to the peer (a rough sketch of that batched by-hash fetch follows the list below). I have a few disjoint comments, so I'll leave them here in decreasing order of importance:

  • The PR linked above by @lukaszsamson should help with this particular timeout. I think it is safe to close this issue as "not a geth issue", since it's Prysm that sets the deadline here.
  • These timeouts do not affect client performance in any way: the client will fail to honor a batch request and one peer may downscore it slightly, but it should not be noticeable at all.
  • We are seeing many users hit a different timeout, on NewPayload during regular sync. Those are more serious, since they do affect client performance, and we believe the fix in master that Marius refers to helps there: all users reporting those timeouts also report either state-healing or database-compaction logs from geth. Please ping us if you need help with Prysm logs for these kinds of reports.
  • This particular timeout should disappear once the engine API adds methods to request blocks by range.
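A rough sketch of what such a by-hash reconstruction looks like from the caller's side, assuming go-ethereum's rpc client, a plain HTTP endpoint without the JWT handshake that the real 8551 port performs, and an illustrative 2-second deadline (the two hashes are taken from the log above). If the batch overruns the context, the whole reconstruction fails, which is the error in the original report:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	client, err := rpc.Dial("http://ip_of_geth_machine:8551")
	if err != nil {
		panic(err)
	}
	hashes := []string{
		"0x83eda8c5d447c958100c731fd38c6bca6ddf01e0cd620879050db3c3d5a97cf6",
		"0x40131025b7b98c9959228abce59f9ae891aabcc91f9783575d6e4db6ac4dc8a6",
	}
	// One eth_getBlockByHash call per payload hash, sent as a single batch.
	batch := make([]rpc.BatchElem, len(hashes))
	results := make([]json.RawMessage, len(hashes))
	for i, h := range hashes {
		batch[i] = rpc.BatchElem{
			Method: "eth_getBlockByHash",
			Args:   []interface{}{h, true}, // true = include full transactions
			Result: &results[i],
		}
	}
	// The deadline lives on the caller's side; geth never sees "a timeout",
	// it simply answers later than the consensus client was willing to wait.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := client.BatchCallContext(ctx, batch); err != nil {
		fmt.Println("batch failed:", err)
	}
}
```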

@holiman
Contributor

holiman commented Oct 7, 2022

I think it is safe to close this issue as "not a geth issue", since it's Prysm that sets the deadline here.

Thanks a lot @potuz !

@holiman holiman closed this as completed Oct 7, 2022