Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC node falls behind since >=v11 when queried #2733

Open
5 tasks
exp0nge opened this issue Sep 16, 2023 · 6 comments
Open
5 tasks

RPC node falls behind since >=v11 when queried #2733

exp0nge opened this issue Sep 16, 2023 · 6 comments
Labels
status: waiting-triage This issue/PR has not yet been triaged by the team. type: bug Issues that need priority attention -- something isn't working

Comments

@exp0nge
Copy link

exp0nge commented Sep 16, 2023

Summary of Bug

We have noticed severe performance degradation since v11 where nodes we query against fall behind very quickly. Attached below are some logs. This behavior is not notice-able in non-RPC mode nodes (nodes that are not being queried).

2:29PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=10 numToDial=0
2:29PM INF Timed out dur=3000 height=16902284 module=consensus round=0 step=3
2:29PM INF commit synced commit=436F6D6D697449447B5B323520363120323232203133372032353220323531203133352032353220313237203239203139362031303320363820353220393420373320323238203130382031373620323231203234312031313320
2:29PM INF committed state app_hash=193DDE89FCFB87FC7F1DC46744345E49E46CB0DDF17162ACC7A3BD207F3318D0 height=16902284 module=state num_txs=5
2:29PM INF Timed out dur=-542.021502 height=16902285 module=consensus round=0 step=1
2:29PM INF indexed block exents height=16902284 module=txindex
2:29PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=A9AE18C2422F1781EC1046E6B11EF70A3E2E679FD2301E33C2622AF4D09850FF commit_round=0 height=16902285 module=consensus proposal={}
2:29PM INF received complete proposal block hash=A9AE18C2422F1781EC1046E6B11EF70A3E2E679FD2301E33C2622AF4D09850FF height=16902285 module=consensus
2:29PM INF finalizing commit of block hash={} height=16902285 module=consensus num_txs=1 root=193DDE89FCFB87FC7F1DC46744345E49E46CB0DDF17162ACC7A3BD207F3318D0
2:29PM INF minted coins from module account amount=11906530uatom from=mint module=x/bank
2:29PM INF client state updated client-id=07-tendermint-913 height=1-5327682 module=x/ibc/client
2:29PM INF packet acknowledged dst_channel=channel-0 dst_port=transfer module=x/ibc/channel sequence=166437 src_channel=channel-391 src_port=transfer
2:29PM INF executed block height=16902285 module=state num_invalid_txs=0 num_valid_txs=1
2:29PM INF Connection is closed @ recvRoutine (likely by the other side) conn={"Logger":{}} module=p2p peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}
2:29PM INF service stop impl={"Data":{},"Logger":{}} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF Reconnecting to peer addr={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956} module=p2p
2:29PM INF service start impl="Peer{MConn{65.108.227.217:14956} 20e1000e88125698264454a884812746c2eb4807 out}" module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14
2:29PM INF service start impl=MConn{65.108.227.217:14956} module=p2p msg={} peer={"id":"20e1000e88125698264454a884812746c2eb4807","ip":"65.108.227.217","port":14956}
2:29PM INF Timed out dur=3000 height=16902285 module=consensus round=0 step=3
2:29PM INF commit synced commit=436F6D6D697449447B5B313130203230332031343720323236203230372031343920313134203138352034332031353720363520313238203139332031323920383920373420333920343420313033203135322032323520373720
2:29PM INF committed state app_hash=6ECB93E2CF9572B92B9D4180C181594A272C6798E14D40A95E83439F28D963B4 height=16902285 module=state num_txs=1
2:29PM INF Timed out dur=-6316.474049 height=16902286 module=consensus round=0 step=1
2:29PM INF indexed block exents height=16902285 module=txindex
2:29PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=7B970C0472E419D0DAF04DC28B6158C219360B37852D9547884B6D47A1217585 commit_round=0 height=16902286 module=consensus proposal={}
2:29PM INF received complete proposal block hash=7B970C0472E419D0DAF04DC28B6158C219360B37852D9547884B6D47A1217585 height=16902286 module=consensus
2:29PM INF Connection is closed @ recvRoutine (likely by the other side) conn={"Logger":{}} module=p2p peer={"id":"57a5297537b9b6ef8b105c08a8ad3f6ac452c423","ip":"51.81.35.11","port":1618}
2:29PM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"57a5297537b9b6ef8b105c08a8ad3f6ac452c423","ip":"51.81.35.11","port":1618}
2:29PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

cc @ckarabats

Version

>=v11

Steps to Reproduce


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
  • Is a spike necessary to map out how the issue should be approached?
@exp0nge exp0nge added type: bug Issues that need priority attention -- something isn't working status: waiting-triage This issue/PR has not yet been triaged by the team. labels Sep 16, 2023
@MSalopek
Copy link
Contributor

Possibly related to #2415, #2726.

Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried?
You can do so by setting this in your config.toml:

# Output level for logging, including package level options
log_level = "debug"

Please note that you may need to set up additional software in case you are exposing your node to the public, such as a load balancer or a proxy, much like you would if you were exposing any other backend service.

@exp0nge
Copy link
Author

exp0nge commented Sep 25, 2023

Possibly related to #2415, #2726.

Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried? You can do so by setting this in your config.toml:

# Output level for logging, including package level options
log_level = "debug"

Please note that you may need to set up additional software in case you are exposing your node to the public, such as a load balancer or a proxy, much like you would if you were exposing any other backend service.

The following gRPC requests are used:

  • tx.v1beta1: BroadcastTxRequest, GetTxRequest, SimulateRequest
  • bank.v1beta1: QueryAllBalancesRequest
  • auth.v1beta1: BaseAccount, QueryAccountRequest
  • base.tendermint.v1beta1: GetBlockByHeightRequest, GetLatestBlockRequest

In particular GetBlockByHeightRequest is used heavily alongside GetTxRequest (I am switching over to GetEvents to batch the per tx call)

@exp0nge
Copy link
Author

exp0nge commented Sep 25, 2023

Possibly related to #2415, #2726.

Have you tried running your node in debug mode (temporarily) and checking which RPC endpoints are being queried? You can do so by setting this in your config.toml:

# Output level for logging, including package level options
log_level = "debug"

Please note that you may need to set up additional software in case you are exposing your node to the public, such as a load balancer or a proxy, much like you would if you were exposing any other backend service.

By the way, I read through some of the suggestions in #2415, but this is a perf. regression. We have solved in creative ways, but the nodes consistently die.

edit: We trust Interchain's/this repo's interpretation of the blockchain state, adding a layer of indrection like an explorer, while possible, is something we would very much like to avoid.

@mmulji-ic
Copy link
Contributor

edit: We trust Interchain's/this repo's interpretation of the blockchain state, adding a layer of indrection like an explorer, while possible, is something we would very much like to avoid.

Indeed @exp0nge , every layer and redirection adds overhead and complexity - thats something that we always want to avoid if possible. The performance bottlenecks that you've mentioned, regression or otherwise, lie with the lower levels of the stack (SDK / Comet). If theres a specific endpoint that has regressed in terms of performance, then we can check with the relevant teams. Are the following endpoints GetBlockByHeightRequest & GetTxRequest causing most problems?

@exp0nge
Copy link
Author

exp0nge commented Nov 3, 2023

edit: We trust Interchain's/this repo's interpretation of the blockchain state, adding a layer of indrection like an explorer, while possible, is something we would very much like to avoid.

Indeed @exp0nge , every layer and redirection adds overhead and complexity - thats something that we always want to avoid if possible. The performance bottlenecks that you've mentioned, regression or otherwise, lie with the lower levels of the stack (SDK / Comet). If theres a specific endpoint that has regressed in terms of performance, then we can check with the relevant teams. Are the following endpoints GetBlockByHeightRequest & GetTxRequest causing most problems?

Unsure exactly which call causes this as we use the ones above listed, but the most heavily used ones are:
tx.v1beta1: BroadcastTxRequest, GetTxRequest

@MSalopek
Copy link
Contributor

@exp0nge Hey, has this issue been fixed on v15.1.0 for your usecase?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: waiting-triage This issue/PR has not yet been triaged by the team. type: bug Issues that need priority attention -- something isn't working
Projects
Status: 🤔 F1: Investigate
Development

No branches or pull requests

3 participants