Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC Tx simulation sometimes fails with account sequence mismatch #11597

Closed
4 tasks
ancazamfir opened this issue Apr 11, 2022 · 16 comments
Closed
4 tasks

gRPC Tx simulation sometimes fails with account sequence mismatch #11597

ancazamfir opened this issue Apr 11, 2022 · 16 comments

Comments

@ancazamfir
Copy link

Summary of Bug

gRPC Tx simulation sometimes fails with account sequence mismatch. This is seen with a higher number of Tx-es spawning multiple blocks.

Version

v0.45.1
(running gaia v7.0.0)

Steps to Reproduce

To reproduce I run hermes relayer with a high number of transactions. In the log below there are 526 IBC messages generated. Hermes bundles 15 per transaction and performs a tx simulate followed by broadcast_tx_sync. Normally if simulation fails hermes breaks the loop but I modified it here to continue. The error in this run is:

 "account sequence mismatch, expected 1509, got 1512: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "5481"} }

But then the next simulate seems perfectly fine, the simulate for seg 1513 works fine.

The symptom is that the simulation context sometimes "forgets" previous calls but recovers quickly after. Not sure if there are some caching glitches (maybe around block boundary) or we are using the gRPC wrongly. If the former I expect this to be reproducible with any type of transactions.

In this run:

  • first Tx (seq 1508) was eventually included in block 5481
  • the next (1508-1538), including the one that failed simulation (seq 1512) were included in 5482:
    • 1508-1511 simulation works fine
    • 1512 simulation fails
    • 1513-... simulation works fine
  • the last (1539-1543) were included in 5483

Below are more logs up acc seq 1513 (all the rest after were fine)

2022-04-11T14:30:58.640661Z  INFO ThreadId(59) packet_cmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}:relay{odata=PBE8O38QpJ ->Destination @1-5376; len=525}: assembled batch of 526 message(s)

2022-04-11T14:30:58.641176Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1508)
2022-04-11T14:30:58.642412Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.659263Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: tx simulation successful, gas amount used: 1138114
2022-04-11T14:30:58.659650Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 1138114 gas, fee Fee { amount: "2277stake", gas_limit: 2276228 } id=ibc-0
2022-04-11T14:30:58.674143Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(5BD46F5034BBCA706ABC2FEA88A472F68A833D8856D0D888A0BA3202EF9610EE) }

2022-04-11T14:30:58.674424Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1509)
2022-04-11T14:30:58.675625Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.694100Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: tx simulation successful, gas amount used: 1162682
2022-04-11T14:30:58.694639Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 1162682 gas, fee Fee { amount: "2326stake", gas_limit: 2325364 } id=ibc-0
2022-04-11T14:30:58.708255Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(2ADF3899564CB78F34927ACB8CBF6199CA18F237C3716AA599F7EE7D5608FD3C) }

2022-04-11T14:30:58.708523Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1510)
2022-04-11T14:30:58.710246Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.729467Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: tx simulation successful, gas amount used: 1162665
2022-04-11T14:30:58.729750Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 1162665 gas, fee Fee { amount: "2326stake", gas_limit: 2325330 } id=ibc-0
2022-04-11T14:30:58.743948Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(CB837A33A58ADE26B9318C9563E4CFF8E670E089DF3A701F2AB324AE6C2B3F28) }

2022-04-11T14:30:58.744173Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1511)
2022-04-11T14:30:58.745432Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.762630Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: tx simulation successful, gas amount used: 1162365
2022-04-11T14:30:58.762906Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 1162365 gas, fee Fee { amount: "2325stake", gas_limit: 2324730 } id=ibc-0
2022-04-11T14:30:58.776655Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(16A9D37AC2EFC5460F4C1A438F0E92F8A0616D4086DAEF968F9DFF549106774A) }

2022-04-11T14:30:58.776880Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1512)
2022-04-11T14:30:58.778153Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.786158Z  WARN ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: failed to simulate tx, falling back on default gas because the error is potentially recoverable: gRPC call failed with status: status: Unknown, message: "account sequence mismatch, expected 1509, got 1512: incorrect account sequence", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "5481"} }
2022-04-11T14:30:58.786400Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 4000000 gas, fee Fee { amount: "4001stake", gas_limit: 4000000 } id=ibc-0
2022-04-11T14:30:58.801198Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(39C606B3A10D9F9175B2E2AC17108C4F42ABF97A7C3911AA41F95936180C9E4D) }

2022-04-11T14:30:58.801412Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: sending 15 messages using account sequence AccountSequence(1513)
2022-04-11T14:30:58.802668Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: max fee, for use in tx simulation: Fee { amount: "4001stake", gas_limit: 4000000 }
2022-04-11T14:30:58.894121Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}:estimate_gas: tx simulation successful, gas amount used: 1162745
2022-04-11T14:30:58.894369Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: send_tx: using 1162745 gas, fee Fee { amount: "2326stake", gas_limit: 2325490 } id=ibc-0
2022-04-11T14:30:58.907547Z DEBUG ThreadId(40) send_tx_check{id=PBE8O38QpJ}:send_tx{id=ibc-0}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(26C864956FC26924E8B2F2941EB0B7A7F9276676F4660B6BDEDB583A54D5D8CD) }
...

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@alexanderbez
Copy link
Contributor

IIRC, simulation uses the CheckTx state that is built up during a block prior to being committed. The idea that there is high volume here, at least per (relayer) account, can indicate that there is a lag between state that is in CheckTx and ultimately committed.

I don't totally have a good answer to provide here in terms of remediation, but the whole "nonce mismatch" due to high volume accounts has been well reported over time. The only thing I can suggest is to wait a block or two prior to submitting txs or use multiple accounts round-robin style.

@ancazamfir
Copy link
Author

thanks @alexanderbez

IIRC, simulation uses the CheckTx state that is built up during a block prior to being committed. The idea that there is high volume here, at least per (relayer) account, can indicate that there is a lag between state that is in CheckTx and ultimately committed.

But if that's the case wouldn't we see errors during execution (DeliverTx)? And if it's the same state for both simulate and CheckTx, wouldn't we see the error in both? Instead we see that "build up" failing during simulate but not on CheckTx (broadcast_tx_sync doesn't show the error).

..."nonce mismatch" due to high volume accounts has been well reported over time

I tried to find related issues and wasn't able. Could you point me to any? Maybe it will help me understand better the root cause.

...wait a block or two prior to submitting txs

Could you please clarify? I don't understand how it would help.

@alexanderbez
Copy link
Contributor

Both CheckTx and Simulate (which is actually a query under the hood), both use the same state in BaseApp. So if simulate would fail prior to block N being committed, then surely CheckTx for that same tx would fail too. Maybe what is happening is that Simulate and CheckTx are being executed at different block heights, thus having a different "base" state.

@ancazamfir
Copy link
Author

sorry for the late update. After testing more, experimenting with different changes in hermes, talking to @marbar3778, opened informalsystems/hermes#2249 with my understanding and a proposed workaround that seems to work for hermes.
We also discussed with Marko if it would be possible to have a simulateTx flag to ignore the account sequence check. Not sure how feasible, Marko mentioned that he will discuss with you/ SDK team.
thanks for the help with this!

@alexanderbez
Copy link
Contributor

We also discussed with Marko if it would be possible to have a simulateTx flag to ignore the account sequence check. Not sure how feasible, Marko mentioned that he will discuss with you/ SDK team.

This is extremely trivial and easy:

diff --git a/x/auth/ante/sigverify.go b/x/auth/ante/sigverify.go
index c77632a7d0..2fb07dda4f 100644
--- a/x/auth/ante/sigverify.go
+++ b/x/auth/ante/sigverify.go
@@ -265,7 +265,7 @@ func (svd SigVerificationDecorator) AnteHandle(ctx sdk.Context, tx sdk.Tx, simul
                }
 
                // Check account sequence number.
-               if sig.Sequence != acc.GetSequence() {
+               if sig.Sequence != acc.GetSequence() && !simulate {

@alexanderbez
Copy link
Contributor

@marbar3778 is this what you had in mind?

@ancazamfir
Copy link
Author

Another possibility discussed with @marbar3778, @sergio-mena and @adizere is to have Simulate gRPC return an error during recheck Tx. Would require a new abci flag to signal the last rechecked Tx.
This would handle the root cause we were seeing in informalsystems/hermes#2249

@alexanderbez
Copy link
Contributor

Mhhh simulation is just an ABCI query. What does that have to do with the ReCheckTx flow?

@ancazamfir
Copy link
Author

ancazamfir commented Nov 17, 2022

We want to be able to tell if a Simulation error happens while mempool is doing recheck tx. hermes caches the account sequence and is not aware of when recheck happens. Currently when we get the account mismatch errors during simulation, we dig out in the error log and assume that an account mismatch error with got > expected is caused by this scenario which may not be the case.

@alexanderbez
Copy link
Contributor

So you're saying that you think that Hermes is getting the sequence mismatch error due to the Simulation query happening concurrently with ReCheckTx for that same tx?

If so, I have no Idea how the Simulation query would know if ReCheckTx is happening concurrently for that tx...or even in general?

@ancazamfir
Copy link
Author

So you're saying that you think that Hermes is getting the sequence mismatch error due to the Simulation query happening concurrently with ReCheckTx for that same tx?

Not for the same Tx, a new one. When hermes simulates a Tx(seq=N), it does so after it had previously broadcasted (broadcast_tx_sync) successfully Tx(seq=1),..., Tx(seq=N-1). Some may have been included in blocks, some of the recent ones may be in the mempool and in some cases being recheckTx-ed. If they are being rechecked simulation of the new one fails.

@alexanderbez
Copy link
Contributor

So I brought up an approach today during today's SDK community call that I think was confirmed would work as a reasonable solution. The idea is to bypass the nonce check during simulation. I've pushed a demo PR for experimentation purposes HERE. Please test it out and lmk what you think. If it works, then we can push it through.

@ancazamfir
Copy link
Author

Thanks @alexanderbez

This partially solves the problem. The issues I see is in the case simulation failures (without your changes) are legitimate, for example same wallet has been used outside the relayer. In this case we get a simulation error of type got < expected and we recache the account sequence. Without this simulation error, hermes ends up submitting a transaction that fails in checkTx.
Also, with your changes, we still hit errors in some cases due to the fact that recheck is in progress. This is when the transaction are "ordered" in other way than the account nonce. An example is for ordered IBC channels where we will still get a simulation error, something like packet sequence ≠ next receive sequence (X ≠ Y), with X > Y. It happens if Tx(packet sequence X) is simulated before the Tx(packet sequence Y) has been rechecked.

I did run some test with your PR and I do see less simulation errors. Will do more testing but in summary, it is better except in the cases above. We should probably also look if we break in other ways hermes, other relayers, or other user flows.

@tac0turtle
Copy link
Member

solved with #18641

@adizere
Copy link
Contributor

adizere commented Jan 9, 2024

Great work SDK team!

We'll be tracking integration in Hermes against this feature in informalsystems/hermes#3763.

@alexanderbez
Copy link
Contributor

You guys will love this feature I think. It should make relayers drastically simpler IMO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants