Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollup stops posting to DA layer after a while #1520

Open
S1nus opened this issue Feb 1, 2024 · 15 comments
Open

Rollup stops posting to DA layer after a while #1520

S1nus opened this issue Feb 1, 2024 · 15 comments
Assignees

Comments

@S1nus
Copy link
Member

S1nus commented Feb 1, 2024

Version of Git SHA

Observed the error on #1424, however it probably occurs on main as well.

System OS

Mac OS

Steps to reproduce it

Start a gm rollup, non-lazy aggregation, let it run for several hours. Eventually, the successfully submitted Rollkit block to DA layer messages stop appearing in the logs, but blocks still seem to get gossiped, stored, and indexed.

Expected result

successfully submitted Rollkit block to DA layer log messages keep up with the chain

Actual result

rollup seems to stop posting blocks to DA.

Notes

If it's failing silently, we need to ensure that the sequencer stops producing blocks, or else it becomes at risk of progressing too far for DA to keep up.

address the underlying cause if possible, and don't fail silently

@S1nus S1nus added the T:bug Something isn't working label Feb 1, 2024
@S1nus
Copy link
Member Author

S1nus commented Feb 1, 2024

it seems to start after this throws. perhaps it's not properly recovering from incorrect account sequence?

ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence"

@jcstein
Copy link
Member

jcstein commented Feb 2, 2024

how long is a while? I noticed that between batches posting on Polaris, the time went from every block or so, to every minute, to every few minutes, eventually lagging 15 minutes between batch postings. during that time rollkit was still producing blocks, some 1000 extra blocks, with no warnings or errors then, after posting the last batch, the chain crashed after running for around one hour with the same error that @mycodecrafting is seeing:

5:49PM ERR failed to prepare proposal err="invalid timestamp, parent 1706914174 given 1706914174" height=3573 module=server

i originally suggested to @mycodecrafting to change the block time to 2s, but i don't think changing the block time will do anything but slow down the rate at which the chain fails. it'll inevitably hit the point where it begins to lag so much, based on the number of blobs being posted IMO, that it can't fit them into a pfb and it gets stuck in mempool causing issues with acc sequence errors, eventually resulting in failed txs. (namespace on celenium to see the different batch postings click the "messages" tab)

image

root cause

5:49PM ERR failed to build payload err="invalid timestamp, parent [1706914174](tel:1706914174) given [1706914174](tel:1706914174)" module=server
5:49PM ERR failed to prepare proposal err="invalid timestamp, parent [1706914174](tel:1706914174) given [1706914174](tel:1706914174)" height=3573 module=server
5:49PM ERR failed to process proposal err="failed to find envelope in proposal" hash=2A4784028005D12EBD413A5BCDAC7B470DFF4B801D74D202BD0813B54E9C953A height=3573 module=server
panic: error while processing the proposal: <nil>

goroutine 356 [running]:
[github.com/rollkit/rollkit/block.(*Manager).publishBlock(0x140006a92c0](http://github.com/rollkit/rollkit/block.(*Manager).publishBlock(0x140006a92c0), {0x1049062e0, 0x14001454730?})
 [github.com/rollkit/rollkit@v0.11.19/block/manager.go:732](http://github.com/rollkit/rollkit@v0.11.19/block/manager.go:732) +0xbd8
[github.com/rollkit/rollkit/block.(*Manager).AggregationLoop(0x140006a92c0](http://github.com/rollkit/rollkit/block.(*Manager).AggregationLoop(0x140006a92c0), {0x1049062e0, 0x14001454730}, 0x0)
 [github.com/rollkit/rollkit@v0.11.19/block/manager.go:274](http://github.com/rollkit/rollkit@v0.11.19/block/manager.go:274) +0x1d0
[github.com/rollkit/rollkit/node.(*FullNode).OnStart.func1()](http://github.com/rollkit/rollkit/node.(*FullNode).OnStart.func1())
 [github.com/rollkit/rollkit@v0.11.19/node/full.go:379](http://github.com/rollkit/rollkit@v0.11.19/node/full.go:379) +0x30
[github.com/rollkit/rollkit/types.(*ThreadManager).Go.func1()](http://github.com/rollkit/rollkit/types.(*ThreadManager).Go.func1())
 [github.com/rollkit/rollkit@v0.11.19/types/threadmanager.go:26](http://github.com/rollkit/rollkit@v0.11.19/types/threadmanager.go:26) +0x58
created by [github.com/rollkit/rollkit/types.(*ThreadManager).Go](http://github.com/rollkit/rollkit/types.(*ThreadManager).Go) in goroutine 1
 [github.com/rollkit/rollkit@v0.11.19/types/threadmanager.go:24](http://github.com/rollkit/rollkit@v0.11.19/types/threadmanager.go:24) +0x7c
make: * [start] Error 2

full logs & version info

versions i am using:

  • rollkit-main of polaris
  • celesta-da v0.12.9
  • mocha network
  • namespace on celenium
  • block time: 1s
  • os: macos

polaris logs: https://app.warp.dev/block/rgCrw9OZY4TPapRK2gkyth
light node logs: https://app.warp.dev/block/tGCV5MLNTxp1tj2s5f1Bfa

@jcstein
Copy link
Member

jcstein commented Feb 3, 2024

testing with gm world rollup

i decided to leave the same test running overnight with gm world, using default versions in docs, v0.11.19 rollkit, v28.1.0 ignite

results

  1. chain posted first ~9000 blocks to DA
  2. significant lag between postings began in first few hours
  3. after this, "blob too big" errors began to show due to no logic in place to deal with backlog of blocks
  4. chain is still running, celestia-da has crashed, and chain has produced ~50000 more blocks than have not been posted to DA, with few error logs
  5. light node seems to have just gave up based on these full logs from the attempts to post blobs that were too
  6. full rollup logs (i manually stopped chain) show both "unable to get DA max blob size" and "failed to submit block to DA layer after 30 attempts", without any other errors saying "light node/DA isn't running anymore"
  7. namespace on celenium

@S1nus
Copy link
Member Author

S1nus commented Feb 3, 2024

Just attempted to manually test using this commit of celestia-da. I got this error output from Rollkit

2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=3 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=4 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=5 module=BlockManager
2:35PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=6 module=BlockManager

The chain progresses despite the DA submission failure. We need to stop block production if pendingBlocks start to pile up past a certain number.

We also need to solve all the underlying errors with DA submission

@S1nus
Copy link
Member Author

S1nus commented Feb 7, 2024

including another example of poor dev ex relating to this problem:

▶ bash restart-local.sh 
1:43PM INF starting node with ABCI CometBFT in-process module=server
1:43PM INF starting node with Rollkit in-process module=server
1:43PM INF service start impl=multiAppConn module=proxy msg="Starting multiAppConn service"
1:43PM INF service start connection=query impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=snapshot impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=mempool impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start connection=consensus impl=localClient module=abci-client msg="Starting localClient service"
1:43PM INF service start impl=EventBus module=events msg="Starting EventBus service"
1:43PM INF service start impl=PubSub module=pubsub msg="Starting PubSub service"
1:43PM INF service start impl=IndexerService module=txindex msg="Starting IndexerService service"
1:43PM INF service start impl=RPC module=server msg="Starting RPC service"
1:43PM INF service start impl=Node module=server msg="Starting Node service"
1:43PM INF starting P2P client module=server
1:43PM INF serving HTTP listen address=127.0.0.1:36657 module=server
1:43PM INF listening on address=/ip4/127.0.0.1/tcp/36656/p2p/12D3KooWELDpQFaaJqZBMNrL93KUHxVLsByEp5DJWuLhupJZ6x7F module=p2p
1:43PM INF listening on address=/ip4/192.168.1.187/tcp/36656/p2p/12D3KooWELDpQFaaJqZBMNrL93KUHxVLsByEp5DJWuLhupJZ6x7F module=p2p
1:43PM INF no seed nodes - only listening for connections module=p2p
1:43PM INF working in aggregator mode block time=1000 module=server
1:43PM INF Creating and publishing block height=64 module=BlockManager
1:43PM INF starting gRPC server... address=localhost:9090 module=grpc-server
1:43PM INF finalized block block_app_hash=631340C19AF7DFD60DF5763E3E2259F0B0360258D039EF70C18E73BC175FEFDD height=64 module=BlockManager num_txs_res=0 num_val_updates=0
1:43PM INF executed block app_hash=631340C19AF7DFD60DF5763E3E2259F0B0360258D039EF70C18E73BC175FEFDD height=64 module=BlockManager
1:43PM INF indexed block events height=64 module=txindex
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=2 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=3 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=4 module=BlockManager
1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=5 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=6 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=7 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=8 module=BlockManager
1:44PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=9 module=BlockManager
1:44PM INF successfully submitted Rollkit block to DA layer count=1 daHeight=1120183 module=BlockManager

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

on the initial

1:43PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager

did you already have a chain running previously and the transaction is stuck?

it looks like either the old transaction finally failed or it was overridden by a new one with higher fee?

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

FWIW, when I was testing out a new version of Polaris with v0.12.10 of celestia-da, I did not encounter the same issues with blocks piling up. Here is the issue linked with testing setup and logs for reference.

I am testing a GM rollup now with v0.12.10 of celestia-da. will report back here

edit: i jinxed this.

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

I'm hitting the same bugs on this.

setup

  1. celestia-da v0.12.10 with mocha light node
  2. rollkit v0.11.19
  3. gm world rollup from tutorial, posting to mocha instead of arabica
  4. running in this namespace

results

7 batch postings and 564 rollup blocks in, out of 2000 total blocks have been posted. ultimately, the chain stopped posting at block 564 on the rollup and has been stuck in an incorrect account sequence error since block 873. the rollup node, since block start of chain has been giving errors:

first error - timed out waiting to be included in a block

2:25PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager

second error - tx already in mempool

2:25PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager

third error - 30 attempts failed

2:32PM ERR error while submitting block to DA error="failed to submit block to DA layer after 30 attempts" module=BlockManager

fourth error - incorrect account sequence

2:32PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager

ultimately, the chain stopped posting at block 564 on the rollup and has been stuck in an incorrect account sequence error since block 873

logs

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

welp

update...because i go back to check celenium after i have stopped the chain and it has posted the last 1400 blocks in 2 blobs

image

the inconsistency though is worrisome. i am starting up another test of this and will let it run longer.

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

testing again

off the jump, i hit errors too of 1. timeout -> 2. tx already in mempool -> 3. failed after 30 attempts -> 4. incorrect account sequence 💀

3:18PM INF indexed block events height=65 module=txindex
3:18PM INF Creating and publishing block height=66 module=BlockManager
3:18PM INF finalized block block_app_hash=AA4DE08E50D9CC5C02C63D84D6D36B2AA67E180AC1BAF98FBC96A32A44F1F901 height=66 module=BlockManager num_txs_res=0 num_val_updates=0
3:18PM INF executed block app_hash="�M��P��\\\x02�=���k*�~\x18\n�������*D��\x01" height=66 module=BlockManager
3:18PM INF indexed block events height=66 module=txindex
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = rpc error: code = Unknown desc = timed out waiting for tx to be included in a block" attempt=1 module=BlockManager
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=2 module=BlockManager
3:18PM INF Creating and publishing block height=67 module=BlockManager
3:18PM INF finalized block block_app_hash=37B23F80A55DC33C82102F484F2786C5D95327F55443CA2B7C5BB33371341968 height=67 module=BlockManager num_txs_res=0 num_val_updates=0
3:18PM INF executed block app_hash="7�?��]�<�\x10/HO'���S'�TC�+|[�3q4\x19h" height=67 module=BlockManager
3:18PM INF indexed block events height=67 module=txindex
3:18PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : tx already in mempool" attempt=3 module=BlockManager
[...]
3:24PM INF indexed block events height=431 module=txindex
3:24PM ERR error while submitting block to DA error="failed to submit block to DA layer after 30 attempts" module=BlockManager
3:24PM ERR DA layer submission failed error="failed to submit blocks: rpc error: code = Unknown desc = : incorrect account sequence" attempt=1 module=BlockManager

update

a blob went through, 500+ blocks after it should have on the rollup
https://mocha.celenium.io/namespace/000000000000000000000000000000000000676f6f646d6f726e696e
the issue is that the cycle of 1. timeout -> 2. tx already in mempool -> 3. failed after 30 attempts -> 4. incorrect account sequence starts again at this point. leading to a massive backlog of blocks piling up.

800+ blocks in, 15 blocks have been posted to DA

Screenshot 2024-02-07 at 3 31 44 PM

second blob has been posted 17 minutes later

446/1600 blocks have been posted to DA
Screenshot 2024-02-07 at 3 44 16 PM

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

I'd like to stop adding to this thread, but new issues keep popping up. At time of writing this, 2600 blocks had been created, but 3348 blobs have been posted to the namespace

Screenshot 2024-02-07 at 4 00 31 PM

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

there are now 2.5 times as many blobs posted to DA as there are rollup blocks that have been posted.

9143/3647 = ~2.50

there is no real way for me to know if this rollup has actually posted everything, or just posted the same blocks multiple times.

Screenshot 2024-02-07 at 4 17 52 PM

@jcstein
Copy link
Member

jcstein commented Feb 7, 2024

i'm now stopping the chain, because no new blobs have been posted, 3000 more blocks have been produced

Screenshot 2024-02-07 at 5 08 54 PM

@jcstein
Copy link
Member

jcstein commented Feb 8, 2024

It seems the source of the last issue (last 1, 2, 3 comments) where the same batch was posted more than once, is related to #1472

@tuxcanfly
Copy link
Contributor

With #1535 we're handling common DA congestion errors like

  • timed out waiting for tx to be included in a block
  • incorrect account sequence
  • insufficient fee
  • tx already in mempool
  • tx too big

Most of these errors originate from insufficient gas price. We're now adding an increased timeout and retrying the transaction based on the error message.

We can improve this by estimating gas price from third party providers after celestiaorg/celestia-app#3114 is implemented.

Until then we support dynamic gas price retargeting by using the --gas-multiplier flag, documented in rollkit/docs#328

@Manav-Aggarwal Manav-Aggarwal added v1 polaris v2 and removed T:bug Something isn't working P:high Priority: High va v1 labels Mar 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

4 participants