Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CV Optimistic Export and Import #878

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

chillyvee
Copy link
Contributor

Optimistic Export and Import aims to reduce p2p statesync recovery times.

Validators attempting to recover nodes for cosmos-sdk chains with significant state size may be slashed for downtime prior to recovery.

Although all data is verified during tree build, rebuilding the entire IAVL tree using minimal data incurs high CPU overhead.

Evmos for example takes over 16 hours to rebuild IAVL on a "fast" CPU/DISK machine and can appear to "never finish". The node must then fast-sync 16 hours of missed blocks which can take an additional 4-5 hours. Slashing for downtime is within 30 hours meaning a validator must succeed on the first try or be slashed.

Since statesync appears to "hang", validators download a 244GB file from Polkachu to restore data directories. This is possible because Polkachu is a trusted party. Validators do not appear to validate the correctness of the IAVL tree so this method is less trusted but necessary. The problem is when Polkachu only makes 1 snapshot every 24 hours. A validator may need to fast-sync 23 hours of lost block, wait for the next Polkachu snapshot, or try another random less known snapshot.

Optimistic statesync enables faster database import at the restoring node by directly inserting the final key/value pairs into the database. As long as validators statesync from a trusted node, this can be a good alternative to downloading data directory zip files.

Cosmos-sdk should broadcast a different snapshot format so that nodes can select optimistic statesync node sources.

There are some code locations below where there may be more efficient methods to obtain the raw key/value storage than recalculation.

Comments welcome.

@chillyvee chillyvee requested a review from a team as a code owner February 4, 2024 11:57
Copy link

coderabbitai bot commented Feb 4, 2024

Walkthrough

The updates to the iavl package introduce an optimistic mode for both exporting and importing tree data, aimed at enhancing performance by allowing raw key-value pairs to be handled without verification. This mode is supported by new methods in the Exporter and Importer structs, along with corresponding changes in the tree structures (ImmutableTree and MutableTree) to support the export and import of nodes in this optimistic manner. The changes include the ability to swap key-value pairs in a tree structure and to manage node data more efficiently at the byte level.

Changes

Files Change Summaries
export.go, export_test.go Introduced an optimistic flag and methods for creating exporters with options, including an optimisticExport method. Added tests for tree exports and imports for consistency.
immutable_tree.go, mutable_tree.go Added OptimisticExport() and OptimisticImport() methods to support exporting and importing tree nodes in an optimistic manner, facilitating the recreation of identical trees.
import.go, import_test.go Added a new optimistic flag and functions for optimistic imports, including a specific OptimisticAdd function to add key-value pairs without verification. Modified the Commit function for handling optimistic imports and added tests covering different node configurations.
nodedb.go Introduced methods for managing node key and value bytes (GetNodeKeyValueBytes and SaveNodeKeyValueBytes), supporting both legacy and new node formats. Includes functionality for fetching and saving nodes to disk with raw byte arrays.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Note: Auto-reply has been disabled for this repository by the repository owner. The CodeRabbit bot will not respond to your comments unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit tests for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository from git and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit tests.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

CodeRabbit Discord Community

Join our Discord Community to get help, request features, and share feedback.

@yihuang
Copy link
Collaborator

yihuang commented Feb 4, 2024

Do you use memiavl in evmos, that should only take dozens of minutes to restore.

@chillyvee
Copy link
Contributor Author

chillyvee commented Feb 4, 2024

Do you use memiavl in evmos, that should only take dozens of minutes to restore.

Will try it. Thank you.

What do you think of the concept in this PR? Is it worth reducing CPU usage and inserting keys from a trusted node (for example if a validator runs 2 nodes and wants to copy keys between goleveldb databases)

@yihuang
Copy link
Collaborator

yihuang commented Feb 4, 2024

Optimistic statesync enables faster database import at the restoring node by directly inserting the final key/value pairs into the database

what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?

@chillyvee
Copy link
Contributor Author

chillyvee commented Feb 4, 2024

what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?

Usually for the restoring node to write that to disk, the hash of all children must be computed before parents can be written to disk.

However it is unnecessary to rehash all data and rebuild that IAVL tree. The receiver can write down the result of all the work.

Instead of

db.Set(recursive hashAllChildenAndBuildIAVLTree, value)

Assuming the source is trusted, It is faster to send the resulting key and value

Sender:

resultingKey = hash("k")
resultingValue = buffer.Write(....)
send(resultingKey, resultingValue)

Receiver:

db.Set(resultingKey, resultingValue) <- no tree build or hash operations

Then the node will start as if all key/values were in the database for the current version/blockchain height.

The goal is to make a restore takes no longer than a CSV file copy over ftp/http and insert all key/values into leveldb.

@yihuang
Copy link
Collaborator

yihuang commented Feb 4, 2024

what do you mean by "final key/value pairs", do you mean only handle leaf key values skipping branch nodes?

Usually for the restoring node to write that to disk, the hash of all children must be computed before parents can be written to disk.

However it is unnecessary to rehash all data and rebuild that IAVL tree. The receiver can write down the result of all the work.

Instead of

db.Set(recursive hashAllChildenAndBuildIAVLTree, value)

Assuming the source is trusted, It is faster to send the resulting key and value

Sender:

resultingKey = hash("k")
resultingValue = buffer.Write(....)
send(resultingKey, resultingValue)

Receiver:

db.Set(resultingKey, resultingValue) <- no tree build or hash operations

Then the node will start as if all key/values were in the database for the current version/blockchain height.

I see, so the data written into application.db is the same right? but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.

@chillyvee
Copy link
Contributor Author

I see, so the data written into application.db is the same right?

Yes same result to goleveldb (or other disk based iavl)

but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.

Yes, memiavl as a solution when RAM is sufficient.

And this solution when disk write speeds are acceptable (with no other change in expected RAM) and utilize more disk based strategy.

@yihuang
Copy link
Collaborator

yihuang commented Feb 4, 2024

I see, so the data written into application.db is the same right?

Yes same result to goleveldb (or other disk based iavl)

but I guess the db write is the slowest part, for example memiavl is much faster despite it also rebuild the iavl tree from current state sync snapshot format.

Yes, memiavl as a solution when RAM is sufficient.

memiavl doesn't need much memory actually in our practice.

@chillyvee
Copy link
Contributor Author

memiavl doesn't need much memory actually in our practice.

Will get back to you after some testing. Having a hard time getting the right rpc/peers for evmos statesync.

@chillyvee
Copy link
Contributor Author

chillyvee commented Feb 24, 2024

Tried to use memiavl to restore evmos. Each time, the data directory is completely deleted and init is run with the chain-id evmos_9001-2.

After getting all statesync parts, the node crashes and upon start, this message appears

panic: invalid chain-id on BeginBlock; expected: , got: evmos_9001-2

The faster iavl build time is confirmed (3+ hours. Same even with local snapshot files, no network dependency + restore on a fast machine with high ram + raid0 nvme), but the node itself doesn't seem to get the chain-id, so full operation could not be confirmed on evmos.

Noticed that there is a fair amount of code changes and dependency on the cronos chain repository (github.com/crypto-org-chain/cronos/memiavl) necessary to get enable memiavl for a new chain. That might be more than a chain developer is able to do and maintain. Another thought is to bring memiavl mainline into cosmos/iavl or cosmos/memiavl. yihuang, you are most familiar so you can probably give a good comment on options or best way to implement.

This optimisitic export and import allows minimal code changes and faster restores.

@yihuang
Copy link
Collaborator

yihuang commented Feb 24, 2024

The faster iavl build time is confirmed (3+ hours. Same even with local snapshot files, no network dependency + restore on a fast machine with high ram + raid0 nvme)

what is your testing procedure, how large is the snapshot file, it's dozens of minutes for us, do you use the local state sync commands (./simappd snapshots --help)?

@chillyvee
Copy link
Contributor Author

chillyvee commented Feb 25, 2024

Did not realize the chain-id was pulled from client.toml. Fixed that and statesync with memiavl works. Will let evmos run for a few more hours to generate snapshots for local restore testing.

Noticed a failed statesync snapshot. Suspect that memiavl pruned the oldest height before the statesync snapshot could finish. Future statesync snapshots were not generated. Increased memiavl snapshots held to allow more time for statesync export.

Do you think it would help to document the relation between these app.toml settings so that a configuration is practical?

pruning-keep-recent = "500000"

[memiavl]
  snapshot-interval = 1000
  snapshot-keep-recent = 1

[state-sync]
  snapshot-interval = 100000
  snapshot-keep-recent = 5

Statesync for evmos took about 1h40m to download and apply 2924 chunks over p2p. Snapshot was old and took 4 hours to BlockSync to the latest block.

Locally disk reads are fast. Reading all 2924 chunks takes only 1m30s.

Will retry again when a local snapshot is successfully created for a more recent height.

@yihuang
Copy link
Collaborator

yihuang commented Feb 25, 2024

Noticed a failed statesync snapshot. Suspect that memiavl pruned the oldest height before the statesync snapshot could finish. Future statesync snapshots were not generated. Increased memiavl snapshots held to allow more time for statesync export.

do you have the error message?
Pruning is just deleting the files from filesystem, according to my understanding, the already opened file handlers should still work as normal, including the mmap regions created on that, should still works until you close them.

Do you think it would help to document the relation between these app.toml settings so that a configuration is practical?

There could be different combinations, but for most of the use cases, I think state-sync.snapshot-interval can be simply set to the same as memiavl.snapshot-interval when enabled, so a state sync snapshot is created after each memiavl snapshot is created.
We set memiavl.snapshot-interval to 10000, it can even be larger if you can accept a little bit longer worst-case start-up time, 10000 translates to a few seconds in our case.
I'd also recommend setting the memiavl.snapshot-interval much larger during catching-up, but that needs adjusting manually later on.
The old pruning-* options have no effect on memiavl.

@yihuang
Copy link
Collaborator

yihuang commented Feb 25, 2024

Noticed that there is a fair amount of code changes

I think most of the code changes are for the versiondb integration? memiavl integration itself should be fairly simple.

and dependency on the cronos chain repository (github.com/crypto-org-chain/cronos/memiavl) necessary to get enable memiavl for a new chain. That might be more than a chain developer is able to do and maintain. Another thought is to bring memiavl mainline into cosmos/iavl or cosmos/memiavl. yihuang, you are most familiar so you can probably give a good comment on options or best way to implement.

it's a good question on long-term maintenance, we had a few discussions on storage workgroup calls before, but the decision is to keep it as a third-party thing. personally I'm totally fine and supportive if we maintain it under cosmos org.

@chillyvee
Copy link
Contributor Author

There could be different combinations, but for most of the use cases, I think state-sync.snapshot-interval can be simply set to the same as memiavl.snapshot-interval when enabled, so a state sync snapshot is created after each memiavl snapshot is created.
We set memiavl.snapshot-interval to 10000, it can even be larger if you can accept a little bit longer worst-case start-up time, 10000 translates to a few seconds in our case.
I'd also recommend setting the memiavl.snapshot-interval much larger during catching-up, but that needs adjusting manually later on.
The old pruning-* options have no effect on memiavl.

Sharing a problem encountered for snapshots using memiavl + evmosd on mainnet

ERR failed to create state snapshot err="failed to generate snapshot chunk 0: snapshot is not created yet: height: 19266752" height=192667

Looking at snapshots directory, initial restore height of 19072000 and the next snapshot is available. The rest appear to fail

28G     19072000
28G     19080000
1.5K    19150000
1.5K    19160000
1.5K    19170000
1.5K    19180000
1.5K    19190000
1.5K    19217745
1.5K    19224746
1.5K    19231747
1.5K    19238748
1.5K    19245749
1.5K    19252750
1.5K    19259751
1.5K    19266752

data/memiavl.db has data as expected

memiavl.db
512     current
512     LOCK
171G    snapshot-00000000000019246541
171G    snapshot-00000000000019251544
171G    snapshot-00000000000019256547
171G    snapshot-00000000000019261550
171G    snapshot-00000000000019266553
171G    snapshot-00000000000019271556

evmosd snapshots list only shows the 2 successful data/snapshot heights

% evmosd snapshots list
height: 19080000 format: 3 chunks: 2923
height: 19072000 format: 3 chunks: 2924

This is the current app.toml settings

[memiavl]
  async-commit-buffer = 0
  cache-size = 1000
  enable = true
  snapshot-interval = 5003
  snapshot-keep-recent = 5
  zero-copy = false

[state-sync]
  snapshot-interval = 7001
  snapshot-keep-recent = 2

Are you able to see the same issue on your side? Any alternate settings that you recommend checking?

@yihuang
Copy link
Collaborator

yihuang commented Feb 29, 2024

Sharing a problem encountered for snapshots using memiavl + evmosd on mainnet

ERR failed to create state snapshot err="failed to generate snapshot chunk 0: snapshot is not created yet: height: 19266752" height=192667

Looking at snapshots directory, initial restore height of 19072000 and the next snapshot is available. The rest appear to fail

The fifth parameter in memiavlstore.SetupMemIAVL is supportExportNonSnapshotVersion bool, where false means don't support taking a snapshot on non-snapshot versions.
The reason is there's a specialized algorithm when taking state-sync snapshot from a memiavl snapshot directly, which is more efficient than the generic algorithm that traverses the tree (although in practice, the speed difference seems marginal), so you have two options here:

  1. Set the supportExportNonSnapshotVersion to true, and it'll support exporting snapshots at any version.
  2. Set state-sync.snapshot-interval to be multiples of memiavl.snapshot-interval, so when you take a state-sync snapshot, there's always a memiavl snapshot for it.

I think the second option should be good enough for the majority of use cases, that is why we set the supportExportNonSnapshotVersion to false by default.

@chillyvee
Copy link
Contributor Author

chillyvee commented Feb 29, 2024

Adjusted snapshot settings and retrying

The fifth parameter in memiavlstore.SetupMemIAVL

Cosmos code has many unnamed variables. This can make adding a configuration variable difficult (find if a parameter exists, determine if parameter can be passed, add parameter in calling, edit function call in called library)

Is there any "guidance" against code structure such as:

       baseAppOptions = memiavlstore.SetupMemIAVL(common, options)
       baseAppOptions.setSupportExportNonSnapshotVersion(true)
       baseAppOptions.FinishSetup()

personally I'm totally fine and supportive if we maintain it under cosmos org.

Would you prefer to start that yourself? Or could we prepare a repository and move that to cosmos org once it makes sense?

Do you think it is generally "safe" to add memiavl to any chain of any sdk version with this pattern of code?

	baseAppOptions = memiavlstore.SetupMemIAVL(logger, homePath, appOpts, false, false, baseAppOptions)

	// Setup Mempool and Proposal Handlers
	baseAppOptions = append(baseAppOptions, func(app *baseapp.BaseApp) { 

@yihuang
Copy link
Collaborator

yihuang commented Feb 29, 2024

Cosmos code has many unnamed variables. This can make adding a configuration variable difficult (find if a parameter exists, determine if parameter can be passed, add parameter in calling, edit function call in called library)

yeah, agree on this specific option, but it's usually static, there are several other options that are configurable in app.toml, we can maybe add this one to config file as well.

Do you think it is generally "safe" to add memiavl to any chain of any sdk version with this pattern of code?

I think so, we've been using on production for a while, the multistore wrapper might need some adjustments when updating to different sdk versions, we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.

@chillyvee
Copy link
Contributor Author

we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.

Are the compatible version branches on https://github.com/crypto-org-chain/cronos/tree/main/memiavl ?

@yihuang
Copy link
Collaborator

yihuang commented Mar 1, 2024

we have compatible versions for sdk 0.46 and 0.47, 0.50 is on draft PR.

Are the compatible version branches on https://github.com/crypto-org-chain/cronos/tree/main/memiavl ?

Yeah, currently, you need to follow cronos release branch, v1.0.x for SDK 0.46, main for SDK 0.47.

I'm preparing a standalone repo at https://github.com/crypto-org-chain/memiavl, will make public soon.

@chillyvee
Copy link
Contributor Author

0.50 is on draft PR.

Please share when available. Would like to run on a v0.50 testnet

@yihuang
Copy link
Collaborator

yihuang commented Mar 1, 2024

0.50 is on draft PR.

Please share when available. Would like to run on a v0.50 testnet

it's available here, crypto-org-chain/cronos#1331, it's able to build and run basic devnet, but still many tests to fix.
the changes to memiavl and store wrappers are straightforward though.

@chillyvee
Copy link
Contributor Author

chillyvee commented Mar 1, 2024

main for SDK 0.47.

Implemented memiavl for gaiad in PR cosmos/gaia#2973

memiavl works well for gaiad in theta testnet

statesync node over p2p then put into validator mode. snapshots creation also works. No issues so far.

If you have any comments/suggestions regarding method of integration for chains, your feedback is always welcome.

@chillyvee
Copy link
Contributor Author

@yihuang Do you know if the statesync snapshot/restore works properly for memiavl + cosmwasm module chains?

@chillyvee

This comment was marked as duplicate.

@yihuang
Copy link
Collaborator

yihuang commented Mar 2, 2024

@yihuang Do you know if the statesync snapshot/restore works properly for memiavl + cosmwasm module chains?

I think so, it only replaces root ms, the other part is still managed by sdk, but we didn't tested though.

@chillyvee
Copy link
Contributor Author

Was not able to locate memiavl in:
https://github.com/crypto-org-chain/cronos/tree/release/v1.0.x

Found this with a cosmos-sdk v0.46 reference. Is this the right branch for v0.46?
https://github.com/crypto-org-chain/cronos/blob/memiavl/v0.0.5/memiavl/go.mod

Also is it difficult to modify for cosmos-sdk v0.45?

@yihuang
Copy link
Collaborator

yihuang commented Mar 8, 2024

Was not able to locate memiavl in: https://github.com/crypto-org-chain/cronos/tree/release/v1.0.x

Found this with a cosmos-sdk v0.46 reference. Is this the right branch for v0.46? https://github.com/crypto-org-chain/cronos/blob/memiavl/v0.0.5/memiavl/go.mod

It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's go.mod itself, for example, right now it uses v0.0.0-20240130060137-c12016e4052e:

https://github.com/crypto-org-chain/cronos/blob/release/v1.0.x/go.mod#L222

Also is it difficult to modify for cosmos-sdk v0.45?

I think not difficult, since the store interfaces don't change much from 0.45 to 0.46.

@chillyvee
Copy link
Contributor Author

chillyvee commented Mar 9, 2024

I think not difficult, since the store interfaces don't change much from 0.45 to 0.46.

Attempting to create a v0.45 branch for memiavl based on Jackal Chain

Here is the current progress code with current errors. Rootmulti is not connecting properly yet.

https://github.com/chillyvee/cronos/tree/cv045wip

https://github.com/chillyvee/canine-chain/tree/cv3.1.3memiavl

data/memiavl.db directory appears

With app.toml [state-sync] configured the error is:

failed to load latest version: state sync snapshots require a rootmulti store

With app.toml [state-sync] removed the error is:

panic: uncached KVStore is not supposed to be accessed directly

node canined starts normally after app.toml [memiavl] is removed

could it be a problem with cosmwasm trying to pin codes early? (InitializePinnedCodes)

goroutine 1 [running]:
github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
        github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa5c60, 0x3c11940}, {0x7fd020267df8, 0xc0015a4a90}, {{0x0, 0x0}, {0x0, 0x0}, 0x0, {0x0, ...}, ...}, ...}, ...)
        github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8c2a0, 0xc000ea0910}, {0x2abc920, 0xc0002f6470}, {0x2a907d0, 0xc0000be750}, {0x2a86ca0, 0xc000eb02e0}, {0x2a863e0, 0xc0001e30f8}, ...}, ...)
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79
github.com/jackalLabs/canine-chain/v3/app.NewJackalApp({0x2aa5c98, 0xc000f8c900}, {0x2ab78b8, 0xc000186128}, {0x0, 0x0}, 0x1, 0xc001168e40, {0xc000056468, 0x14}, ...)
        github.com/jackalLabs/canine-chain/v3/app/app.go:1035 +0x9fd8
main.appCreator.newApp({{{0x2aabdb8, 0xc0001204e0}, {0x2abc920, 0xc0002f6470}, {0x2ab4b60, 0xc00047e4c0}, 0xc000672468}}, {0x2aa5c98, 0xc000f8c900}, {0x2ab78b8, ...}, ...)
        github.com/jackalLabs/canine-chain/v3/cmd/canined/root.go:221 +0xa2c
github.com/cosmos/cosmos-sdk/server.startInProcess(_, {{0x0, 0x0, 0x0}, {0x2ac6d90, 0xc000fa2d80}, {0xc00013c878, 0x8}, {0x2aabd70, 0xc0002f6470}, ...}, ...)
        github.com/cosmos/cosmos-sdk@v0.45.17/server/start.go:260 +0x532
github.com/cosmos/cosmos-sdk/server.StartCmd.func2(0xc000347900?, {0x3c11940?, 0x0?, 0x0?})
        github.com/cosmos/cosmos-sdk@v0.45.17/server/start.go:128 +0x1e8
github.com/spf13/cobra.(*Command).execute(0xc000276c00, {0x3c11940, 0x0, 0x0})
        github.com/spf13/cobra@v1.7.0/command.go:940 +0x87c
github.com/spf13/cobra.(*Command).ExecuteC(0xc000313800)
        github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
        github.com/spf13/cobra@v1.7.0/command.go:985
github.com/cosmos/cosmos-sdk/server/cmd.Execute(0x3b1b298?, {0xc000056468, 0x14})
        github.com/cosmos/cosmos-sdk@v0.45.17/server/cmd/execute.go:36 +0x167
main.main()
        github.com/jackalLabs/canine-chain/v3/cmd/canined/main.go:15 +0x26

Removing wasm InitializePinnedCodes and disabling state-sync enables app start. Suspect this is okay since memiavl directly triggers state-sync snapshot Create() after memiavl snapshot is finished.

[memiavl]
  async-commit-buffer = 0
  cache-size = 1000
  enable = true
  snapshot-interval = 61
  snapshot-keep-recent = 3
  zero-copy = false

#[state-sync]
#  snapshot-interval = 61
#  snapshot-keep-recent = 3

and editing canined/app/app.go

diff --git a/app/app.go b/app/app.go
index d712b96c..8513e492 100644
--- a/app/app.go
+++ b/app/app.go
@@ -119,7 +119,7 @@ import (
        tmjson "github.com/tendermint/tendermint/libs/json"
        "github.com/tendermint/tendermint/libs/log"
        tmos "github.com/tendermint/tendermint/libs/os"
-       tmproto "github.com/tendermint/tendermint/proto/tendermint/types"
+       //tmproto "github.com/tendermint/tendermint/proto/tendermint/types"
        dbm "github.com/tendermint/tm-db"
 
        "github.com/CosmWasm/wasmd/x/wasm"
@@ -1029,12 +1029,13 @@ func NewJackalApp(
                if err := app.LoadLatestVersion(); err != nil {
                        tmos.Exit(fmt.Sprintf("failed to load latest version: %s", err))
                }
-               ctx := app.BaseApp.NewUncachedContext(true, tmproto.Header{})
-
+               //ctx := app.BaseApp.NewUncachedContext(true, tmproto.Header{})
                // Initialize pinned codes in wasmvm as they are not persisted there
+               /*
                if err := app.wasmKeeper.InitializePinnedCodes(ctx); err != nil {
                        tmos.Exit(fmt.Sprintf("failed initialize pinned codes %s", err))
                }
+               */
        }

data/memiavl.db/* fills up with data quickly after start

But fails near end of statesync restore where cosmwasm attempts to restore wasm files

INF Fetching snapshot chunk chunk=37 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=33 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=34 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=35 format=1 height=6802000 module=statesync total=38
INF Applied snapshot chunk to ABCI app chunk=36 format=1 height=6802000 module=statesync total=38
panic: uncached KVStore is not supposed to be accessed directly

goroutine 224 [running]:
github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
        github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
        github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8a160, 0xc000ef85d0}, {0x2aba7e0, 0xc0002d4340}, {0x2a8e690, 0xc0005ff320}, {0x2a84b60, 0xc00037c020}, {0x2a842a0, 0xc00022d378}, ...}, ...)
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79
github.com/CosmWasm/wasmd/x/wasm/keeper.finalizeV1({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:120 +0x98
github.com/CosmWasm/wasmd/x/wasm/keeper.(*WasmSnapshotter).processAllItems(0xc000121620, 0x1?, {0x2a84880, 0xc0444568a0}, 0x2832b38, 0x2832b28)
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:155 +0x2b5
github.com/CosmWasm/wasmd/x/wasm/keeper.(*WasmSnapshotter).Restore(0x1cd5340?, 0xc0000caff0?, 0xb664a70?, {0x2a84880?, 0xc0444568a0?})
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/snapshotter.go:96 +0x36
github.com/cosmos/cosmos-sdk/snapshots.(*Manager).restoreSnapshot(0xc000510850, {0x67ca50, 0x1, 0x26, {0xc012f9b2a0, 0x20, 0x20}, {{0xc04107c600, 0x26, 0x40}}}, ...)
        github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:314 +0x1f2
github.com/cosmos/cosmos-sdk/snapshots.(*Manager).Restore.func1()
        github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:271 +0x47
created by github.com/cosmos/cosmos-sdk/snapshots.(*Manager).Restore in goroutine 148
        github.com/cosmos/cosmos-sdk@v0.45.17/snapshots/manager.go:270 +0x205

Looks more like a cosmwasm + memiavl problem and not directly a cosmos-sdk v0.45 + memiavl problem.

Disabling pinned codes after wasm restore finalize
x/wasm/keeper/snapshotter.go

        
func finalizeV1(ctx sdk.Context, k *Keeper) error {
        return nil
        // FIXME: ensure all codes have been uploaded?
        //return k.InitializePinnedCodes(ctx)
}

But restore fails with appHash verification problem

ERR appHash verification failed actual="��\x1f�~t�=+\x06�@\x14\x15�}0-TmznǙ̶x\b\f<��" expected="F�\x1d+��:��\x1d�^��n)�\x14p\x7f\x1e\x17\x02f\x195�r�\b�a" module=statesync
ERR State sync failed err="snapshot restoration failed: verification failed" module=statesync

Have you tried memiavl with other cosmwasm enabled chains?

If you have a moment, can you suggest how to handle cosmwasm wasm file restores? If you don't see reason, please feedback so we can reach out to confio.

@chillyvee
Copy link
Contributor Author

chillyvee commented Mar 10, 2024

wasm InitializedPinnedCodes calls cosmos-sdk which calls rootmulti GetKVStore

github.com/crypto-org-chain/cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?})
        github.com/crypto-org-chain/cronos/store@v0.0.0-00010101000000-000000000000/rootmulti/store.go:231 +0x25
github.com/cosmos/cosmos-sdk/types.Context.KVStore({{0x2aa3b20, 0x3c0f940}, {0x7f419807b578, 0xc000dfcd00}, {{0x0, 0x0}, {0x0, 0x0}, 0x67ca50, {0x0, ...}, ...}, ...}, ...)
        github.com/cosmos/cosmos-sdk@v0.45.17/types/context.go:253 +0x92
github.com/CosmWasm/wasmd/x/wasm/keeper.Keeper.InitializePinnedCodes({{0x2a8a160, 0xc000ef85d0}, {0x2aba7e0, 0xc0002d4340}, {0x2a8e690, 0xc0005ff320}, {0x2a84b60, 0xc00037c020}, {0x2a842a0, 0xc00022d378}, ...}, ...)
        github.com/CosmWasm/wasmd@v0.32.0/x/wasm/keeper/keeper.go:890 +0x79

In this case cronos/store/rootmulti.(*Store).GetKVStore(0x0?, {0x0?, 0x0?}) immediately panics

Adding cosmwasm seems to require a way to directly get a named store for wasmd keeper to read and write to.

Should cronos/store/rootmulti.GetKVStore() proxy a call to CacheMultiStore to avoid panic: uncached KVStore is not supposed to be accessed directly?

Attempted to give direct access to stores in this commit:
chillyvee/cronos@4debbd6

This avoids wasm crashing on restore/startup

wasm files appear

But maybe the function call is wrong for GetKVStore so the apphash doesn't match after wasm restore.

Statesync restore without memiavl (enabled=false) succeeds:

INF Verified ABCI app appHash="��o&\">����O$_\r1:.�j�XZ\x02\bˀ6���J�" height=6810000 module=statesync
INF Snapshot restored format=1 hash="��|IB�c\f��P��TDM�)�M\x02)Q�/��\x18\x15��U" height=6810000 module=statesync

@yihuang
Copy link
Collaborator

yihuang commented Mar 10, 2024

uncached KVStore is not supposed to be accessed directly

You can try remove this check, it should be fine especially for read-only cases

@chillyvee
Copy link
Contributor Author

uncached KVStore is not supposed to be accessed directly

You can try remove this check, it should be fine especially for read-only cases

For statesync cosmwasm restore, write access is required. Can you recommend a different object to return?

@chillyvee
Copy link
Contributor Author

chillyvee commented Mar 10, 2024

It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's go.mod itself, for example, right now it uses v0.0.0-20240130060137-c12016e4052e:

Attempted to use the recommended branch/commit for v0.46 target using gitopia chain without cosmwasm

Without memiavl the chain statesync restore is successful

Followed the replace to utilize memiavl within the same branch
https://github.com/crypto-org-chain/cronos/blob/release/memiavl/v0.0.x/store/go.mod#L135

	github.com/crypto-org-chain/cronos/memiavl => ../memiavl

This branch downgrades the commit above to match cosmos-sdk v0.46.13
https://github.com/chillyvee/cronos/tree/gitopia/cv0.46.13store

This branch of gitopia uses the branch above
chillyvee/gitopia@3a4f8fe

On statesync restore, there is a apphash mismatch error

@yihuang
Copy link
Collaborator

yihuang commented Mar 11, 2024

It's a bit hard to manage the release, it'll be easier when using a standalone repo, I suggest following the Cronos's go.mod itself, for example, right now it uses v0.0.0-20240130060137-c12016e4052e:

Attempted to use the recommended branch/commit for v0.46 target using gitopia chain without cosmwasm

Without memiavl the chain statesync restore is successful

Followed the replace to utilize memiavl within the same branch https://github.com/crypto-org-chain/cronos/blob/release/memiavl/v0.0.x/store/go.mod#L135

	github.com/crypto-org-chain/cronos/memiavl => ../memiavl

This branch downgrades the commit above to match cosmos-sdk v0.46.13 https://github.com/chillyvee/cronos/tree/gitopia/cv0.46.13store

This branch of gitopia uses the branch above chillyvee/gitopia@3a4f8fe

On statesync restore, there is a apphash mismatch error

with sdk 0.46/0.45, the option memiavlSdk46Compact should be set to true, can you give that a try?
there's a breaking change between sdk 0.47 and older versions on root hash calculation, that's this option is all about.

@yihuang
Copy link
Collaborator

yihuang commented Mar 11, 2024

uncached KVStore is not supposed to be accessed directly

You can try remove this check, it should be fine especially for read-only cases

For statesync cosmwasm restore, write access is required. Can you recommend a different object to return?

Just tagged a new version, please update to memiavl v0.0.6, store v0.0.6, (versiondb v0.0.6 if used).
The above check is removed, there's no fundamental issue with accessing uncached stores, just be aware that you won't read uncommitted changes.

@chillyvee
Copy link
Contributor Author

with sdk 0.46/0.45, the option memiavlSdk46Compact should be set to true, can you give that a try?

Success on statesync for jackal, memiavlSdk46Compact = true
Success on statesync for gitopia, memiavlSdk46Compact = true

Just tagged a new version, please update to memiavl v0.0.6, store v0.0.6, (versiondb v0.0.6 if used).

Had some trouble during go mod tidy due to some cosmos-sdk dependencies. Will feedback later after resolved.

Tagging with matching cosmos-sdk versions can be helpful for using the right version. memiavl/cosmos-sdk0.46.13 for example. You can use the idea after memiavl is split into a dedicated repo.

@yihuang
Copy link
Collaborator

yihuang commented Mar 11, 2024

with sdk 0.46/0.45, the option memiavlSdk46Compact should be set to true, can you give that a try?

Success on statesync for jackal, memiavlSdk46Compact = true Success on statesync for gitopia, memiavlSdk46Compact = true

great!

Just tagged a new version, please update to memiavl v0.0.6, store v0.0.6, (versiondb v0.0.6 if used).

Had some trouble during go mod tidy due to some cosmos-sdk dependencies. Will feedback later after resolved.

Tagging with matching cosmos-sdk versions can be helpful for using the right version. memiavl/cosmos-sdk0.46.13 for example. You can use the idea after memiavl is split into a dedicated repo.

you'll need a little bit change in the versiondb setup code, please check the patch here: crypto-org-chain/cronos#1339

@chillyvee
Copy link
Contributor Author

you'll need a little bit change in the versiondb

Thank you. Will try again when there is time and share the result

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants