Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restoring state-sync snapshot takes a long time #698

Open
webmaster128 opened this issue Jan 22, 2023 · 5 comments
Open

Restoring state-sync snapshot takes a long time #698

webmaster128 opened this issue Jan 22, 2023 · 5 comments

Comments

@webmaster128
Copy link
Member

I recently started to play with using state-sync to create a fresh node in the Nois testnet. There we observed that applying the last chunk of the snapshot takes approximately 1 hour and requires something between 13 and 14 GB of memory. As many community members consider this unexpectedly long, I though it's worth debugging what's going on here.

Network

  • Chain ID: nois-testnet-003
  • Started: 2022-10-07 (~3.5 months ago)
  • Height: 3.4 million
  • Node: wasmd 0.29.0-rc2 with custom build script, including cosmos-sdk v0.45.8, iavl v0.19.1, tendermint v0.34.21, tm-db v0.6.7

Snapshot

  • Height: 3404000
  • Chunks: 18 (i.e. ~180 MB in total)

System and utalization

  • VPS, 4 vCPU, 16GB Bildschirmfoto 2023-01-22 um 22 33 24
  • Swap: none
  • All 4 cores are under 90-100% load
  • Memory usage raises up to 13 GB, sometimes slights above. In earlier attempts with 8 GB or 4 GB machines the node process was killed due to OOM.

Logs

The following logs show how the snapshot is downloaded easily, the fist 17 chunks are applied within 1 minute each and the last chunk takes 62 minutes to apply.

Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="�ܸ\vU\x15���>Vx��n�ʜ|���\x1a\x18�����V�F" height=3402000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash=" K�֍�T�\x11�u���\x1bUch�������w�6N+�\v" height=3400000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="���w!�7�֗�e��bDBd'�=\x1aJ�F��\b�4�T" height=3398000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="\x01ƹ\\'��\x17+�f�X�i�Y�,�S,p:3~Ҧ�g\x1b�" height=3396000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="��,�\x02��\aj�\t\x7f�����7��G4\x16�m�01ʖȟ" height=3236000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="q�QI�}\x04_D�����S�{�I�\x19�dA� *�-g\n�" height=3234000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="����{\x02\"��6sm\x01�p\x05~҂�f+\x00O4�&\x04��c�" height=2508000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="O�(�t��B��FI\f*3��J��v�\x01\x1dIz8��\x1e��" height=3400500 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="\x18�p�����~�f �@�q��mב�\x17��ֈ�[�I�" height=3397500 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="��\a�Df\x03c\x0f����ԑ�\x11\x0e)\x1d>�I���J\x1c�\x1c�h" height=2914000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="=[aO�+ʤ�$���MA����!-:\b�D��� ػ�" height=2904000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=A0A5D1A7BF2AF5B58E6E8414A0830E1DD105E33E7EF471452DE8058DA9EC535A height=3404002 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Offering snapshot to ABCI app format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Snapshot accepted, restoring format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=2 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=0 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=1 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=3 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=1152E87653548B640646D894483B3177A923A7B91BD687F0BA585DBCED8A1E83 height=3404000 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=A0A5D1A7BF2AF5B58E6E8414A0830E1DD105E33E7EF471452DE8058DA9EC535A height=3404002 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=1152E87653548B640646D894483B3177A923A7B91BD687F0BA585DBCED8A1E83 height=3404000 module=light
Jan 22 19:48:00 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=4 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:02 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=5 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:02 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=0 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=6 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=1 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=2 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:06 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=7 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:06 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=3 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:08 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=8 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:08 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=4 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:10 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=9 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:10 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=5 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:12 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=10 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:13 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=6 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:14 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=11 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:16 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=12 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:18 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=13 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:20 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=14 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:22 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=15 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:24 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=16 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:25 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=17 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:55 syncer noisd[7744]: 7:48PM INF Discovered new snapshot format=1 hash="�b%�\x02t����\\\x18�۾��D�\x12-���h���t���" height=3300000 module=statesync
Jan 22 19:49:23 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=7 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="\x1c\n�w�E2.T�5�ٽX\x0f`�����iV��ҶZ=\x06\n" height=3348000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="��\fd�H\x17(��3�5\x04ME�|\x1b%�à�\x04-G��\x1cǫ" height=3270000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="�u���.S\t��>\x04�pk����D]��Q\x14W\b�G�M�" height=3268000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="�\f\x1c\x1c���W̪Kf}�^�����0�5�!��n\x13�d\x11" height=3266000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="��\x14��ݡ\x10�\x1fǺ@_\x02l��y^�\x17\x1d,\x14��Q�'�k" height=3264000 module=statesync
Jan 22 19:49:32 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=8 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:34 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=9 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:59 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=10 format=1 height=3404000 module=statesync total=18
Jan 22 19:50:33 syncer noisd[7744]: 7:50PM INF Applied snapshot chunk to ABCI app chunk=11 format=1 height=3404000 module=statesync total=18
Jan 22 19:51:12 syncer noisd[7744]: 7:51PM INF Applied snapshot chunk to ABCI app chunk=12 format=1 height=3404000 module=statesync total=18
Jan 22 19:51:50 syncer noisd[7744]: 7:51PM INF Applied snapshot chunk to ABCI app chunk=13 format=1 height=3404000 module=statesync total=18
Jan 22 19:52:47 syncer noisd[7744]: 7:52PM INF Applied snapshot chunk to ABCI app chunk=14 format=1 height=3404000 module=statesync total=18
Jan 22 19:53:40 syncer noisd[7744]: 7:53PM INF Applied snapshot chunk to ABCI app chunk=15 format=1 height=3404000 module=statesync total=18
Jan 22 19:54:26 syncer noisd[7744]: 7:54PM INF Applied snapshot chunk to ABCI app chunk=16 format=1 height=3404000 module=statesync total=18
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="\"�\r-�d�\b�>�)������\r\x04\x11��\a�\x11�N��ȡ" height=3392000 module=statesync
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="��Uޔ�s��-\x03���\x1bV\x1a��㢵!�T\x05L���\"\x7f" height=3388000 module=statesync
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="HF�}\x13ˉ���uh=�����J�v��҂�۰\a!@Q" height=3384000 module=statesync
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Applied snapshot chunk to ABCI app chunk=17 format=1 height=3404000 module=statesync total=18
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Verified ABCI app appHash="�\\�oEا(b�\af���fԅ��\x1e�����Y�m���" height=3404000 module=statesync
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Snapshot restored format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync

Data/wasm directory

  • Size:

    $ du -hs ~/.noisd/data/*
    3.1G	/root/.noisd/data/application.db
    28M	/root/.noisd/data/blockstore.db
    125M	/root/.noisd/data/cs.wal
    16K	/root/.noisd/data/evidence.db
    4.0K	/root/.noisd/data/priv_validator_state.json
    20K	/root/.noisd/data/snapshots
    14M	/root/.noisd/data/state.db
    15M	/root/.noisd/data/tx_index.db
    
  • Element count:

    ~/.noisd/data/application.db    31441423
    ~/.noisd/data/blockstore.db     10073
    ~/.noisd/data/evidence.db       1
    ~/.noisd/data/state.db          6050
    ~/.noisd/data/tx_index.db       458683
    
  • Wasm files are restored using the extension from wasmd. But those are only 11 small files stored to disk:

    ls -lAh ~/.noisd/wasm/wasm/state/wasm/
    total 4.0M
    -rw-r--r-- 1 root root 482K Jan 22 20:56 298c8955ce09b32f261b824765d44e6892f9ae2f03fcd926d83ac47748f328d3
    -rw-r--r-- 1 root root 200K Jan 22 20:56 6cec0fbc37e5c5da8774c344d5811989ce2178d8f2a17070120ad594a22bdce2
    -rw-r--r-- 1 root root 421K Jan 22 20:56 70b58a1bcf77580016c4e8831041b617823073f27fee9f276ab171166c516c7f
    -rw-r--r-- 1 root root 158K Jan 22 20:56 83f24155b5120a00c8f31aaefaa47f0a962b068ef5fa72950c78dab94c3a9272
    -rw-r--r-- 1 root root 239K Jan 22 20:56 97c168d786d30a3d0cf1b94e488191842d0119c56db40fdacd57f9083bb798ea
    -rw-r--r-- 1 root root 477K Jan 22 20:56 b999602332b97208d4cab78bbe7aea859153a12562eed1fced55816b1be9c197
    -rw-r--r-- 1 root root 129K Jan 22 20:56 b9bb582396128a14936a1bc6c15ddb43d604bd70acd2260be9a2bd0a4a7d272c
    -rw-r--r-- 1 root root 480K Jan 22 20:56 dfbc61e0cf078911383b2fa286adeee4b10842df2c9f9f70d73b6dea3f42c243
    -rw-r--r-- 1 root root 207K Jan 22 20:56 ea99f400369f52a5472ff3cbf7f9a2c36e3d382d2ab1261bcc0e9db2ad4b10be
    -rw-r--r-- 1 root root 620K Jan 22 20:56 edce62230570224683397fdb9f7df19ee7f807b9960ecf1b40ab2efe5879ace8
    -rw-r--r-- 1 root root 626K Jan 22 20:56 f960f6a7560c45fbb2cbbe9d207d291653bbab1fddf920b2a2e9115b0e3ac9aa
    

Profiling

Some initial profiling data provided by @jhernandezb shows a lot of compating activity. Maybe he can elaborate more on the profiling of things.

Screenshot_2023-01-21_at_10 27 14_AM

Screenshot_2023-01-21_at_10 27 39_AM

Other notes

  • Validator @polkachu reports snapshot restore times in the order of magnitude of 1 hour for Nois testnet, Sei and EVMOS. Other chains are much faster.
  • Stargazes has way more .wasm codes but their state syncs are significantly faster
  • The 31 million KV entries in application.db might be more than typical
  • In this discussion, a user reports significant write slowdowns when a certain number of keys is reached

My gut feeling here is that at a certain number of database elements, things become slow. But I am far from a Go profiling expert or database expert and don't know how to debug this further. For us it is not a big deal right now and we can try to use less KV elements for the app. But maybe this is a good chance to prevent other mainnets running into problems.

@PolkachuIntern
Copy link

Our state-sync scripts run every Sunday, so this issue is rather timely. Nois testnet state-sync took a bit over an hour today. In contrast, Stargaze mainnet state-sync took about 10 minutes. I am not knowledgable enough to articulate the reasons behind it. As Simon mentioned, Evmos and Sei state-sync is also slow and at times unreliable.

@kaisbaccour
Copy link

Would it make sense to make the inventory of the number of application.db KV entries for all these networks and some other very busy networks to say if the rule stands or no?

@jhernandezb
Copy link

jhernandezb commented Jan 23, 2023

Adding profiles:
cpu.pb.gz
mem.pb.gz

Binary can be built using:
https://github.com/noislabs/full-node/tree/main

One thing to mention is that stargaze still has fast node disabled by default (will be changed to enabled in our next upgrade).

Seems that fast node is contributing to more cpu cycles and garbage collection

@tac0turtle
Copy link
Member

tac0turtle commented Mar 6, 2023

there are a couple issue in iavl to assist with this. Here is one pr that aims to increase the speed. #664.

This issue could be to how large your current state is? ill transfer this issue to iavl so it can be focused on where the actual problem may lie. Lots of the compacting is being cleaned up as we speak.

@tac0turtle tac0turtle transferred this issue from cosmos/cosmos-sdk Mar 6, 2023
@tac0turtle
Copy link
Member

with iavlv1 this should be significantly faster. Would you want to test with this version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants