Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restarted W10 node sometimes becomes stuck adding new block (onPersist insufficient funds) #3273

Open
EdgeDLT opened this issue Dec 19, 2023 · 13 comments
Labels
bug Something isn't working I3 Minimal impact S4 Routine U3 Regular
Milestone

Comments

@EdgeDLT
Copy link

EdgeDLT commented Dec 19, 2023

Current Behavior

Attempting to start an out-of-sync node produces:

2023-12-19T15:57:41.356Z        WARN    blockQueue: failed adding block into the blockchain     {"error": "onPersist failed: VM has failed: at instruction 0 (SYSCALL): insufficient funds", "blockHeight": 4578499, "nextIndex": 4578500}

Expected Behavior

The node should successfully sync the next block.

Possible Solution

Something related to #3110 maybe? Not looked too deep into it.

Steps to Reproduce

  1. Fully or partially sync a node then kill it.
  2. Restart the node. If it progresses, kill it again.
  3. Repeat 2 until the error occurs.

Context

Ran into this issue with my indexer node. By default the node is not kept alive, it is fully synchronized then shut down once all operations are complete. Occasionally, but not always, starting the node again produces this error. If it does, no further progress can be made without a full resync.

Your Environment

Windows 10

> ./neogo -v
NeoGo
Version: 0.104.0
GoVersion: go1.21.4
@EdgeDLT EdgeDLT added the bug Something isn't working label Dec 19, 2023
@roman-khimov
Copy link
Member

Are you using Level or Bolt?

@EdgeDLT
Copy link
Author

EdgeDLT commented Dec 19, 2023

LevelDB.

@roman-khimov
Copy link
Member

That's more interesting. It obviously means a broken DB, but the way it's handled internally it's not trivial for it to get into this state. Yet if it happens, there should be a reason for that.

CC @532910.

@EdgeDLT
Copy link
Author

EdgeDLT commented Jan 3, 2024

Uncovered some memory issues on my machine in recent days. Could well be a hardware fault that is responsible, so if you can't replicate, don't spend too long on it.

@AnnaShaleva
Copy link
Member

AnnaShaleva commented Jan 12, 2024

I was able to reproduce a similar bug on my local machine under the following conditions:

  1. Low disk space left (< 100 Mb), LevelDB is used (BoltDB can't function at all in such restricted conditions).
  2. DB resync from trusted dump is started, but the process was killed by OOM killer.
  3. Free a bit of disk space.
  4. DB resync from trusted dump is resumed and in a while I got:
2024-01-12T11:46:43.686+0300	WARN	contract invocation failed	{"tx": "591473831fa9508870afb22d89e595edb40306ff49bc2373d6b9bbc87abbda0e", "block": 2913226, "error": "at instruction 1 (SYSCALL): method 'Callback' not found"}
2024-01-12T11:46:43.700+0300	WARN	contract invocation failed	{"tx": "40d7d6be4922724174072b7c60f0cce175a0d470a23235b02e584fafcb9d0ec7", "block": 2913281, "error": "at instruction 1 (SYSCALL): method 'Callback' not found"}
2024-01-12T11:46:44.011+0300	INFO	shutting down service	{"service": "Prometheus", "endpoint": "[::]:2112"}
2024-01-12T11:47:21.743+0300	INFO	persisted to disk	{"blocks": 5852, "keys": 118232, "headerHeight": 2876784, "blockHeight": 2876783, "took": "43.997511933s"}
2024-01-12T11:47:36.029+0300	INFO	persisted to disk	{"blocks": 37886, "keys": 774298, "headerHeight": 2914670, "blockHeight": 2914669, "took": "14.285445886s"}
failed to add block 2914670: onPersist failed: VM has failed: at instruction 0 (SYSCALL): insufficient funds

I also was able to reproduce the same problem during network syncing with low disk space, so from my side this problem also seems to be connected with hardware problems.

@roman-khimov
Copy link
Member

Do you have the DB still? Can we check what's the state difference? It shouldn't happen this way, whatever written must be consistent, it's a single transaction after all (or the DB itself is broken if it can't ensure consistency on commit failure because of limited disk space).

@AnnaShaleva
Copy link
Member

AnnaShaleva commented Jan 12, 2024

I don't, I've removed the DB to test the release compatibility, but for some reason I suspect that the problem can be reproduced. I'll try to reproduce it and save the DB to look at the diffs.

@roman-khimov roman-khimov modified the milestones: v0.106.0, v0.107.0 Mar 22, 2024
@fyfyrchik
Copy link
Contributor

I have reproduced the same problem with boltdb, but can't share the database, unfortunately.
My plan is to look into all the keys we have and look at the diff between corrupted db and freshly restored one.

@roman-khimov @AnnaShaleva Do you have any specific place in mind to look into?

@fyfyrchik
Copy link
Contributor

fyfyrchik commented Apr 5, 2024

Oh, I was too soon to write, my problem seems a bit different:

2024-04-05T14:33:03.605Z        WARN        blockQueue: failed adding block into the blockchain        {"error": "transaction 66385f00f74653c40ff2b0aaa07b394670abf853977ac309755a5c86d4e36aca failed to verify: insufficient funds", "blockHeight": 36734, "nextIndex": 36735}

But there are other nodes which are OK, so it seems an application/db issue, likely related to this one.

@roman-khimov
Copy link
Member

Zero ideas on this one. It'd be cool if you'll be able to at least do the diff between proper state and whatever is in this broken DB. Luckily, your chain doesn't seem to be long.

@fyfyrchik
Copy link
Contributor

fyfyrchik commented Apr 8, 2024

My version of neo-go is 0.104 with some commits after (base at 441eb8a) with fixes from #3279 (we reverted and reapplied #3110, but there were no conflicts so I do not expect problems here) and some unrelated improvements in the network server (mostly for multi-IP and TLS)

  1. I have a supposedly corruppted DB which contains blocks up to 36734 and some headers after.
  2. From this node I have exported chain dump.
  3. I restored dump up to 36734 block to get a fresh database.
  4. Dumped all they keys in hex to a file and compared diff:

Sanity check: values at 0xc1 are different as corrupted node has acquired some headers, values at 0xc0 are the same.

That being said here are seemingly relevant diffs stored by 0x70 prefix (contract data):
Native neo (0xfbffffff == -5)
0x17 = 23 = prefixVoterRewardPerCommittee

< 70fbffffff0e: 400841022821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff21004102282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac09021004102282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c2100410228210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9210041022821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c21004102282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab1908221004102282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b721004102282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b16372100
< 70fbffffff1410ee5df183cc0f45e9beda75964628ee0471f9f8: 4104210420bcbe002103778f002821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c21060080d8ce6905
< 70fbffffff143a017ecb1a4f0afd95e28a856109c006db91a2e2: 4104210420bcbe0021037e8f002821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff21060080d8ce6905
< 70fbffffff145061b45aa4b5fa5293310a5300f0968099f4d145: 4104210420bcbe0021037d8f00282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac09021060080d8ce6905
< 70fbffffff145a296b9b7035d8c1dc74ef80bb7ea35be8decfa0: 4104210420bcbe002103598f00282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c21060080d8ce6905
< 70fbffffff145fd4c6252239f6d810987dd4344ed8fe6cf2a5f8: 4104210420bcbe0021037b8f0028210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c921060080d8ce6905
< 70fbffffff1485d8114bf1c98b3cb0511d335e95bbc5577d241b: 4104210420bcbe0021036f8f00282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b721060080d8ce6905
< 70fbffffff14d6271d847b9f5c752b774f783c3ec540b87b0856: 4104210420bcbe0021036a8f00282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b163721060080d8ce6905
< 70fbffffff14f33a2a2297f7f25642289180eb8f015ab9f8458a: 4104210420bcbe002103398f00282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab1908221060080d8ce6905
< 70fbffffff17020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff: 0080d8ce6905
< 70fbffffff1702731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090: 0080d8ce6905
< 70fbffffff1702780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c: 0080d8ce6905
< 70fbffffff170320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9: 0080d8ce6905
< 70fbffffff17034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c: 0080d8ce6905
< 70fbffffff1703ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082: 0080d8ce6905
< 70fbffffff1703d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7: 0080d8ce6905
< 70fbffffff1703e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637: 0080d8ce6905
---
> 70fbffffff0e: 400841022821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff210420bcbe00410228210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9210420bcbe0041022821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c210420bcbe004102282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090210420bcbe004102282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c210420bcbe004102282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082210420bcbe004102282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7210420bcbe004102282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637210420bcbe00
> 70fbffffff1410ee5df183cc0f45e9beda75964628ee0471f9f8: 4104210420bcbe002103778f002821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c210600205917580d
> 70fbffffff143a017ecb1a4f0afd95e28a856109c006db91a2e2: 4104210420bcbe0021037e8f002821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff2106004015d6580d
> 70fbffffff145061b45aa4b5fa5293310a5300f0968099f4d145: 4104210420bcbe0021037d8f00282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac0902106004015d6580d
> 70fbffffff145a296b9b7035d8c1dc74ef80bb7ea35be8decfa0: 4104210420bcbe002103598f00282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c210600c024db550d
> 70fbffffff145fd4c6252239f6d810987dd4344ed8fe6cf2a5f8: 4104210420bcbe0021037b8f0028210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c92106004015d6580d
> 70fbffffff1485d8114bf1c98b3cb0511d335e95bbc5577d241b: 4104210420bcbe0021036f8f00282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7210600009d58570d
> 70fbffffff14d6271d847b9f5c752b774f783c3ec540b87b0856: 4104210420bcbe0021036a8f00282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637210600009d58570d
> 70fbffffff14f33a2a2297f7f25642289180eb8f015ab9f8458a: 4104210420bcbe002103398f00282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab190822106004034e0520d
> 70fbffffff17020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff: 004015d6580d
> 70fbffffff1702731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090: 004015d6580d
> 70fbffffff1702780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c: 004015d6580d
> 70fbffffff170320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9: 004015d6580d
> 70fbffffff17034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c: 004015d6580d
> 70fbffffff1703ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082: 004015d6580d
> 70fbffffff1703d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7: 004015d6580d
> 70fbffffff1703e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637: 004015d6580d

Native notary (0xf6ffffff == -10):

< 70f6ffffff013e93a75eaa8181d369821e4954e179f6a1731d82: 41022105973a799b072105ffffffff00
---
> 70f6ffffff013e93a75eaa8181d369821e4954e179f6a1731d82: 41022105df21f8e4082105ffffffff00
2657172c2757143
< 70f6ffffff01829cb414d5790e699a889db54a0d6b5d29bf2f08: 410221066b73b712e6002105ffffffff00
---
> 70f6ffffff01829cb414d5790e699a889db54a0d6b5d29bf2f08: 410221061820a8850c012105ffffffff00
2657198c2757169
< 70f6ffffff01ac77a8afa4216f24b548e236428d03024bca86ab: 41022106ff1db1f4e4002105ffffffff00
---
> 70f6ffffff01ac77a8afa4216f24b548e236428d03024bca86ab: 4102210630c082270b012105ffffffff00
2657212c2757183
< 70f6ffffff01c0cf46845aee71fcc27fd062253d547a83be83a3: 41022106a342145ae5002105ffffffff00
---
> 70f6ffffff01c0cf46845aee71fcc27fd062253d547a83be83a3: 41022106ada5bc9f0b012105ffffffff00
2657219c2757190
< 70f6ffffff01c857e2afbf2d669ede44ebf1a0ce022c54497c04: 41022105654b581c092105ffffffff00
---
> 70f6ffffff01c857e2afbf2d669ede44ebf1a0ce022c54497c04: 410221057f25f0a20a2105ffffffff00
2657224c2757195
< 70f6ffffff01d8da4f4721649e2a8144dbf183e8b3e030c8e6cd: 410221063c8a0135e5002105ffffffff00
---
> 70f6ffffff01d8da4f4721649e2a8144dbf183e8b3e030c8e6cd: 410221069aed55720b012105ffffffff00
2657235c2757206
< 70f6ffffff01e43c77ed610894ee38215cd9fd56eb2166dca74c: 41022106793bdc12e6002105ffffffff00
---
> 70f6ffffff01e43c77ed610894ee38215cd9fd56eb2166dca74c: 4102210605d7bf790c012105ffffffff00

There is also diff for GAS contract, it is a bit bigger, won't attach here.

Script I used to produce files:

func TestExport(t *testing.T) {
	runExport(t, "./mainnet.bolt")
	runExport(t, "./corrupted.mainnet.bolt")
}

func runExport(t *testing.T, path string) {
	db, err := bbolt.Open(path, os.ModePerm, &bbolt.Options{
		Timeout:  time.Second,
		ReadOnly: true,
	})
	require.NoError(t, err)
	defer db.Close()

	f, err := os.OpenFile(path+".keys", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, os.ModePerm)
	require.NoError(t, err)
	defer f.Close()

	require.NoError(t, db.View(func(tx *bbolt.Tx) error {
		printBucket(f, tx.Bucket([]byte("DB")), "")
		return nil
	}))
}

func printBucket(f *os.File, b *bbolt.Bucket, indent string) {
	c := b.Cursor()
	for k, v := c.First(); k != nil; k, v = c.Next() {
		if v != nil {
			f.WriteString(indent)
			f.WriteString(hex.EncodeToString(k))
			f.WriteString(": ")
			f.WriteString(hex.EncodeToString(v))
			f.WriteString("\n")
		} else {
			printBucket(f, b.Bucket(k), indent+hex.EncodeToString(k)+" : ")
		}
	}
}

AnnaShaleva added a commit that referenced this issue May 17, 2024
Account is blocked when it's in the Policy's storage. Introduced in
bbbc680.

This bug leads to the fact that during native Neo cache initialization
at the last block in the dBFT epoch, all candidates accounts are
"blocked", and thus, stand-by committee and validators are used in the
subsequent new epoch. Close #3424.

This bug may lead to the consequenses described in #3273, but it needs
to be confirmed.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
AnnaShaleva added a commit that referenced this issue May 17, 2024
Account is blocked when it's in the Policy's storage, not when it's
missing from the Policy storage. Introduced in
bbbc680.

This bug leads to the fact that during native Neo cache initialization
at the last block in the dBFT epoch, all candidates accounts are
"blocked", and thus, stand-by committee and validators are used in the
subsequent new epoch. Close #3424.

This bug may lead to the consequenses described in #3273, but it needs
to be confirmed.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
AnnaShaleva added a commit that referenced this issue May 17, 2024
Account is blocked when it's in the Policy's storage, not when it's
missing from the Policy storage. Introduced in
bbbc680.

This bug leads to the fact that during native Neo cache initialization
at the last block in the dBFT epoch, all candidates accounts are
"blocked", and thus, stand-by committee and validators are used in the
subsequent new epoch. Close #3424.

This bug may lead to the consequences described in #3273, but it needs
to be confirmed.

Signed-off-by: Anna Shaleva <shaleva.ann@nspcc.ru>
@AnnaShaleva
Copy link
Member

@EdgeDLT, @fyfyrchik, please take a look at the #3443. It should fix the problem, but it needs to be confirmed, thus I'm not closing this issue.

@AnnaShaleva AnnaShaleva modified the milestones: v0.107.0, v0.106.1 May 21, 2024
@AnnaShaleva
Copy link
Member

@EdgeDLT, @fyfyrchik, 0.106.0 is out, you may try it. Please, write back if succeeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working I3 Minimal impact S4 Routine U3 Regular
Projects
None yet
Development

No branches or pull requests

4 participants