Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove panics in the library code #447

Open
cenkalti opened this issue Apr 1, 2023 · 5 comments
Open

Remove panics in the library code #447

cenkalti opened this issue Apr 1, 2023 · 5 comments

Comments

@cenkalti
Copy link
Member

cenkalti commented Apr 1, 2023

A library should not panic at all. Panics are usually not handled in Go programs and it makes the whole program crashes when a library function panics.

Currently there are 12 open issues involving panic, especially working with corrupted databases. BoltDB should return the error as an nice message.

Overview of panics from boltdb code:

% ag 'panic\(.*\)' | grep -v _test.go
freelist_hmap.go:167:           panic("pgids not sorted")
db.go:1067:     panic("bolt.DB.meta(): invalid meta pages")
db.go:1151:                     panic("freepages: failed to rollback tx")
db.go:1155:             panic("freepages: failed to open read only tx")
db.go:1163:                     panic(fmt.Sprintf("freepages: failed to get all reachable pages (%v)", e))
internal/common/page.go:136:                    panic(fmt.Sprintf("leading element count %d overflows int", c))
internal/common/page.go:361:            panic(fmt.Errorf("mergepgids bad len %d < %d + %d", len(dst), len(a), len(b)))
internal/common/meta.go:42:             panic(fmt.Sprintf("root bucket pgid (%d) above high water mark (%d)", m.root.root, m.pgid))
internal/common/meta.go:45:             panic(fmt.Sprintf("freelist pgid (%d) above high water mark (%d)", m.freelist, m.pgid))
freelist.go:119:                        panic(fmt.Sprintf("invalid page allocation: %d", id))
freelist.go:157:                panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.Id()))
freelist.go:177:                        panic(fmt.Sprintf("page %d already freed", id))
freelist.go:269:                panic(fmt.Sprintf("invalid freelist page: %d, page type is %s", p.Id(), p.Typ()))
cursor.go:276:          panic(fmt.Sprintf("invalid page type: %d: %x", p.Id(), p.Flags()))
internal/common/utils.go:11:            panic(fmt.Sprintf("assertion failed: "+msg, v...))
internal/btesting/btesting.go:110:              panic("Please call Close() before MustReopen()")
bucket.go:563:                  panic(fmt.Sprintf("misplaced bucket header: %x -> %x", []byte(name), k))
bucket.go:566:                  panic(fmt.Sprintf("unexpected bucket header flag: %x", flags))
bucket.go:584:          panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.Pgid()))
bucket.go:717:                  panic(fmt.Sprintf("inline bucket non-zero page access(2): %d != 0", id))
tx.go:212:                      panic("check fail: " + strings.Join(errs, "\n"))
mlock_windows.go:5:     panic("mlock is supported only on UNIX systems")
mlock_windows.go:10:    panic("munlock is supported only on UNIX systems")
node.go:76:             panic(fmt.Sprintf("invalid childAt(%d) on a leaf node", index))
node.go:119:            panic(fmt.Sprintf("pgId (%d) above high water mark (%d)", pgId, n.bucket.tx.meta.Pgid()))
node.go:121:            panic("put: zero-length old key")
node.go:123:            panic("put: zero-length new key")
node.go:190:            panic(fmt.Sprintf("inode overflow: %d (pgid=%d)", len(n.inodes), p.Id()))
node.go:331:                    panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", p.Id(), tx.meta.Pgid()))
@infdahai
Copy link

infdahai commented May 4, 2023

I fix this.

@ptabor
Copy link
Contributor

ptabor commented May 4, 2023

A library should not panic at all.

I don't agree with this statement in this specific context:

  • In general libraries should not panic to communicate a recoverable error, that e.g. is caused by the wrong invocation of the library.
  • In case of assertions that signal unexpected condition that is not recoverable, I think that panic is significantly better than the error that:
    a) might get ignored
    b) might mean that the system will keep operating on the partially corrupted state (no clean rollback after the error or risk of hitting the same conditions again and again, leading to resource starvation)

Looking at the error list above... most mean that the underlying bbolt file is corrupted or inconsistent.
The chances of recovery of such file are big if the file is not subject to continous mutations and the problem will be surfaced soon after the corruption happens.

Moreover panic in golang is not 'terminal'. If the user of the library is determined to continue (e.g. because of hosting multiple instances in a single process), the panic can get explicitly recovered.

To summarise, I recommend:

  1. Keep it panic whenever there is chance we just corrupted DB
  2. Keep it error when it's user fault (btesting.go:110: panic("Please call Close() before MustReopen()") ? )
  3. There is middle-ground for this rules: When you open in RO mode already corrupted file and the check fails: There is slim chance you just corrupted the file, so it's a reason for error. At the same time keeping all code to be consistent around panicking on data inconsistency is easier to maintain. What we can do, is to warn callers that the DB.Open operation might panic when the file is corrupted and the user needs to 'recovery' if they need.

@infdahai
Copy link

infdahai commented May 4, 2023

I try this in infdahai@7d07d9f. And I agree with @ptabor said. We can replace some panics with error messages, such as panic(fmt.Sprintf("page %d already freed", id)). But other panics are good indications of the DB crash.

@ptabor
Copy link
Contributor

ptabor commented May 4, 2023

"page %d already freed" is also indicator of either bug in freelist implementation or memory stomping or ram/cpu corruption. It shouldn't happen regardless of bbolt lib user actions.

@cenkalti
Copy link
Member Author

cenkalti commented May 8, 2023

keeping all code to be consistent around panicking on data inconsistency is easier to maintain

I definitely agree on this. However, I think this should not be the default behavior. Think about these 2 cases:

  • BoltDB may be part of a bigger program that other parts has nothing to do with the database and can continue to operate normally.
  • The corruption only causes an error in write path (i.e. freelist structure). Read operations should be able to continue to operate normally.

Panicking takes the whole program down which is not an ideal decision for a library in my opinion.

I propose this:

  • When corruption is detected, instead of panicking, return a CorruptionError to the caller.
  • Mark the db as corrupted with a boolean flag. In this state, database switches into read-only mode.
  • All subsequent mutation operations on db will return CorruptionError wrapping the original error.

There is also a special case. Recover only works when called from the same goroutine as the panic is called in. For example:

bbolt/db.go

Lines 1178 to 1182 in 56e033b

go func() {
for e := range ech {
panic(fmt.Sprintf("freepages: failed to get all reachable pages (%v)", e))
}
}()

This code block executes when opening a db and it's not possible to recover that panic.


In overall, I see corruption errors as non-fatal. I think it's nicer to let the user decide by returning regular errors.

@cenkalti cenkalti self-assigned this May 18, 2023
@github-actions github-actions bot added the stale label Apr 17, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024
@ahrtr ahrtr reopened this May 9, 2024
@ahrtr ahrtr removed the stale label May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

4 participants