Remove panics in the library code #447

cenkalti · 2023-04-01T00:19:49Z

A library should not panic at all. Panics are usually not handled in Go programs and it makes the whole program crashes when a library function panics.

Currently there are 12 open issues involving panic, especially working with corrupted databases. BoltDB should return the error as an nice message.

Overview of panics from boltdb code:

% ag 'panic\(.*\)' | grep -v _test.go
freelist_hmap.go:167:           panic("pgids not sorted")
db.go:1067:     panic("bolt.DB.meta(): invalid meta pages")
db.go:1151:                     panic("freepages: failed to rollback tx")
db.go:1155:             panic("freepages: failed to open read only tx")
db.go:1163:                     panic(fmt.Sprintf("freepages: failed to get all reachable pages (%v)", e))
internal/common/page.go:136:                    panic(fmt.Sprintf("leading element count %d overflows int", c))
internal/common/page.go:361:            panic(fmt.Errorf("mergepgids bad len %d < %d + %d", len(dst), len(a), len(b)))
internal/common/meta.go:42:             panic(fmt.Sprintf("root bucket pgid (%d) above high water mark (%d)", m.root.root, m.pgid))
internal/common/meta.go:45:             panic(fmt.Sprintf("freelist pgid (%d) above high water mark (%d)", m.freelist, m.pgid))
freelist.go:119:                        panic(fmt.Sprintf("invalid page allocation: %d", id))
freelist.go:157:                panic(fmt.Sprintf("cannot free page 0 or 1: %d", p.Id()))
freelist.go:177:                        panic(fmt.Sprintf("page %d already freed", id))
freelist.go:269:                panic(fmt.Sprintf("invalid freelist page: %d, page type is %s", p.Id(), p.Typ()))
cursor.go:276:          panic(fmt.Sprintf("invalid page type: %d: %x", p.Id(), p.Flags()))
internal/common/utils.go:11:            panic(fmt.Sprintf("assertion failed: "+msg, v...))
internal/btesting/btesting.go:110:              panic("Please call Close() before MustReopen()")
bucket.go:563:                  panic(fmt.Sprintf("misplaced bucket header: %x -> %x", []byte(name), k))
bucket.go:566:                  panic(fmt.Sprintf("unexpected bucket header flag: %x", flags))
bucket.go:584:          panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.Pgid()))
bucket.go:717:                  panic(fmt.Sprintf("inline bucket non-zero page access(2): %d != 0", id))
tx.go:212:                      panic("check fail: " + strings.Join(errs, "\n"))
mlock_windows.go:5:     panic("mlock is supported only on UNIX systems")
mlock_windows.go:10:    panic("munlock is supported only on UNIX systems")
node.go:76:             panic(fmt.Sprintf("invalid childAt(%d) on a leaf node", index))
node.go:119:            panic(fmt.Sprintf("pgId (%d) above high water mark (%d)", pgId, n.bucket.tx.meta.Pgid()))
node.go:121:            panic("put: zero-length old key")
node.go:123:            panic("put: zero-length new key")
node.go:190:            panic(fmt.Sprintf("inode overflow: %d (pgid=%d)", len(n.inodes), p.Id()))
node.go:331:                    panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", p.Id(), tx.meta.Pgid()))

The text was updated successfully, but these errors were encountered:

infdahai · 2023-05-04T13:56:51Z

I fix this.

ptabor · 2023-05-04T14:20:14Z

A library should not panic at all.

I don't agree with this statement in this specific context:

In general libraries should not panic to communicate a recoverable error, that e.g. is caused by the wrong invocation of the library.
In case of assertions that signal unexpected condition that is not recoverable, I think that panic is significantly better than the error that:
a) might get ignored
b) might mean that the system will keep operating on the partially corrupted state (no clean rollback after the error or risk of hitting the same conditions again and again, leading to resource starvation)

Looking at the error list above... most mean that the underlying bbolt file is corrupted or inconsistent.
The chances of recovery of such file are big if the file is not subject to continous mutations and the problem will be surfaced soon after the corruption happens.

Moreover panic in golang is not 'terminal'. If the user of the library is determined to continue (e.g. because of hosting multiple instances in a single process), the panic can get explicitly recovered.

To summarise, I recommend:

Keep it panic whenever there is chance we just corrupted DB
Keep it error when it's user fault (btesting.go:110: panic("Please call Close() before MustReopen()") ? )
There is middle-ground for this rules: When you open in RO mode already corrupted file and the check fails: There is slim chance you just corrupted the file, so it's a reason for error. At the same time keeping all code to be consistent around panicking on data inconsistency is easier to maintain. What we can do, is to warn callers that the DB.Open operation might panic when the file is corrupted and the user needs to 'recovery' if they need.

infdahai · 2023-05-04T15:20:23Z

I try this in infdahai@7d07d9f. And I agree with @ptabor said. We can replace some panics with error messages, such as panic(fmt.Sprintf("page %d already freed", id)). But other panics are good indications of the DB crash.

ptabor · 2023-05-04T21:01:00Z

"page %d already freed" is also indicator of either bug in freelist implementation or memory stomping or ram/cpu corruption. It shouldn't happen regardless of bbolt lib user actions.

cenkalti · 2023-05-08T20:09:31Z

keeping all code to be consistent around panicking on data inconsistency is easier to maintain

I definitely agree on this. However, I think this should not be the default behavior. Think about these 2 cases:

BoltDB may be part of a bigger program that other parts has nothing to do with the database and can continue to operate normally.
The corruption only causes an error in write path (i.e. freelist structure). Read operations should be able to continue to operate normally.

Panicking takes the whole program down which is not an ideal decision for a library in my opinion.

I propose this:

When corruption is detected, instead of panicking, return a CorruptionError to the caller.
Mark the db as corrupted with a boolean flag. In this state, database switches into read-only mode.
All subsequent mutation operations on db will return CorruptionError wrapping the original error.

There is also a special case. Recover only works when called from the same goroutine as the panic is called in. For example:

bbolt/db.go

Lines 1178 to 1182 in 56e033b

    
           go func() { 
        
           	for e := range ech { 
        
           		panic(fmt.Sprintf("freepages: failed to get all reachable pages (%v)", e)) 
        
           	} 
        
           }()

This code block executes when opening a db and it's not possible to recover that panic.

In overall, I see corruption errors as non-fatal. I think it's nicer to let the user decide by returning regular errors.

cenkalti mentioned this issue May 16, 2023

Document mmap error handling #71

Open

cenkalti self-assigned this May 18, 2023

cenkalti added type/bug area/corruption labels May 18, 2023

github-actions bot added the stale label Apr 17, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024

ahrtr reopened this May 9, 2024

ahrtr removed the stale label May 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove panics in the library code #447

Remove panics in the library code #447

cenkalti commented Apr 1, 2023

infdahai commented May 4, 2023 •

edited

ptabor commented May 4, 2023 •

edited

infdahai commented May 4, 2023

ptabor commented May 4, 2023

cenkalti commented May 8, 2023

Remove panics in the library code #447

Remove panics in the library code #447

Comments

cenkalti commented Apr 1, 2023

infdahai commented May 4, 2023 • edited

ptabor commented May 4, 2023 • edited

infdahai commented May 4, 2023

ptabor commented May 4, 2023

cenkalti commented May 8, 2023

infdahai commented May 4, 2023 •

edited

ptabor commented May 4, 2023 •

edited