Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(PanicHandling): Add Options.PanicHandler #2033

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

VinGarcia
Copy link

Problem

This PR fixes #1883.

The issue is that when a device running badger is abruptly powered off it might cause a database corruption, and subsequently badger will continuously panic from inside a separate Goroutine when trying to open the database.

Panicking from inside a Goroutine causes all other Goroutines to die with an error code, and the main isssue is that having a defer calling recover() on other Goroutines cannot prevent this from happening, making it extremely difficult to recover from such a failure if the affected developer has no control over the panicking goroutine.

Solution

The solution proposed here is to create all Goroutines with a special method called DB._go() that will recover from the panic if the user provides a PanicHandler on the Options argument.

If a user decides to use this feature he should make sure not to ignore the panic, but this would give a window for developers to work around this issue by for example deleting the offending database and creating a new one, or even just reporting an error to the appropriate API so a human can be notified of this issue.

The main complexity in this PR is the issue regarding the use of closure and Goroutines inside for loops:

https://go.dev/doc/faq#closures_and_goroutines

Before this PR these issues were handled by passing the arguments to the Goroutines as in:

for i := 0; i < n; i++ {
    go someFunc(i)
    // or
    go func(threadID int) {
        doSomethingWith(threadID)
    }(i)
}

Due to the way the DB._go() function was implemented (i.e. receiving a closure as argument) it is not possible anymore to trust this technique to make sure the data is copied and not passed as a reference to the closure, so instead I am doing it like this now wherever necessary:

for i := 0; i < n; i++ {
    i := i // This effectively creates a copy of i that will not be overwritten after this iteration
    db._go(func() {
        someFunc(i)
    }
    // or
    threadID := i
    db._go(func() {
        doSomethingWith(threadID)
    })
}

Copy link

netlify bot commented Dec 7, 2023

Deploy Preview for badger-docs canceled.

Name Link
🔨 Latest commit 548c3ce
🔍 Latest deploy log https://app.netlify.com/sites/badger-docs/deploys/6571d40086bf1e0008f504fa

@CLAassistant
Copy link

CLAassistant commented Dec 7, 2023

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@VinGarcia VinGarcia changed the title fix(PanicHandling): Add Options.PanicHandler for making it possible to recover from corrupted databases fix(PanicHandling): Add Options.PanicHandler Dec 7, 2023
Comment on lines +356 to +358
s.kv._go(func() {
s.runCompactor(i, lc)
})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.kv is an instance of DB, so I assumed it was safe to start the Goroutine using this instance. But I am not 100% sure this instance is directly derived from the DB instance the user is using, so tell me if this is a mistake.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: Panic after power fail
2 participants