Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: Panic after power fail #1883

Open
dgallion1 opened this issue Feb 23, 2023 · 27 comments · May be fixed by #2033
Open

[BUG]: Panic after power fail #1883

dgallion1 opened this issue Feb 23, 2023 · 27 comments · May be fixed by #2033
Labels
kind/bug Something is broken.

Comments

@dgallion1
Copy link

dgallion1 commented Feb 23, 2023

What version of Badger are you using?

v3.2103.2

What version of Go are you using?

go version go1.19.4 windows/amd64

Have you tried reproducing the issue with the latest release?

None

What is the hardware spec (RAM, CPU, OS)?

OS Name Microsoft Windows 11 Pro
Version 10.0.22621 Build 22621
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name AMD2
System Manufacturer ASUS
System Model System Product Name
System Type x64-based PC
System SKU SKU
Processor AMD Ryzen 9 7950X 16-Core Processor, 4501 Mhz, 16 Core(s), 32 Logical Processor(s)
BIOS Version/Date American Megatrends Inc. 0705, 10/5/2022
SMBIOS Version 3.5
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer ASUSTeK COMPUTER INC.
BaseBoard Product ROG CROSSHAIR X670E EXTREME
BaseBoard Version Rev 1.xx
Platform Role Desktop
Secure Boot State Off
PCR7 Configuration Elevation Required to View
Windows Directory C:\Windows
System Directory C:\Windows\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = "10.0.22621.819"
User Name AMD2\dgall
Time Zone Eastern Standard Time
Installed Physical Memory (RAM) 128 GB
Total Physical Memory 127 GB
Available Physical Memory 108 GB
Total Virtual Memory 145 GB
Available Virtual Memory 119 GB
Page File Space 18.0 GB
Page File C:\pagefile.sys
Kernel DMA Protection Off
Virtualization-based security Running
Virtualization-based security Required Security Properties
Virtualization-based security Available Security Properties Base Virtualization Support, DMA Protection, UEFI Code Readonly, SMM Security Mitigations 1.0, Mode Based Execution Control
Virtualization-based security Services Configured Hypervisor enforced Code Integrity
Virtualization-based security Services Running Credential Guard, Hypervisor enforced Code Integrity
Windows Defender Application Control policy Enforced
Windows Defender Application Control user mode policy Off
Device Encryption Support Elevation Required to View
A hypervisor has been detected. Features required for Hyper-V will not be displayed.

What steps will reproduce the bug?

Power removed.

Expected behavior and actual result.

After a power failure, running Windows 11.

panic: runtime error: slice bounds out of range [-16:] [recovered]
panic:
== Recovering from initIndex crash ==
File Info: [ID: 1157, Size: 8545749, Zeros: 8545749]
isEnrypted: true checksumLen: 0 checksum: indexLen: 0 index: []
== Recovered ==

goroutine 33 [running]:
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:351 +0xa8
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:397 +0xc2
panic({0x7ff7462d5cc0, 0xc00059e018})
C:/Program Files/Go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/badger/v3/table.(*Table).decrypt(0xc00087c030?, {0x22decfa65cd?, 0xc000159970?, 0x7ff745b5e2ce?}, 0xcd?)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:753 +0x16b
github.com/dgraph-io/badger/v3/table.(*Table).readTableIndex(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:702 +0x5c
github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:462 +0x19d
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0xc00021ee40)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:401 +0x7f
github.com/dgraph-io/badger/v3/table.OpenTable(0xc000142260, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0xc000852200, 0x1, ...})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:308 +0x272
github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc0004a60c0, 0x13}, {0x0?, 0x0?, 0x0?})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:150 +0x1f9
created by github.com/dgraph-io/badger/v3.newLevelsController
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:129 +0x585

Additional information

On Windows.
Our production DB is dead so often that we take a full backup every 5min.
Then attempt to recover and restore if needed.
Or keys are lost, and we don't know that.
Can't capture the error with recover(). How is that possible?

@dgallion1 dgallion1 added the kind/bug Something is broken. label Feb 23, 2023
@mangalaman93
Copy link
Contributor

seems related to #1625 #1626 and DGRAPH-2841 on JIRA board.

It seems like the whole file has 0s in it and no data. We do not support windows officially yet, but I am curious why it would crash every 5 min. It is crashing with this issue every 5 min?

@mangalaman93
Copy link
Contributor

also related to #1797

@dgallion1
Copy link
Author

It's not crashing every 5min. We take a complete backup every 5min and keep them in min/hour/day folders. So after a crash, we can recover. Sometimes the DB is just corrupted, and we can't detect that. So 5min backups start being overwritten with insufficient data.

Thinking about always backing up during shutdown.
Then using the backup when starting back up.

I don't know what to do about the silent loss of keys.

@mangalaman93
Copy link
Contributor

If you could share your last backup and the data after immediate crash (If it is not confidential) and along with logs, that would give up good amount of information to debug this.

As I said, we currently do not run our tests on windows, I am not sure if this will change in future. mmap could be acting different on windows leading to corruption of the db. I do see quite a few open issues on windows related to this, maybe that pushes us to at least ensure our tests pass on windows. Thanks for information in any case.

@dgallion1
Copy link
Author

I appreciate your interest. I'll try to gather more data for you. Confidential data is an issue.

@shufps
Copy link

shufps commented Mar 19, 2023

It's not crashing every 5min. We take a complete backup every 5min and keep them in min/hour/day folders.

How do you do this?

Since it's ACID compliant, it should work. But only if you take an atomic filesystem snapshot. Copying the database just with rsync during it's being used certainly is not the way to go.

@dgallion1
Copy link
Author

A task reads all keys/values and writes them to a file.

@dgallion1
Copy link
Author

panic: runtime error: slice bounds out of range [-16:] [recovered]
panic:
== Recovering from initIndex crash ==
File Info: [ID: 1193, Size: 5060869, Zeros: 5060869]
isEnrypted: true checksumLen: 0 checksum: indexLen: 0 index: []
== Recovered ==

goroutine 58 [running]:
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:351 +0xa8
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1()
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:397 +0xc2
panic({0x7ff62bd59980, 0xc00012a330})
C:/Program Files/Go/src/runtime/panic.go:884 +0x213
github.com/dgraph-io/badger/v3/table.(*Table).decrypt(0xc000612110?, {0x1a2e75638fd?, 0xc0001cb970?, 0x7ff62b5d02ce?}, 0xfd?)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:753 +0x16b
github.com/dgraph-io/badger/v3/table.(*Table).readTableIndex(0xc00059c0c0)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:702 +0x5c
github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0xc00059c0c0)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:462 +0x19d
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0xc00059c0c0)
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:401 +0x7f
github.com/dgraph-io/badger/v3/table.OpenTable(0xc000236000, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0xc00016a780, 0x1, ...})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:308 +0x272
github.com/dgraph-io/badger/v3.newLevelsController.func1({0xc00012a2a0, 0x13}, {0x0?, 0x0?, 0x0?})
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:150 +0x1f9
created by github.com/dgraph-io/badger/v3.newLevelsController
C:/Users/dgall/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:129 +0x585

@shufps
Copy link

shufps commented Apr 28, 2023

A task reads all keys/values and writes them to a file.

That's not a clean and consistent approach. If you have to do without shutdown, you have to do it atomically on filesystem level and rely on ACID conformity and recovery.

@dgallion1
Copy link
Author

It can't be shut down every 5min to do a backup.
A Windows update or power failure has been known to crash the DB.
A hot backup is the best I know of.
Fortunately, the design replicates data from other Badger DBs.
So much of the DB heals itself.
Looking at pushing the rest of it into something like Postgres.
Going full Rube Goldberg here, then restoring critical data from Postgres at startup.

@shufps
Copy link

shufps commented May 1, 2023

It can't be shut down every 5min to do a backup.

Do an atomic file-system snapshot. They are provided by e.g. zfs, btrfs. Or if you have the money storage systems that support them like from Netapp (but also others ofc because this is standard)

This should allow you to do hot-backups with a consistent state because, as said, it's an atomic operation.

A Windows update or power failure has been known to crash the DB.

Oh Windows ....... Hmm, this is a problem.

@mangalaman93
Copy link
Contributor

Badger provides snapshot isolation. That means, if you use a txn to read data during backup, you will get a consistent snapshot of the data. You don't need to worry about anything in that case and your snapshot is a consistent view of the data at a given point in time.

@gemone
Copy link

gemone commented May 31, 2023

This is indeed a problem, but the main issue is that the exception cannot be caught through err or recover, which will result in continuous crashes of the program and make subsequent processing methods unachievable.

package main

import (
        "log"

        "github.com/dgraph-io/badger/v4"
)

func main() {
        // can not recover
        defer func() {
                err := recover()
                if err != nil {
                        log.Fatal("catch panic: ", err)
                }
        }()

        opt := badger.DefaultOptions("../db")
        opt.WithMemTableSize(1 << 20)
        opt.WithValueThreshold(1 << 10)
        opt.WithValueLogFileSize(1 << 20)
        opt.WithLoggingLevel(badger.DEBUG)

        db, err := badger.Open(opt)
        // no err
        if err != nil {
                log.Fatal("have err: ", err)
        }
        db.Close()
}
panic: runtime error: index out of range [3] with length 0 [recovered]
        panic: 
== Recovering from initIndex crash ==
File Info: [ID: 44, Size: 680, Zeros: 680]
isEnrypted: false checksumLen: 0 checksum:  indexLen: 0 index: [] 
== Recovered ==


goroutine 16 [running]:
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1.1()
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:353 +0xa8
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1()
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:399 +0xc2
panic({0x867b00, 0xc00001c468})
        /home/muk/.g/go/src/runtime/panic.go:884 +0x213
github.com/google/flatbuffers/go.GetInt32(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/google/flatbuffers@v1.12.1/go/encode.go:85
github.com/google/flatbuffers/go.GetUOffsetT(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/google/flatbuffers@v1.12.1/go/encode.go:121
github.com/dgraph-io/badger/v4/fb.GetRootAsTableIndex(...)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/fb/TableIndex.go:14
github.com/dgraph-io/badger/v4/table.(*Table).readTableIndex(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:707 +0x15e
github.com/dgraph-io/badger/v4/table.(*Table).initIndex(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:464 +0x19d
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest(0xc000258000)
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:403 +0x7f
github.com/dgraph-io/badger/v4/table.OpenTable(0xc000140140, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...})
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/table/table.go:309 +0x278
github.com/dgraph-io/badger/v4.newLevelsController.func1({0xc0001320f0, 0x10}, {0x0?, 0x0?, 0x0?})
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/levels.go:150 +0x1f9
created by github.com/dgraph-io/badger/v4.newLevelsController
        /home/muk/codes/lang/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.1.0/levels.go:129 +0x585
exit status 2

@mangalaman93
Copy link
Contributor

Why would that be? You should be able to recover from panic using the recover() function. I don't have a windows machine, but do you have any insights?

@dgallion1
Copy link
Author

The example that Gemone gave isn't crashing for me.
I can't find a case where recover() won't catch the error.
This could be what I saw:
"In Go 1.16, the recover() function was made more robust by handling panics that occur in deferred functions. Previously, panics in deferred functions would cause the program to crash."

Now use this "github.com/gorpher/gowin32/session_notifications" to close the DB before a restart.

@gemone
Copy link

gemone commented Jun 2, 2023

Why would that be? You should be able to recover from panic using the recover() function. I don't have a windows machine, but do you have any insights?

@mangalaman93 This issue is not related to the system, but rather caused by the power outage that resulted in database file corruption. This issue would occur on any system. The inability to recover() may be due to an internal recover() call within Open(), or a panic occurring within a goroutine.

panic(fmt.Sprintf("%s\n== Recovered ==\n", debugBuf.String()))

The example that Gemone gave isn't crashing for me. I can't find a case where recover() won't catch the error. This could be what I saw: "In Go 1.16, the recover() function was made more robust by handling panics that occur in deferred functions. Previously, panics in deferred functions would cause the program to crash."

Now use this "github.com/gorpher/gowin32/session_notifications" to close the DB before a restart.

@dgallion1 I am using a program created with a service that automatically restarts in case of a crash. However, whenever the DB file is corrupted, it results in a panic that cannot recover(), which causes the service to keep restarting.

Go version: 1.20.3
Badger: 4.1.0

@mangalaman93
Copy link
Contributor

If you could help us reproduce it on linux, I will be happy to look into this on higher priority.

@dgallion1
Copy link
Author

dgallion1 commented Jun 22, 2023

I have the DB that caused this and will look into it more.

badger 2023/06/22 07:56:21 INFO: All 6 tables opened in 212ms
badger 2023/06/22 07:56:21 INFO: Discard stats nextEmptySlot: 0
badger 2023/06/22 07:56:21 INFO: Set nextTxnTs to 1687428125010547400
badger 2023/06/22 07:56:21 INFO: Deleting empty file: .//keyValue\000276.vlog
badger 2023/06/22 07:56:21 INFO:
badger 2023/06/22 07:56:21 INFO: Level: 0. 0 B Size. 0 B Max.
badger 2023/06/22 07:56:21 INFO: Level: 1. 0 B Size. 10 MiB Max.
badger 2023/06/22 07:56:21 INFO: Level: 2. 0 B Size. 10 MiB Max.
badger 2023/06/22 07:56:21 INFO: Level: 3. 0 B Size. 10 MiB Max.
badger 2023/06/22 07:56:21 INFO: Level: 4. 0 B Size. 22 MiB Max.
badger 2023/06/22 07:56:21 INFO: Level: 5. 240 MiB Size. 215 MiB Max.
badger 2023/06/22 07:56:21 INFO: Level: 6. 2.1 GiB Size. 2.1 GiB Max.
badger 2023/06/22 07:56:21 INFO: Attempting to compact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}
badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables
badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables
badger 2023/06/22 07:56:21 WARNING: While running doCompact with {level:5 score:1.71 adjusted:0 dropPrefixes:[] t:{baseLevel:0 targetSz:[] fileSz:[]}}. Error: Unable to fill tables
panic: Allocator can not allocate more than 64 buffers

goroutine 190 [running]:
github.com/dgraph-io/ristretto/z.(*Allocator).addBufferAt(0xc00075df20?, 0x7ff789f87cc5?, 0xd123022ac0?)
C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/ristretto@v0.1.1/z/allocator.go:251 +0x1d9
github.com/dgraph-io/ristretto/z.(*Allocator).Allocate(0xc000494e00, 0x2298)
C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/ristretto@v0.1.1/z/allocator.go:302 +0x159
github.com/dgraph-io/badger/v3/table.(*Builder).compressData(0x0?, {0xd1006761f6, 0x1d8c, 0x9e0a})
C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.5/table/builder.go:519 +0x99
github.com/dgraph-io/badger/v3/table.(*Builder).handleBlock(0xc0001b82d0)
C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.5/table/builder.go:169 +0xfc
created by github.com/dgraph-io/badger/v3/table.NewTableBuilder
C:/Users/SamPlace/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.5/table/builder.go:155 +0x28f

@dgallion1
Copy link
Author

dgallion1 commented Jun 22, 2023

This code crashes and isn't caught. Same thing in Badger.

func main() {
	defer RecoverStackTrace()
	go func() {
		panic("death")
	}()
	select {}
}

I was able to catch the crash here.
builder.go

func (b *Builder) handleBlock() {
	defer RecoverStackTrace()  <--- Catch it here now.
	defer b.wg.Done()

	doCompress := b.opts.Compression != options.None
	for item := range b.blockChan {

@Akhilesh53
Copy link

Is there any way to resolve this, this is becoming frequent in my case.

@gemone
Copy link

gemone commented Sep 21, 2023

Is there any way to resolve this, this is becoming frequent in my case.

@Akhilesh53 My current approach is to store the cached state in the registry. If the Badger does not start properly, I will delete the damaged files and restart the Badger. For reference only.

From the source code, it seems that this error was not passed for processing, which makes it difficult for me to handle the error properly.

@Akhilesh53
Copy link

Akhilesh53 commented Sep 21, 2023

But why badger is creating corrupted files and cannot read the same itself? I am these three errors continuously, mentioning below.


badger 2023/09/21 12:38:37 ERROR: Got error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598 for id: 67
badger 2023/09/21 12:38:37 ERROR: writeRequests: writeRequests err: cannot create new mem table error: newMemTable error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598
badger 2023/09/21 12:38:38 ERROR: Failure while flushing memtable to disk: error while creating table err: while creating table: /data/dedupe/data/table_50/Pentagram/000133.sst error: open /data/dedupe/data/table_50/Pentagram/000133.sst: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/000133.sst
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4/table.CreateTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/table/table.go:259
github.com/dgraph-io/badger/v4.(*DB).handleFlushTask
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:1061
github.com/dgraph-io/badger/v4.(*DB).flushMemtable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:1083
github.com/dgraph-io/badger/v4.Open.func5
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:354
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598. Retrying...
total time: cluster time for 50 took 13m33.967768745s

panic: cannot create new mem table error: newMemTable error: While opening memtable: /data/dedupe/data/table_50/Pentagram/00067.mem error: while opening file: /data/dedupe/data/table_50/Pentagram/00067.mem error: open /data/dedupe/data/table_50/Pentagram/00067.mem: no such file or directory
unable to open: /data/dedupe/data/table_50/Pentagram/00067.mem
github.com/dgraph-io/ristretto/z.OpenMmapFile
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/ristretto/z/file.go:80
github.com/dgraph-io/badger/v4.(*logFile).open
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:554
github.com/dgraph-io/badger/v4.(*DB).openMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:130
github.com/dgraph-io/badger/v4.(*DB).newMemTable
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/memtable.go:151
github.com/dgraph-io/badger/v4.(*DB).ensureRoomForWrite
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:995
github.com/dgraph-io/badger/v4.(*DB).writeRequests
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:829
github.com/dgraph-io/badger/v4.(*DB).doWrites.func1
    /data/badger_broadband/badger-cluster/vendor/github.com/dgraph-io/badger/v4/db.go:884
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598


goroutine 3487 [running]:
cluster/badger_cluster.(*Cluster).AddMemberAddress(0xc490f28df0, {​​​​​​​​0xbf7d98, 0xc000194000}​​​​​​​​​​​​​​​, 0x32?, 0x32, {​​​​​​​​​​​​​​​0xc48e35f800, 0x2f, 0x7f521b3908f0?}​​​​​​​​​​​​​​​)
    /data/badger_broadband/badger-cluster/badger_cluster/cluster.go:69 +0x276
main.makeCluster({​​​​​​​​​​​​​​​0xbf7d98, 0xc000194000}​​​​​​​​​​​​​​​, 0x32)
    /data/badger_broadband/badger-cluster/app/create_cluster.go:99 +0x3ea
main.worker(...)
    /data/badger_broadband/badger-cluster/app/create_cluster.go:59
main.main.func1()
    /data/badger_broadband/badger-cluster/app/create_cluster.go:45 +0x8e
created by main.main
    /data/badger_broadband/badger-cluster/app/create_cluster.go:42 +0x7e
  1. In Bytes() function , we are getting negative offset.

negative offset :: -1147080149
EOF
negative offset :: -1147080153
negative offset :: -5116465645
EOF
negative offset :: -5116465645
EOF
negative offset :: -1147080149
EOF
negative offset :: -1147080153
negative offset :: -5116465645
EOF

panic: runtime error: slice bounds out of range [-4169358568:] [recovered]
    panic: runtime error: slice bounds out of range [-4169358568:]
    panic: 
== Recovering from initIndex crash ==
File Info: [ID: 95, Size: 69121215, Zeros: 0]
isEnrypted: false checksumLen: 4238479779 
== Recovered ==

goroutine 6932842 [running]:
github.com/dgraph-io```
/badger/v4/table.(*Table).initBiggestAndSmallest.func1.1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:353 +0xa8
panic({​​​​​​​​​​​​​​​0xba9420, 0xc4324db2d8}​​​​​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/ristretto/z/file.go:116
github.com/dgraph-io/badger/v4/table.(*Table).read(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:420
github.com/dgraph-io/badger/v4/table.(*Table).readNoFail(0xd0a740?, 0xc4a7b68810?, 0xbfe74d?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:424 +0xd0
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:383 +0x2ff
panic({​​​​​​​​​​​​​​​0xba9420, 0xc4324db2c0}​​​​​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/ristretto/z/file.go:116
github.com/dgraph-io/badger/v4/table.(*Table).read(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:420
github.com/dgraph-io/badger/v4/table.(*Table).readNoFail(0x6?, 0x0?, 0x9?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:424 +0xd0
github.com/dgraph-io/badger/v4/table.(*Table).initIndex(0xc47e194180)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:445 +0x109
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest(0xc47e194180)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:403 +0x7f
github.com/dgraph-io/badger/v4/table.OpenTable(0xc31e8f4160, {​​​​​​​​​​​​​​​0x0, 0x1, 0x1000000, 0xf33333, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...}​​​​​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:309 +0x278
github.com/dgraph-io/badger/v4/table.CreateTable({​​​​​​​​​​​​​​​0xc167146b80, 0x33}​​​​​​​​​​​​​​​, 0xccef5a4090)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:273 +0x305
github.com/dgraph-io/badger/v4.(*levelsController).subcompact.func4(0xccef5a4090, 0xc025c8b6e0?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:846 +0x11b
created by github.com/dgraph-io/badger/v4.(*levelsController).subcompact
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:836 +0x707
 

panic: runtime error: invalid memory address or nil pointer dereference
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x941247]


goroutine 2551 [running]:
github.com/dgraph-io/badger/v4/table.(*Table).offsetsLength(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:133
github.com/dgraph-io/badger/v4/table.(*Table).DecrRef(0xc67b5ca9c0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:169 +0x87
github.com/dgraph-io/badger/v4.decrRefs(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:178
github.com/dgraph-io/badger/v4.(*levelsController).compactBuildTables.func6()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:955 +0x56
github.com/dgraph-io/badger/v4.(*levelsController).runCompactDef.func1()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1431 +0x25
panic({​​​​​​​​​​​0xb11c60, 0x11587a0}​​​​​​​​​​​)
    /usr/local/go/src/runtime/panic.go:884 +0x213
github.com/dgraph-io/badger/v4/fb.(*TableIndex).StaleDataSize(0x0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/fb/TableIndex.go:132 +0x1d
github.com/dgraph-io/badger/v4/table.(*Table).StaleDataSize(0x0?)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/table/table.go:667 +0x1e
github.com/dgraph-io/badger/v4.(*levelHandler).addSize(...)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:214
github.com/dgraph-io/badger/v4.(*levelHandler).replaceTables(0xc242f49140, {​​​​​​​​​​​0xc48d01a248, 0x1, 0x4208767?}​​​​​​​​​​​, {​​​​​​​​​​​0xc6fd7f5a98, 0x1, 0xc48d02b800?}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/level_handler.go:137 +0x2b3
github.com/dgraph-io/badger/v4.(*levelsController).runCompactDef(_, _, _, {​​​​​​​​​​​0xc48bea9250, 0x1, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, 0x7, 0x7}​​​​​​​​​​​, {​​​​​​​​​​​0xc48d02b800, ...}​​​​​​​​​​​}​​​​​​​​​​​, ...}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1460 +0x43b
github.com/dgraph-io/badger/v4.(*levelsController).doCompact(0xc116de2a80, 0x1, {​​​​​​​​​​​0x5, 0x3ffd1ba93016f62f, 0x3ffd1ba93016f62f, {​​​​​​​​​​​0x0, 0x0, 0x0}​​​​​​​​​​​, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, ...}​​​​​​​​​​​, ...}​​​​​​​​​​​}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:1544 +0x538
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor.func2({​​​​​​​​​​​0x5, 0x3ffd1ba93016f62f, 0x3ffd1ba93016f62f, {​​​​​​​​​​​0x0, 0x0, 0x0}​​​​​​​​​​​, {​​​​​​​​​​​0x4, {​​​​​​​​​​​0xc48d02b7c0, 0x7, 0x7}​​​​​​​​​​​, ...}​​​​​​​​​​​}​​​​​​​​​​​)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:465 +0x78
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor.func3()
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:488 +0x158
github.com/dgraph-io/badger/v4.(*levelsController).runCompactor(0xc116de2a80, 0x1, 0xc2026ffef0)
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:517 +0x382
created by github.com/dgraph-io/badger/v4.(*levelsController).startCompact
    /data/badger_broadband/badger_grpc/vendor/github.com/dgraph-io/badger/v4/levels.go:354 +0x4e
 

@sheabot
Copy link

sheabot commented Sep 26, 2023

I also encountered this on v3.2103.2 while writing unit tests on macOS that simulate a power loss and corruption on an embedded device. The following app causes a panic.

I think this could be handled better by adding bounds checking to table.(*Table).initIndex(). If we check readPos for a negative value before calling table.(*Table).readNoFail(), then we could return an error that would bubble up to Open(). Then the caller could handle the error instead of causing an unrecoverable panic.

https://github.com/dgraph-io/badger/blob/v3.2103.2/table/table.go#L430

func main() {
	dirpath := "/tmp/badger-issue-1883"

	// Clean environment
	err := os.RemoveAll(dirpath)
	if err != nil {
		log.Fatal(err)
	}

	// Open new database
	db, err := badger.Open(badger.DefaultOptions(dirpath))
	if err != nil {
		log.Fatal(err)
	}

	// Write to database
	err = db.Update(func(txn *badger.Txn) error {
		e := badger.NewEntry([]byte("key"), []byte("value"))
		if err := txn.SetEntry(e); err != nil {
			return err
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}

	// Close database
	db.Close()

	// Corrupt sst file
	filepath := path.Join(dirpath, "000001.sst")
	os.WriteFile(filepath, []byte("blah"), 0644)

	// Reopen database
	db, err = badger.Open(badger.DefaultOptions(dirpath))
	if err != nil {
		log.Fatal(err)
	}
}
badger 2023/09/26 16:00:47 INFO: All 0 tables opened in 0s
badger 2023/09/26 16:00:47 INFO: Discard stats nextEmptySlot: 0
badger 2023/09/26 16:00:47 INFO: Set nextTxnTs to 0
badger 2023/09/26 16:00:47 INFO: Lifetime L0 stalled for: 0s
badger 2023/09/26 16:00:47 INFO: 
Level 0 [ ]: NumTables: 01. Size: 171 B of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 64 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 6 [B]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level Done
badger 2023/09/26 16:00:47 INFO: All 1 tables opened in 0s
badger 2023/09/26 16:00:47 INFO: Discard stats nextEmptySlot: 0
panic: runtime error: slice bounds out of range [-1651269992:] [recovered]
	panic: runtime error: slice bounds out of range [-1651269992:]
	panic: 
== Recovering from initIndex crash ==
File Info: [ID: 1, Size: 4, Zeros: 0]
isEnrypted: false checksumLen: 1651269992 
== Recovered ==


goroutine 36 [running]:
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1.1()
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:351 +0x9c
panic({0x104ecb380, 0x1400002e180})
	/opt/homebrew/Cellar/go/1.20.7/libexec/src/runtime/panic.go:884 +0x204
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/ristretto@v0.1.0/z/file.go:116
github.com/dgraph-io/badger/v3/table.(*Table).read(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:418
github.com/dgraph-io/badger/v3/table.(*Table).readNoFail(0x104f005a8?, 0x140001e5d40?, 0x104d579df?)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:422 +0xe0
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest.func1()
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:381 +0x27c
panic({0x104ecb380, 0x1400002e168})
	/opt/homebrew/Cellar/go/1.20.7/libexec/src/runtime/panic.go:884 +0x204
github.com/dgraph-io/ristretto/z.(*MmapFile).Bytes(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/ristretto@v0.1.0/z/file.go:116
github.com/dgraph-io/badger/v3/table.(*Table).read(...)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:418
github.com/dgraph-io/badger/v3/table.(*Table).readNoFail(0x1400009d948?, 0x10497f140?, 0x1400009d948?)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:422 +0xe0
github.com/dgraph-io/badger/v3/table.(*Table).initIndex(0x1400015a240)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:443 +0xf0
github.com/dgraph-io/badger/v3/table.(*Table).initBiggestAndSmallest(0x1400015a240)
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:401 +0x5c
github.com/dgraph-io/badger/v3/table.OpenTable(0x140000cc140, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x1, ...})
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/table/table.go:308 +0x248
github.com/dgraph-io/badger/v3.newLevelsController.func1({0x140002ee150, 0x21}, {0x0?, 0x0?, 0x0?})
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:150 +0x198
created by github.com/dgraph-io/badger/v3.newLevelsController
	/Users/user/go/pkg/mod/github.com/dgraph-io/badger/v3@v3.2103.2/levels.go:129 +0x488
make[1]: *** [run] Error 2
make: *** [run] Error 2

@peiyangli
Copy link

I often encounter this problem. The database will be damaged when the Windows system is physically powered off or has a blue screen. There should be data that has not been written to the disk.

@VinGarcia
Copy link

VinGarcia commented Dec 6, 2023

Hey guys I wrote a draft PR on a badger fork for internal use on the company where I work on, and on my manual tests it fixed this issue by allowing the caller of the Open() function to handle panics in an appropriate way:

VinGarcia#1

@mangalaman93 if you think this solution looks promising I can polish it up, add some tests and create a proper PR for the official badger repo.

Let me know what you think.

Oh and I can't share the database I am using because it contains info private to my current company, but I will try to produce a broken database by repeatedly forcibly killing a Go process that is reading and writing to a sample database. I will try to get it done tonight.

[edit]: The panics are being forwarded correctly but I found an issue so the PR is not working properly yet, I am working on it [/edit] it's fixed now I think.

@VinGarcia VinGarcia linked a pull request Dec 7, 2023 that will close this issue
@VinGarcia
Copy link

@mangalaman93 I just created a new PR based on badger V4:

#2033

@panagiotis-bitharis
Copy link

I could reproduce the panic by replacing an *.sst file with an empty *.sst. It would be great to have this fixed, so we can react when a database is corrupt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something is broken.
Development

Successfully merging a pull request may close this issue.

9 participants