6.0.100 (Corax) Killed by OOM, "corrupted" 50% of databases #18291

Open
xinix00 opened this issue Mar 26, 2024 · 27 comments
@xinix00
Contributor

xinix00 commented Mar 26, 2024

So I was syncing to a new server and indexing, and the process got killed by the OOM killer, which is not why I opened this issue. Half of my databases now show:

Voron.Exceptions.InvalidJournalException: The last synced transaction id was 191 (in journal: 9) but the first transaction being read in the recovery process is 237 in journal /ravendb/Databases/CUSTOMERNAME/Journals/0000000000000000010.journal

Or on multiple indexes:
---> Voron.Exceptions.InvalidJournalException: Failed to open a storage at /ravendb/Databases/CUSTOMERNAME/Indexes/ArticleAll due to invalid or missing journal files. In order to load the storage successfully we need all journals to be not corrupted. The recommended approach is to reset the index in order to recover from this error. Alternatively you can temporarily start the server in dangerous mode so it will ignore invalid journals on startup:
Raven.Server --Storage.Dangerous.IgnoreInvalidJournalErrors=true
This switch is meant to be use only for recovery purposes. Please make sure that you won't use it after you manage to recover your data. Eventually you should reset the index anyway to be sure you won't experience any invalid data errors in the future.
Error details: The last synced transaction id was 23 (in journal: 2) but the first transaction being read in the recovery process is 117 in journal /ravendb/Databases/CUSTOMERNAME/Indexes/ArticleAll/Journals/0000000000000000003.journal (transaction has valid hash). Tx diff is: 94. Some journals might be missing. Debug details - file header Version: 23, HeaderRevision: 10, TransactionId: 23, LastPageNumber: 3313, RootPageNumber: 0, CurrentJournal: 3, LastSyncedJournal: 2, LastSyncedTransactionId: 2, Flags: None. Journal details: CurrentJournal - 3, LastSyncedJournal - 2, LastSyncedTransactionId - 23, Flags - None

There are 2 other databases running on k3s which don't have data loss, and as far as I can see it isn't a hardware failure.

Maybe someone can shed some light on this.

Things I checked:

  • Since it is on Kubernetes, the PVC is ReadWriteOnce, which should prevent a new pod from accessing the directory while there are pending write actions.
    -- However, it is using the local-path provisioner (not local, but the Rancher / k3s default local-path); as far as I can tell it just maps a folder to the container/pod.
  • The Deployment is set to type Recreate, to make sure a pod is terminated before the new one starts.
  • fsync: this currently runs with the "standard" sync option of ZFS, so maybe this is something?
    NAME PROPERTY VALUE SOURCE
    poolname sync standard default
@arekpalinski
Member

arekpalinski commented Mar 26, 2024

What is the version of RavenDB server?

@xinix00
Contributor Author

xinix00 commented Mar 26, 2024

Sorry, it was in the title: 6.0.100

@ayende
Member

ayende commented Mar 26, 2024

It looks very much like there are missing files there.

In short, we run a checkpoint at various times; the last synced transaction (in this case) is 191.
After we fsync the database file, we delete the journal file.

In this case, the last synced transaction (191) doesn't match the first transaction in the latest journal (237).

Can you check if there is an older journal file?
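
For clarity, here is a minimal sketch of that invariant (my illustration, not the actual Voron recovery code; the function name and message are made up): the journals left on disk must continue right after the last synced transaction recorded in the header, otherwise recovery reports a gap.

/* Hedged sketch of the recovery invariant described above; illustrative only. */
#include <stdint.h>
#include <stdio.h>

static int check_recovery_invariant(int64_t last_synced_tx, int64_t first_tx_in_journal)
{
    /* The journals that remain on disk must pick up right after the
     * last transaction that was synced into the data file. */
    if (first_tx_in_journal > last_synced_tx + 1) {
        fprintf(stderr,
                "gap: last synced tx %lld, first tx in journal %lld, "
                "tx diff %lld - some journals might be missing\n",
                (long long)last_synced_tx,
                (long long)first_tx_in_journal,
                (long long)(first_tx_in_journal - last_synced_tx));
        return -1; /* would surface as an InvalidJournalException */
    }
    return 0;
}

int main(void)
{
    /* Numbers from the report above: last synced tx 191, first tx read 237. */
    return check_recovery_invariant(191, 237) == 0 ? 0 : 1;
}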

@arekpalinski
Member

arekpalinski commented Mar 26, 2024

To me it looks more like we read incorrect / old info from one of the headers.one / headers.two files.

(The relevant code: https://github.com/ravendb/ravendb/blob/v6.0/src/Raven.Pal/src/posix/writefileheader.c)
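
For readers following along, here is a minimal sketch of the write-then-fsync pattern that file is responsible for (a simplification under my own assumptions, not the actual Raven.Pal code):

/* Simplified sketch: write one of the header files and fsync it so the copy
 * is durable on disk before the storage relies on it. Not the real code. */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

static int write_file_header(const char *path, const void *header, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Write the full header at offset 0, then flush it to stable storage. */
    int rc = (pwrite(fd, header, len, 0) == (ssize_t)len && fsync(fd) == 0) ? 0 : -1;

    close(fd);
    return rc;
}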

@ayende
Member

ayende commented Mar 26, 2024

The issue is that we fsync there beforehand,
so there is an inconsistency between the files.

@xinix00
Contributor Author

xinix00 commented Mar 26, 2024

/Databases/CUSTOMERNAME1/Journals]# ls
0000000000000000009.journal 0000000000000000010.journal

I have 20 databases:

  • 3 databases are unable to load and have the InvalidJournalException (see above)
  • 1 database has: System.IO.InvalidDataException: Invalid checksum for page 83180, data file /ravendb/Databases/CUSTOMER10/Raven.voron might be corrupted, expected hash to be 14653700836286606917 but was 13238343487099555791. Page flags: Single, RawData.
  • 5 databases are able to load and have around 4 faulty indexes (see above)
  • 1 database has: Index corruption occurred: Voron.Exceptions.VoronUnrecoverableErrorException: The page 5234263630915007311 was not allocated, allocated pages: 32768 in mmf64: /ravendb/Databases/CUSTOMER11/Indexes/PreOrderPerDebtorAll/Raven.voron
  • 10 are without error.

P.S. I'm almost certain that the faulty indexes weren't running when the OOM happened; they were changed during the uptime of the database.

Did I mention that this is on an ARM64 server?

@ayende
Member

ayende commented Mar 26, 2024

ARM64 shouldn't matter.
Can you print the /Databases/CUSTOMERNAME1/ files including modification dates?

@xinix00
Contributor Author

xinix00 commented Mar 26, 2024

     7 -rw-------  1 dhcpcd dhcpcd        162 Mar 25 17:46 headers.one
     7 -rw-------  1 dhcpcd dhcpcd        162 Nov 29 05:30 headers.two
    12 drwxr-xr-x 71 dhcpcd dhcpcd         71 Jan 26 07:58 Indexes
     1 drwxr-xr-x  2 dhcpcd dhcpcd          4 Mar 25 17:46 Journals
     1 drwxr-xr-x  2 dhcpcd dhcpcd          2 Mar 13 15:56 PeriodicBackupTemp
718833 -rw-------  1 dhcpcd dhcpcd 1073741824 Mar 25 17:51 Raven.voron
    36 drwxr-xr-x  2 dhcpcd dhcpcd          4 Mar 26 12:47 Temp

@ayende
Member

ayende commented Mar 27, 2024

     7 -rw-------  1 dhcpcd dhcpcd        162 Mar 25 17:46 headers.one
     7 -rw-------  1 dhcpcd dhcpcd        162 Nov 29 05:30 headers.two

This is very suspicious. headers.two last written back in Nov? That doesn't make much sense.
What are the modification dates for the Journals as well?

The way we work, we flip between writing to headers.one and headers.two each time, writing the latest version to each in turn.
A Nov write time is a good indication that something is not okay, because with regular writes the two timestamps should be pretty much aligned. I would be surprised to see them hours apart, and certainly not days.
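
To make the recovery side of that concrete, here is a small sketch based on the description above (not the actual Voron code; the names are illustrative): both copies are read, and the one with the expected magic, a valid hash, and the higher header revision wins.

/* Hedged sketch of header selection on startup; field and function names
 * are illustrative, based on the layout quoted later in this thread. */
#include <stdint.h>
#include <stddef.h>

struct header_summary {
    uint64_t magic;            /* expected 0xB16BAADC0DEF0015 */
    int64_t  header_revision;  /* grows with every header write */
    int      hash_ok;          /* 1 if the trailing hash validated */
};

static const struct header_summary *
pick_header(const struct header_summary *one, const struct header_summary *two)
{
    const uint64_t MAGIC = 0xB16BAADC0DEF0015ULL;
    int one_valid = (one->magic == MAGIC) && one->hash_ok;
    int two_valid = (two->magic == MAGIC) && two->hash_ok;

    if (one_valid && two_valid)
        return one->header_revision >= two->header_revision ? one : two;
    if (one_valid)
        return one;
    if (two_valid)
        return two;
    return NULL; /* neither copy usable: the storage cannot be opened safely */
}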

@xinix00
Contributor Author

xinix00 commented Mar 27, 2024

35910 -rw------- 1 dhcpcd dhcpcd 33554432 Mar 25 17:46 0000000000000000009.journal
28139 -rw------- 1 dhcpcd dhcpcd 33554432 Mar 25 17:51 0000000000000000010.journal

P.S. The last time I synced was the 29th of November. Back then this database had 50,000 docs, now it has 100,000 docs, so it grew quite a bit.

@ayende
Member

ayende commented Mar 27, 2024

What do you mean by synced, the last time you wrote to the database? If that is the case, then Nov makes sense.
Here is the problem, what we see is this:

35910 -rw------- 1 dhcpcd dhcpcd 33554432 Mar 25 17:46 0000000000000000009.journal
        7 -rw-------  1 dhcpcd dhcpcd         162 Mar 25 17:46 headers.one

So both of them were written at roughly the same time. That (probably) indicates that there was a checkpoint at that time, but I can't tell more without the exact times (minute granularity isn't sufficient).

Basically, I'm highly suspicious that the order of writes as reflected on the volume after connecting to the new node isn't respected.

@xinix00
Contributor Author

xinix00 commented Mar 27, 2024

Synced:
I have a NoSQL database, from which I fetch the modified / new documents and BulkInsert them into this RavenDB database.

Files:

stat 0000000000000000009.journal 
  File: 0000000000000000009.journal
  Size: 33554432  	Blocks: 71820      IO Block: 16384  regular file
Device: 0,36	Inode: 199461      Links: 1
Access: (0600/-rw-------)  Uid: (  999/  dhcpcd)   Gid: (  999/  dhcpcd)
Access: 2023-12-13 12:39:38.998367688 +0000
Modify: 2024-03-25 17:46:47.951155737 +0000
Change: 2024-03-25 17:46:47.951155737 +0000
 Birth: 2023-12-13 12:39:38.998367688 +0000
stat 0000000000000000010.journal 
  File: 0000000000000000010.journal
  Size: 33554432  	Blocks: 56278      IO Block: 16384  regular file
Device: 0,36	Inode: 613767      Links: 1
Access: (0600/-rw-------)  Uid: (  999/  dhcpcd)   Gid: (  999/  dhcpcd)
Access: 2024-03-25 17:46:48.187157763 +0000
Modify: 2024-03-25 17:51:26.441548609 +0000
Change: 2024-03-25 17:51:26.441548609 +0000
 Birth: 2024-03-25 17:46:48.187157763 +0000
stat headers.one
  File: headers.one
  Size: 162       	Blocks: 13         IO Block: 512    regular file
Device: 0,36	Inode: 198431      Links: 1
Access: (0600/-rw-------)  Uid: (  999/  dhcpcd)   Gid: (  999/  dhcpcd)
Access: 2023-12-13 12:39:36.546346219 +0000
Modify: 2024-03-25 17:46:48.187157763 +0000
Change: 2024-03-25 17:46:48.187157763 +0000
 Birth: 2023-12-13 12:39:36.546346219 +0000

I don't get the "new node" part; it is all on a single server with local storage.

@ayende
Member

ayende commented Mar 27, 2024

Can you send the headers.one file? (No user data is located there, FYI)

@xinix00
Contributor Author

xinix00 commented Mar 27, 2024

hexdump headers.one
0000000 0015 0def aadc b16b 0017 0000 0012 0000
0000010 0000 0000 00bf 0000 0000 0000 efee 0000
0000020 0000 0000 0201 426d 0000 0000 0000 0001
0000030 0000 0000 0000 0004 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0005
0000050 0000 0000 0000 00ba 0000 0000 0000 0002
0000060 0000 000a 0000 0000 0000 0009 0000 0000
0000070 0000 00bf 0000 0000 0000 0000 0000 ffff
0000080 ffff ffff ffff ffff ffff ffff ffff 000a
0000090 0000 0000 0000 2000 0000 0b67 ce2e b2c2
00000a0 164a                                   
00000a2

@ayende
Member

ayende commented Mar 28, 2024

Okay, this is weird. I looked into this, and it doesn't make sense.
The first part of a header file is a magic number, which should be: 0xB16BAADC0DEF0015
In the case above, it is: 0x6bb1dcaaef0d1500

The rest of the data there is nonsense as well. What is the value on headers.two? I assume that it would make more sense and have the right magic number.
That would make sense, then, because we would read the one file, see it is garbage, and go read the other.

The issue is that this indicates some problem in the files, which then moves the problem to the invariant violations that we already suspect.

For reference, here is the expected layout of the headers.* files.

The first 8 bytes should be: 0xB16BAADC0DEF0015
And bytes 8..11 (the Version field) should be: 23

Type layout for 'FileHeader'
Size: 162 bytes. Paddings: 8 bytes (%4 of empty space)
|=============================================================|
|   0-7: UInt64 MagicMarker (8 bytes)                         |
|-------------------------------------------------------------|
|  8-11: Int32 Version (4 bytes)                              |
|-------------------------------------------------------------|
| 12-19: Int64 HeaderRevision (8 bytes)                       |
|-------------------------------------------------------------|
| 20-27: Int64 TransactionId (8 bytes)                        |
|-------------------------------------------------------------|
| 28-35: Int64 LastPageNumber (8 bytes)                       |
|-------------------------------------------------------------|
| 36-97: TreeRootHeader Root (62 bytes)                       |
| |===============================================|           |
| |     0: RootObjectType RootObjectType (1 byte) |           |
| |-----------------------------------------------|           |
| |     1: TreeFlags Flags (1 byte)               |           |
| |-----------------------------------------------|           |
| |   2-9: Int64 RootPageNumber (8 bytes)         |           |
| |-----------------------------------------------|           |
| | 10-17: Int64 BranchPages (8 bytes)            |           |
| |-----------------------------------------------|           |
| | 18-25: Int64 LeafPages (8 bytes)              |           |
| |-----------------------------------------------|           |
| | 26-33: padding (8 bytes)                      |           |
| |-----------------------------------------------|           |
| | 34-41: Int64 OverflowPages (8 bytes)          |           |
| |-----------------------------------------------|           |
| | 42-49: Int64 PageCount (8 bytes)              |           |
| |-----------------------------------------------|           |
| | 50-57: Int64 NumberOfEntries (8 bytes)        |           |
| |-----------------------------------------------|           |
| | 58-61: Int32 Depth (4 bytes)                  |           |
| |===============================================|           |
|-------------------------------------------------------------|
| 98-125: JournalInfo Journal (28 bytes)                      |
| |====================================================|      |
| |   0-7: Int64 CurrentJournal (8 bytes)              |      |
| |----------------------------------------------------|      |
| |  8-15: Int64 LastSyncedJournal (8 bytes)           |      |
| |----------------------------------------------------|      |
| | 16-23: Int64 LastSyncedTransactionId (8 bytes)     |      |
| |----------------------------------------------------|      |
| | 24-26: <Reserved>e__FixedBuffer Reserved (3 bytes) |      |
| |----------------------------------------------------|      |
| |    27: JournalInfoFlags Flags (1 byte)             |      |
| |====================================================|      |
|-------------------------------------------------------------|
| 126-149: IncrementalBackupInfo IncrementalBackup (24 bytes) |
| |================================================|          |
| |   0-7: Int64 LastBackedUpJournal (8 bytes)     |          |
| |------------------------------------------------|          |
| |  8-15: Int64 LastBackedUpJournalPage (8 bytes) |          |
| |------------------------------------------------|          |
| | 16-23: Int64 LastCreatedJournal (8 bytes)      |          |
| |================================================|          |
|-------------------------------------------------------------|
| 150-153: Int32 PageSize (4 bytes)                           |
|-------------------------------------------------------------|
| 154-161: UInt64 Hash (8 bytes)                              |
|=============================================================|
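
For anyone who wants to inspect headers.one / headers.two offline, here is a packed C mirror of that layout. It is reconstructed from the printout above rather than taken from the RavenDB sources, so treat it as a sketch:

/* Packed reconstruction of the FileHeader layout printed above. */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t root_object_type;
    uint8_t flags;
    int64_t root_page_number;
    int64_t branch_pages;
    int64_t leaf_pages;
    int64_t padding;
    int64_t overflow_pages;
    int64_t page_count;
    int64_t number_of_entries;
    int32_t depth;
} TreeRootHeader;                 /* 62 bytes */

typedef struct {
    int64_t current_journal;
    int64_t last_synced_journal;
    int64_t last_synced_transaction_id;
    uint8_t reserved[3];
    uint8_t flags;
} JournalInfo;                    /* 28 bytes */

typedef struct {
    int64_t last_backed_up_journal;
    int64_t last_backed_up_journal_page;
    int64_t last_created_journal;
} IncrementalBackupInfo;          /* 24 bytes */

typedef struct {
    uint64_t magic_marker;        /* 0xB16BAADC0DEF0015 */
    int32_t  version;             /* 23 for this storage version */
    int64_t  header_revision;
    int64_t  transaction_id;
    int64_t  last_page_number;
    TreeRootHeader root;
    JournalInfo journal;
    IncrementalBackupInfo incremental_backup;
    int32_t  page_size;
    uint64_t hash;
} FileHeader;                     /* 162 bytes */
#pragma pack(pop)

_Static_assert(sizeof(FileHeader) == 162, "layout must match the on-disk header");

On a little-endian machine, reading the 162 bytes of a header file into this struct should reproduce the values RavenDB prints in its error messages (for example LastSyncedJournal and LastSyncedTransactionId).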

@xinix00
Contributor Author

xinix00 commented Mar 29, 2024

hexdump headers.two 
0000000 0015 0def aadc b16b 0017 0000 0011 0000
0000010 0000 0000 00bf 0000 0000 0000 efee 0000
0000020 0000 0000 0201 426d 0000 0000 0000 0001
0000030 0000 0000 0000 0004 0000 0000 0000 0000
0000040 0000 0000 0000 0000 0000 0000 0000 0005
0000050 0000 0000 0000 00ba 0000 0000 0000 0002
0000060 0000 0009 0000 0000 0000 0009 0000 0000
0000070 0000 00bf 0000 0000 0000 0000 0000 ffff
0000080 ffff ffff ffff ffff ffff ffff ffff 0009
0000090 0000 0000 0000 2000 0000 014a ed65 df95
00000a0 07fe                                   
00000a2

@xinix00
Contributor Author

xinix00 commented Mar 29, 2024

P.S. As a little note, xxd (vim) shows 1500 ef0d, not 0015 0def, so I don't know which one is little-endian and which is big-endian.

If I reorder the bytes of your value I get:
Yours: 0x6bb1dcaaef0d1500
Bytes swapped: 0xb16baadc0def0015

which equals your expected 0xB16BAADC0DEF0015.
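
If I read it that way, the magic actually looks intact; hexdump's default output shows 16-bit words in host byte order, so on a little-endian machine the bytes within each pair appear swapped relative to xxd. A quick check (a throwaway sketch of mine, assuming a little-endian host such as x86-64 or ARM64):

/* Illustration of the byte-order point above: the on-disk bytes
 * 15 00 ef 0d dc aa 6b b1, read as a little-endian 64-bit integer,
 * give the expected magic 0xB16BAADC0DEF0015. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const uint8_t on_disk[8] = { 0x15, 0x00, 0xef, 0x0d, 0xdc, 0xaa, 0x6b, 0xb1 };
    uint64_t magic;
    memcpy(&magic, on_disk, sizeof magic);             /* little-endian host assumed */
    printf("0x%016llX\n", (unsigned long long)magic);  /* prints 0xB16BAADC0DEF0015 */
    return 0;
}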

@ayende
Member

ayende commented Mar 31, 2024

Can you send both of those files as binary (zip, etc)? I think that roundtripping through hexdump messed things up, yes.

@xinix00
Contributor Author

xinix00 commented Apr 3, 2024

Archive.zip
Here you are.

@ayende
Member

ayende commented Apr 4, 2024

Thanks, that was very helpful. The files are now workable.
Looking into this, headers.one is the newer version, and the last checkpoint is from journal 9, transaction 191.

Can you send us journal files 9 and 10 for analysis? (Note: they contain some user data.)
Feel free to send them directly to the support email.

@xinix00
Contributor Author

xinix00 commented Apr 4, 2024

You should have received it on your support mailbox.

@ayende
Member

ayende commented Apr 4, 2024

Thanks, that was really interesting.
Here are my findings: in journal 9, we see transaction commits from 158 .. 191, covering the byte range 0 .. 12,967,936. That seems valid so far; it is about 12.3 MB out of a 32 MB file.

The rest of the file is set to zeroes.

In journal 10, the transactions run from 237 all the way to 276, covering the byte range 0 .. 26,300,416; the rest is zeroes.

This looks very much like the journal 9 file was zeroed somehow.
I have no good explanation for why this could happen.
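
In case it helps reproduce or triage this, a rough way to spot the same zeroed-tail pattern (a throwaway sketch, not a RavenDB tool) is to report the offset of the last non-zero byte in a journal file:

/* Scan a journal file and report where the non-zero data ends. */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <journal>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    long last_nonzero = -1, offset = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c != 0)
            last_nonzero = offset;
        offset++;
    }
    fclose(f);

    printf("file size: %ld bytes, last non-zero byte at offset %ld\n", offset, last_nonzero);
    return 0;
}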

@xinix00
Contributor Author

xinix00 commented Apr 8, 2024

OK, strange. Did you notice that this happened on several databases, and that either the indexes, the Raven.voron file, or the transaction files are corrupt?

Could this be ZFS related? Should I try to replicate it (since this happened on multiple databases, it might not be hard to reproduce)?

@ayende
Member

ayende commented Apr 8, 2024

If you can replicate it, that would be amazing, yes.
I'm not aware of anything in ZFS that would cause this, but it's possible.

I'm leaning more toward the k8s setup.

@xinix00
Contributor Author

xinix00 commented Apr 8, 2024

k3s should be using local-path; I will read into this, but I guess it shouldn't be more than a volume mount in Docker (which should write directly without any virtual layers). The storage is just local SSD storage with the default ZFS configuration.

@ayende
Member

ayende commented Apr 8, 2024

If everything is running on a single machine, then yes, that is really something that we would love to reproduce.
I assumed that this happened when you moved storage nodes around.

@ayende
Member

ayende commented Apr 14, 2024

Another issue: this is ARM64, and we are tracking a separate problem that may be related to a bad JIT.
See: dotnet/runtime#100705
