Fix head stats and hooks when replaying a corrupted snapshot #14079

alanprot · 2024-05-10T21:57:04Z

When loading a snapshot and encountering a corrupted chunk, we discard the series previously loaded from the snapshot and fall back to replaying the WAL. In such cases, we were not resetting the number of series in the head, which led to double counting them.

Additionally, we did not invoke the PostDeletion hook when resetting the in-memory state. It needs to be called because PostCreation was already called for the series that we were able to replay from the snapshot but that were subsequently discarded.
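
To make the double counting concrete, here is a minimal sketch of the failure mode. It is not the Prometheus code: `head`, `hooks`, `countingHooks`, `resetBuggy`, `resetFixed`, and `replay` are hypothetical, heavily simplified names chosen for illustration, and series are identified by plain strings.

```go
package main

import "fmt"

// hooks is a hypothetical, simplified stand-in for a series lifecycle callback.
type hooks interface {
	PostCreation(name string)
	PostDeletion(names []string)
}

// countingHooks tracks how many series an external observer believes exist.
type countingHooks struct{ active int }

func (c *countingHooks) PostCreation(string)         { c.active++ }
func (c *countingHooks) PostDeletion(names []string) { c.active -= len(names) }

// head is a toy in-memory head: a set of series plus a series counter.
type head struct {
	series    map[string]struct{}
	numSeries int
	cb        hooks
}

func newHead(cb hooks) *head {
	return &head{series: map[string]struct{}{}, cb: cb}
}

// getOrCreate adds a series if it is missing, bumps the counter,
// and fires PostCreation.
func (h *head) getOrCreate(name string) {
	if _, ok := h.series[name]; ok {
		return
	}
	h.series[name] = struct{}{}
	h.numSeries++
	h.cb.PostCreation(name)
}

// resetBuggy drops the series but leaves the counter and the observer untouched.
func (h *head) resetBuggy() {
	h.series = map[string]struct{}{}
}

// resetFixed reports every dropped series via PostDeletion and zeroes the counter.
func (h *head) resetFixed() {
	deleted := make([]string, 0, len(h.series))
	for name := range h.series {
		deleted = append(deleted, name)
	}
	h.cb.PostDeletion(deleted)
	h.numSeries = 0
	h.series = map[string]struct{}{}
}

// replay simulates a partial snapshot load, a reset, and a full WAL replay.
func replay(reset func(*head)) (numSeries, active int) {
	cb := &countingHooks{}
	h := newHead(cb)
	h.getOrCreate("a") // restored from the snapshot before the corrupted chunk
	h.getOrCreate("b")
	reset(h)
	for _, n := range []string{"a", "b", "c"} { // the WAL replay recreates everything
		h.getOrCreate(n)
	}
	return h.numSeries, cb.active
}

func main() {
	fmt.Println(replay((*head).resetBuggy)) // 5 5
	fmt.Println(replay((*head).resetFixed)) // 3 3
}
```

With the buggy reset, the two series restored from the snapshot are counted again when the WAL replay recreates them, so both the counter and the observer end up at 5 for 3 real series.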

yeya24

Thanks, makes sense to me.

In Cortex, we rely on the series lifecycle callback to keep track of active series. We hit this double-counting bug when a corrupted chunk was encountered: active series reached the limit because the post-deletion hook was not called.

bboreham

Thanks for this; looks fairly good but some comments from the bug-scrub meeting.

bboreham · 2024-05-14T11:45:36Z

tsdb/head.go

+		totalSeries += len(deletedForCallback)
+		deletedFromPrevStripe = len(deletedForCallback)
+	}
+	s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)


This seems to be a no-op because it is overwritten on line 319?

Yeah, it is. I just wanted to do that for the correctness of the "reset" method: the name suggests that we are resetting the struct to its initial (empty) state so that it could be reused. WDYT?

I can skip that and rename the method to something else (maybe flush? clean?)

Updated the code:

Renamed the method from reset to flush

Removed the extra logic to clean the series

I can still see s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size). Is it still needed?
I added some comments regarding iter below.
As its invocation would become simpler, we could just move it directly into resetInMemoryState() with a comment.
(And entrust the task of naming to whoever extracts this code in the future :))

My suggestion: get rid of flush() and inline it in resetInMemoryState() without this allocation. Or remove this line of s.series = ... and probably rename this function to callPostDeletionForAll().

OK, makes sense!

OK, I just did that! It's better indeed!

bboreham · 2024-05-14T11:48:48Z

tsdb/head_test.go

@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) {
 	require.NoError(t, err)
 	f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0)
 	require.NoError(t, err)
-	_, err = f.WriteAt([]byte{0b11111111}, 18)
+	// lets corrupt middle of the snapshot, so we can replay some entries
+	_, err = f.WriteAt([]byte{0b11111111}, 300)


Is the change from 18 to 300 significant?

Yeah, it is.

If we corrupt byte 18, we are not able to restore any series, so the problem does not show up.

If we corrupt byte 300, we are able to restore 2 time series before reaching the corrupted position, which exposes the problem.

We shouldn't make it appear as if we've deleted a test (we'd still want to run the other checks when no series has been restored), so let's add this as another scenario. You can simply recreate a head at the end and run the new callback-related checks or, even better, run all checks for both cases.

I will need to create a new snapshot, as the old one is already corrupted at the beginning. Will do.

I backed up the snapshot and restored it to create the other test case. PTAL?

alanprot · 2024-05-16T20:43:43Z

@bboreham PTAL?

jeromeinsf · 2024-05-22T18:32:29Z

@codesome is this something you could help review?

machine424

Nice catch!

machine424 · 2024-05-24T08:08:44Z

tsdb/head.go

+	return deleted, rmChunks, actualMint, minOOOTime, minMmapFile
+}
+
+func (s *stripeSeries) iter(f func(int, uint64, *memSeries, map[chunks.HeadSeriesRef]labels.Labels), endShard func(map[chunks.HeadSeriesRef]labels.Labels)) {


I think we can make it more specific (by renaming it to deleteFunc, iterForDeletion, or something else, plus some comments), call PostDeletion inside it, and make it return the number of deleted series.

We can also note in the comments what f should do with the map.

Let's not have "deletion" in the name, because this function in itself has nothing to do with deletion. The users of this just happen to use it for deletion.

If we don't have "deletion" in the name, we cannot call the post-deletion hook inside the function! =/ What do you think is better here? TBH, I like iterForDeletion, as it makes the code cleaner!
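
For readers following the naming discussion, here is a rough sketch of what an iterForDeletion-style helper could look like: it walks the stripes, lets a callback mark series for removal, calls the post-deletion hook once per stripe, and returns the number of deleted series. The types below (`seriesRef`, `stripe`, and the string label set) and the exact callback signature are simplified assumptions, not the signatures in tsdb/head.go.

```go
package main

import (
	"fmt"
	"sync"
)

type seriesRef uint64

type memSeries struct {
	ref  seriesRef
	lset string // stand-in for a real label set
}

// stripe is one lock-protected shard of the series map.
type stripe struct {
	mtx    sync.Mutex
	series map[seriesRef]*memSeries
}

type stripeSeries struct {
	stripes      []stripe
	postDeletion func(map[seriesRef]string)
}

func newStripeSeries(n int, postDeletion func(map[seriesRef]string)) *stripeSeries {
	s := &stripeSeries{stripes: make([]stripe, n), postDeletion: postDeletion}
	for i := range s.stripes {
		s.stripes[i].series = map[seriesRef]*memSeries{}
	}
	return s
}

func (s *stripeSeries) add(m *memSeries) {
	st := &s.stripes[uint64(m.ref)%uint64(len(s.stripes))]
	st.mtx.Lock()
	st.series[m.ref] = m
	st.mtx.Unlock()
}

// iterForDeletion visits every series stripe by stripe. check records the
// series it wants removed in the deleted map. After each stripe those entries
// are removed from the stripe and the post-deletion hook is invoked once.
// It returns the total number of deleted series.
func (s *stripeSeries) iterForDeletion(check func(stripeIdx int, m *memSeries, deleted map[seriesRef]string)) int {
	total := 0
	for i := range s.stripes {
		st := &s.stripes[i]
		deleted := map[seriesRef]string{}

		st.mtx.Lock()
		for _, m := range st.series {
			check(i, m, deleted)
		}
		for ref := range deleted {
			delete(st.series, ref)
		}
		st.mtx.Unlock()

		if len(deleted) > 0 {
			s.postDeletion(deleted)
			total += len(deleted)
		}
	}
	return total
}

func main() {
	hookCalls, hookSeries := 0, 0
	s := newStripeSeries(2, func(d map[seriesRef]string) {
		hookCalls++
		hookSeries += len(d)
	})
	for ref := seriesRef(1); ref <= 5; ref++ {
		s.add(&memSeries{ref: ref, lset: fmt.Sprintf("series_%d", ref)})
	}

	// Flush everything: every visited series is marked for deletion.
	n := s.iterForDeletion(func(_ int, m *memSeries, deleted map[seriesRef]string) {
		deleted[m.ref] = m.lset
	})
	fmt.Println(n, hookCalls, hookSeries) // 5 2 5
}
```

Keeping the hook call inside the helper means that both callers, a garbage-collection pass that marks only some series and a reset that marks all of them, cannot forget to invoke it, which is the trade-off being weighed against a more generic name.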

machine424 · 2024-05-24T08:29:00Z

tsdb/head.go

+		totalSeries += len(deletedForCallback)
+		deletedFromPrevStripe = len(deletedForCallback)
+	}
+	s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)


I can still see s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size). Is it still needed?
I added some comments regarding iter below.
As its invocation would become simpler, we could just move it directly into resetInMemoryState() with a comment.
(And entrust the task of naming to whoever extracts this code in the future :))

machine424 · 2024-05-24T08:57:36Z

tsdb/head_test.go

@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) {
 	require.NoError(t, err)
 	f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0)
 	require.NoError(t, err)
-	_, err = f.WriteAt([]byte{0b11111111}, 18)
+	// lets corrupt middle of the snapshot, so we can replay some entries
+	_, err = f.WriteAt([]byte{0b11111111}, 300)


We shouldn't make it appear as if we've deleted a test (we'd still want to run the other checks when no series has been restored), so let's add this as another scenario. You can simply recreate a head at the end and run the new callback-related checks or, even better, run all checks for both cases.

Signed-off-by: alanprot <alanprot@gmail.com>

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>

codesome

Checking the unit tests after this

codesome · 2024-05-24T19:17:32Z

tsdb/head.go

+		totalSeries += len(deletedForCallback)
+		deletedFromPrevStripe = len(deletedForCallback)
+	}
+	s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)


My suggestion: get rid of flush() and inline it in resetInMemoryState() without this allocation. Or remove this line of s.series = ... and probably rename this function to callPostDeletionForAll().

codesome

Just one comment on tests and the above comments, looks good otherwise. Thanks!

tsdb/head_test.go

Signed-off-by: alanprot <alanprot@gmail.com>

codesome

LGTM. Small nits.

tsdb/head_test.go

Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>

machine424 · 2024-05-25T12:52:59Z

tsdb/head.go

+		// and increment the series removed metrics
+		fs := h.series.iterForDeletion(func(_ int, _ uint64, s *memSeries, flushedForCallback map[chunks.HeadSeriesRef]labels.Labels) {
+			// All series should be flushed
+			flushedForCallback[s.ref] = s.lset


I forgot to submit this comment, which was more of a question:
I don't know if we should lock s here; I'm not sure whether we could have any races.

iterForDeletion is already locking here, so should be good?

I was actually talking about this lock: series.Lock() (prometheus/tsdb/head.go, line 1883 in e6f1f7e).

Hmm.

Do you think it is needed? I can add those locks, but right now I don't see a case where they would race in the reset method.

In the check function we are reading and modifying the series while more samples may be appended concurrently, but maybe that is not the case during the reset? I can still add them just to be safe, though.

machine424 · 2024-05-25T12:53:38Z

Thanks @alanprot, this lgtm ;)

alanprot requested a review from jesusvazquez as a code owner May 10, 2024 21:57

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 4 times, most recently from f8d34ad to 6ee4190 Compare May 10, 2024 23:23

yeya24 approved these changes May 11, 2024

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 3 times, most recently from cb8c835 to 61bf6b6 Compare May 13, 2024 18:19

bboreham reviewed May 14, 2024

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from 0b2ec4b to 70a26fd Compare May 14, 2024 16:24

alanprot requested a review from bboreham May 21, 2024 17:11

machine424 reviewed May 24, 2024

alanprot and others added 4 commits May 24, 2024 12:32

Fixing head stats and hooks when replaying a corrupted snapshot

3ae7efe

Signed-off-by: alanprot <alanprot@gmail.com>

Fixing create/removed series metrics

97f4263

Signed-off-by: alanprot <alanprot@gmail.com>

Refactoring to have common code between gc and flush method

250271a

Signed-off-by: alanprot <alanprot@gmail.com>

Update tsdb/head.go

37aae47

Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from a14e433 to 37aae47 Compare May 24, 2024 19:32

codesome reviewed May 24, 2024

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from cbc345d to c19fb3c Compare May 24, 2024 21:15

refactor

ad98c84

Signed-off-by: alanprot <alanprot@gmail.com>

alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from c19fb3c to ad98c84 Compare May 24, 2024 21:17

codesome approved these changes May 24, 2024

alanprot and others added 2 commits May 24, 2024 14:41

Update tsdb/head_test.go

3d6b15d

Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>

Update tsdb/head_test.go

f5fb207

Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com>

codesome merged commit 8894d65 into prometheus:main May 25, 2024
25 checks passed

machine424 reviewed May 25, 2024

gotjosh mentioned this pull request May 30, 2024

Merge upstream prometheus/prometheus at 37b408c grafana/mimir-prometheus#639

Merged

pracucci mentioned this pull request Jun 6, 2024

Update vendored prometheus grafana/mimir#8295

Merged
