Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix head stats and hooks when replaying a corrupted snapshot #14079

Merged
merged 7 commits into from
May 25, 2024

Conversation

alanprot
Copy link
Contributor

@alanprot alanprot commented May 10, 2024

When loading a snapshot and encountering a corrupted chunk, we discard previously loaded series from the snapshot and resort to replaying the wall. In such cases, we were not resetting the number of series in the head, leading to double counting them.

Additionally, we did not invoke the PostDeletion hook when resetting the memory - this needs to be called as the PostCreation was called for the series which we were able to replay from the snapshot but were subsequently discarded.

@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 4 times, most recently from f8d34ad to 6ee4190 Compare May 10, 2024 23:23
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks make sense to me.

In Cortex, we rely on series lifecycle callback to keep track of active series. We hit this series double counting bug when getting a corrupted chunk and active series reached limit because post delete hook was not called.

@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch 3 times, most recently from cb8c835 to 61bf6b6 Compare May 13, 2024 18:19
Copy link
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this; looks fairly good but some comments from the bug-scrub meeting.

tsdb/head.go Outdated
totalSeries += len(deletedForCallback)
deletedFromPrevStripe = len(deletedForCallback)
}
s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a no-op because it is overwritten on line 319?

Copy link
Contributor Author

@alanprot alanprot May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. it is.. i just wanted to do that for correctness of the "reset" method - reset seems that we are resetting the struct to the initial state (empty) and could be reused. WDYT?

I can no do that and rename the method to something else (maybe flush? clean?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code:

  • Renamed the method from reset to flush
  • Removed the extra logic to clean the series

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still see s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size), is it still needed?
I added some comments below regarding iter, below.
As its invocation would become simpler, we can just move it directly to resetInMemoryState() with a comment.
(And entrust the task of naming to the person who will extract this code in the future :))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion: get rid of flush() and inline it in resetInMemoryState() without this allocation. Or remove this line of s.series = ... and probably rename this function to callPostDeletionForAll().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.. make sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.. i just did that! its better indeed!

tsdb/head.go Outdated Show resolved Hide resolved
@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) {
require.NoError(t, err)
f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0)
require.NoError(t, err)
_, err = f.WriteAt([]byte{0b11111111}, 18)
// lets corrupt middle of the snapshot, so we can replay some entries
_, err = f.WriteAt([]byte{0b11111111}, 300)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change from 18 to 300 significant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.. it is..

If we corrupt the byte 18, we will not be able to restore any series (and so we cannot see the problem).

Corrupting the byte 300, we are able to restore 2 timeseries before reaching the corrupted position, and so, highlight the problem.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don’t make it appear as if we’ve deleted a test (we'd want to continue running the other checks when no series has been restored) Let’s add that as another scenario. You can simply recreate a head at the end and run the new checks related to callbacks, or even better, run all checks for the two cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to create a new snapshot as the old one is already corrupted in the beginning. Will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I backed up the snapshot and restored to create the other test case! PTAL?

@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from 0b2ec4b to 70a26fd Compare May 14, 2024 16:24
@alanprot
Copy link
Contributor Author

@bboreham PTAL?

@alanprot alanprot requested a review from bboreham May 21, 2024 17:11
@jeromeinsf
Copy link

@codesome is this something you could help review?

Copy link
Collaborator

@machine424 machine424 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

tsdb/head.go Outdated
return deleted, rmChunks, actualMint, minOOOTime, minMmapFile
}

func (s *stripeSeries) iter(f func(int, uint64, *memSeries, map[chunks.HeadSeriesRef]labels.Labels), endShard func(map[chunks.HeadSeriesRef]labels.Labels)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make it more specific (by renaming it deleteFunc or iterForDeletion or sth else + some comments), call PostDeletion inside it and make it return the number of deleted series.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can also note in the comments what f should do with the map.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not have "deletion" in the name, because this function in itself has nothing to do with deletion. The users of this just happen to use it for deletion.

Copy link
Contributor Author

@alanprot alanprot May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we dont have deletion on the name we cannot call the post deletion hook inside the function! =/ What u guys think is better here? TBH i like the iterForDeletion as it makes the code cleaner!

tsdb/head.go Outdated Show resolved Hide resolved
tsdb/head.go Outdated Show resolved Hide resolved
tsdb/head.go Outdated
totalSeries += len(deletedForCallback)
deletedFromPrevStripe = len(deletedForCallback)
}
s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still see s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size), is it still needed?
I added some comments below regarding iter, below.
As its invocation would become simpler, we can just move it directly to resetInMemoryState() with a comment.
(And entrust the task of naming to the person who will extract this code in the future :))

tsdb/head_test.go Outdated Show resolved Hide resolved
@@ -4007,24 +4008,44 @@ func TestSnapshotError(t *testing.T) {
require.NoError(t, err)
f, err := os.OpenFile(path.Join(snapDir, files[0].Name()), os.O_RDWR, 0)
require.NoError(t, err)
_, err = f.WriteAt([]byte{0b11111111}, 18)
// lets corrupt middle of the snapshot, so we can replay some entries
_, err = f.WriteAt([]byte{0b11111111}, 300)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don’t make it appear as if we’ve deleted a test (we'd want to continue running the other checks when no series has been restored) Let’s add that as another scenario. You can simply recreate a head at the end and run the new checks related to callbacks, or even better, run all checks for the two cases.

alanprot and others added 4 commits May 24, 2024 12:32
Signed-off-by: alanprot <alanprot@gmail.com>
Signed-off-by: alanprot <alanprot@gmail.com>
Signed-off-by: alanprot <alanprot@gmail.com>
Co-authored-by: Ayoub Mrini <ayoubmrini424@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from a14e433 to 37aae47 Compare May 24, 2024 19:32
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the unit tests after this

tsdb/head.go Outdated
totalSeries += len(deletedForCallback)
deletedFromPrevStripe = len(deletedForCallback)
}
s.series = make([]map[chunks.HeadSeriesRef]*memSeries, s.size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion: get rid of flush() and inline it in resetInMemoryState() without this allocation. Or remove this line of s.series = ... and probably rename this function to callPostDeletionForAll().

tsdb/head.go Show resolved Hide resolved
tsdb/head.go Outdated Show resolved Hide resolved
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment on tests and the above comments, looks good otherwise. Thanks!

tsdb/head_test.go Outdated Show resolved Hide resolved
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from cbc345d to c19fb3c Compare May 24, 2024 21:15
Signed-off-by: alanprot <alanprot@gmail.com>
@alanprot alanprot force-pushed the fix-corrupted-snapshot-callbacks branch from c19fb3c to ad98c84 Compare May 24, 2024 21:17
Copy link
Member

@codesome codesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Small nits.

tsdb/head_test.go Outdated Show resolved Hide resolved
tsdb/head_test.go Outdated Show resolved Hide resolved
alanprot and others added 2 commits May 24, 2024 14:41
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Ganesh Vernekar <ganeshvern@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
@codesome codesome merged commit 8894d65 into prometheus:main May 25, 2024
25 checks passed
// and increment the series removed metrics
fs := h.series.iterForDeletion(func(_ int, _ uint64, s *memSeries, flushedForCallback map[chunks.HeadSeriesRef]labels.Labels) {
// All series should be flushed
flushedForCallback[s.ref] = s.lset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to submit this comment that was more of a question:
I don't know if we should lock s here, I don't know if we could have any races.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterForDeletion is already locking here, so should be good?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about this lock

series.Lock()
actually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum..

Do you think is needed? I can add those locks but right now i dont see a case where they are racing on the reset method.

On the check function we are reading and modifying the series at the same time we are possibly appending more samples, but maybe its not the case on the reset? I can still add just to be safe though.

@machine424
Copy link
Collaborator

Thanks @alanprot, this lgtm ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants