New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Treat docs with recovery_source as deletes in merges #107979
base: main
Are you sure you want to change the base?
Conversation
18efbc2
to
69f75ce
Compare
I benchmarked this change with the tsdb track and noticed that we triggered merges too eagerly. I'll investigate further. Two things I'm pondering: (1) treating a document with recovery source as 1/2 deletes, and (2) accounting for them only in force-merges.
|
There is a bug in Lucene merges. I've opened: apache/lucene#13324 |
Adrien kindly discussed this offline with me. Overriding |
} | ||
|
||
@Override | ||
public int numDeletesToMerge(SegmentCommitInfo info, int delCount, IOSupplier<CodecReader> readerSupplier) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's dangerous to piggyback on the number of deletes here. I think we should rather think about having a notion of how much space we free and use that in the MP upstream to make decisions in what to merge? I wanna prevent that we get illegal invariants in terms of the promises we make here on how many docs will be deleted and then after the fact it's not true and the new segment suddenly grew unexpectedly?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @s1monw. That's a great suggestion. I'll implement it.
Currently, we do not account the number of documents with 'recovery_source' ready to drop when selecting merge specifications. Previously, a single large segment containing 'recovery_source' documents was considered fully merged, even though Elasticsearch should have triggered a merge to remove them.
With this PR, we're adjusting the merge policy to treat documents with 'recovery_source' ready to drop as deletions when determining merge specifications. Essentially, documents with 'recovery_source' are now treated as soft-deleted by the Elasticsearch merge policy.
We will need a follow-up to trigger merges when the retention leases advance enough to drop soft-deletes and recovery_source.