db: consider reworking rangedel, megingIter and levelIter interactions #2863

jbowens · 2023-08-30T18:00:08Z

Currently, there's an intricate coordination dance performed between mergingIter and levelIter in order to make range deletions work. This dance has several complexities that make it fragile and difficult to grok. The introduction of range-key masking and "truthful" block-property filters also forces us into some awkward contortions to deal with the fact that a file's point iterator and range deletion iterator may become exhausted at different times.

I suspect we could significantly clean things up by using the keyspan.InterleavingIter within the levelIter implementation. It would work something like:

If a file has range deletions, the levelIter initializes a keyspan.InterleavingIter with the file's point iterator and rangedel iterator. It configures the InterleavingIter to interleave both start and end boundaries of range deletions.
The levelIter keeps a file open until it exhausts the InterleavingIter. Since range deletions' start/end boundaries are interleaved, this ensures that we don't prematurely close the range deletion iterator during iteration. We can remove the code that manufactures "ignorable boundary keys." The interleaving iterator handles interleaving the rangedel end boundary if necessary.
The merging iterator no longer keeps a direct handle on a file's range deletion iterator. Instead, every mergingIterLevel has a getRangeDel() *keyspan.Span func that returns the range deletion covering that level's iterator's current position, if any. When a levelIter is setup, this is set to levelIter.Span which in turn calls InterelavingIter.Span.
The filteredIter interface and related tracking within the sstable iterator is removed. It's no longer necessary because range deletions are unconditionally interleaved, which ensures we never miss a range deletion due to block-property filters excluding point keys that exist beyond the smallest/largest range deletion bound.
The merging iterator iteration is adjusted to skip over range deletion boundaries that are interleaved and rise to the top of the heap, much like the existing code skips over the "ignorable boundary keys".
The mechanisms of applying range deletions also change. The mergingIterLevel no longer buffers a tombstone. It instead calls getRangeDel() which accesses the InterleavingIter's buffered tombstone instead. We remove mergingIter.init[Min,Max]RangeDelIters, because the InterleavingIter is transparently ensuring the range deletion iterators are appropriately positioned.

The memtable and indexed batch levels will also have to be adjusted to make use of a keyspan.InterleavingIter.

The text was updated successfully, but these errors were encountered:

jbowens · 2023-08-30T18:00:19Z

cc @sumeerbhola

For a couple years now, Pebble has not allowed the creation of untruncated range tombstones outside the context of virtual sstables. Additionally, since v22.2 untruncated range tombstones should all have been compacted and rewritten as truncated range tombstones. Virtual SSTables (which similarly allow a physical range deletion to exceed the bounds of the containing logical sstable) handle this case by truncating the tombstones at iteration time in the iterator returned by keyspan.Truncate. This leaves the merging iterator's delicate handling of untruncated range tombstones obsolete. This commit removes that complexity. In addition, as an extra safety precaution, the table stats collector's invariants-build assertion that range deletions are contained within their files' bounds is lifted into production builds as well. This provides a guarantee during store open that all tombstones are appropriately truncated. Relates to cockroachdb#2863.

This commit refactors the interface of the table cache `newIters` method to support creation of any of a sstable's iterators, including its point iterator, range deletion iterator and range key iterator. This is motivated by the refactor planned in cockroachdb#2863, which in turn requires refactoring some of the newIters call sites that determine whether ingested sstables overlap existing tables. There's also a minor benefit that a compaction setup no longer needs to open a point iterator and then immediately close it when initializing the range deletion iterators. ``` goos: linux goarch: amd64 pkg: github.com/cockroachdb/pebble cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ sec/op │ sec/op vs base │ NewItersAlloc 427.0n ± 1% 412.1n ± 1% -3.47% (p=0.000 n=10) │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ B/op │ B/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ allocs/op │ allocs/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal ```

For a couple years now, Pebble has not allowed the creation of untruncated range tombstones outside the context of virtual sstables. Additionally, since v22.2 untruncated range tombstones should all have been compacted and rewritten as truncated range tombstones. Virtual SSTables (which similarly allow a physical range deletion to exceed the bounds of the containing logical sstable) handle this case by truncating the tombstones at iteration time in the iterator returned by keyspan.Truncate. This leaves the merging iterator's delicate handling of untruncated range tombstones obsolete. This commit removes that complexity. In addition, as an extra safety precaution, the table stats collector's invariants-build assertion that range deletions are contained within their files' bounds is lifted into production builds as well. This provides a guarantee during store open that all tombstones are appropriately truncated. Relates to #2863.

This commit refactors the interface of the table cache `newIters` method to support creation of any of a sstable's iterators, including its point iterator, range deletion iterator and range key iterator. This is motivated by the refactor planned in cockroachdb#2863, which in turn requires refactoring some of the newIters call sites that determine whether ingested sstables overlap existing tables. There's also a minor benefit that a compaction setup no longer needs to open a point iterator and then immediately close it when initializing the range deletion iterators. ``` goos: linux goarch: amd64 pkg: github.com/cockroachdb/pebble cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ sec/op │ sec/op vs base │ NewItersAlloc 427.0n ± 1% 412.1n ± 1% -3.47% (p=0.000 n=10) │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ B/op │ B/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ allocs/op │ allocs/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal ```

This commit refactors the interface of the table cache `newIters` method to support creation of any of a sstable's iterators, including its point iterator, range deletion iterator and range key iterator. This is motivated by the refactor planned in #2863, which in turn requires refactoring some of the newIters call sites that determine whether ingested sstables overlap existing tables. There's also a minor benefit that a compaction setup no longer needs to open a point iterator and then immediately close it when initializing the range deletion iterators. ``` goos: linux goarch: amd64 pkg: github.com/cockroachdb/pebble cpu: Intel(R) Xeon(R) CPU @ 2.80GHz │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ sec/op │ sec/op vs base │ NewItersAlloc 427.0n ± 1% 412.1n ± 1% -3.47% (p=0.000 n=10) │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ B/op │ B/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal │ old-NewItersAlloc.txt │ new-NewItersAlloc.txt │ │ allocs/op │ allocs/op vs base │ NewItersAlloc 0.000 ± 0% 0.000 ± 0% ~ (p=1.000 n=10) ¹ ¹ all samples are equal ```

The levelIter has a concept of ignorable boundary keys. When a levelIter exhausts a file's point keys, it's possible that the same file still contains range deletions relevant to other levels of the LSM. The levelIter will, under some conditions, interleave the largest boundary key of the sstable into iteration as an 'ignorable boundary key,' so that the file's range deletions remain accessible until all other levels progress beyound the file's boundary. When block-property filters are in use, a file's point key iterator may become exhausted early, before the file's range deletions are irrelevant, even if the file's largest key is not a range deletion. To work around this subtlety, the sstable iterator previously would surface knowledge of whether any point keys may have been skipped through a MaybeFilteredKeys method. The levelIter used this method to determine when to interleave an ignorable largest boundary in case there may be additional relevant range deletions. This commit removes the conditioning on the output of MaybeFilteredKeys, instead unconditionally interleaving a synthetic boundary at a file's largest point key if the levelIter user is using the file's range deletion iterator. This simplifies the logic and removes a fragile reliance on the sstable's iterators accounting of when it may have filtered keys. Future work (cockroachdb#2863) will remove the need to interleave synthetic boundaries at all, instead interleaving the range deletion bounds themselves among point keys. Informs cockroachdb#2863. Close cockroachdb#3334.

The levelIter has a concept of ignorable boundary keys. When a levelIter exhausts a file's point keys, it's possible that the same file still contains range deletions relevant to other levels of the LSM. The levelIter will, under some conditions, interleave the largest boundary key of the sstable into iteration as an 'ignorable boundary key,' so that the file's range deletions remain accessible until all other levels progress beyound the file's boundary. When block-property filters are in use, a file's point key iterator may become exhausted early, before the file's range deletions are irrelevant, even if the file's largest key is not a range deletion. To work around this subtlety, the sstable iterator previously would surface knowledge of whether any point keys may have been skipped through a MaybeFilteredKeys method. The levelIter used this method to determine when to interleave an ignorable largest boundary in case there may be additional relevant range deletions. This commit removes the conditioning on the output of MaybeFilteredKeys, instead unconditionally interleaving a synthetic boundary at a file's largest point key if the levelIter user is using the file's range deletion iterator. This simplifies the logic and removes a fragile reliance on the sstable's iterators accounting of when it may have filtered keys. Future work (#2863) will remove the need to interleave synthetic boundaries at all, instead interleaving the range deletion bounds themselves among point keys. Informs #2863. Close #3334.

This commit refactors the implementation of getIter to be a little more understandable and avoid the unnecessary use of levelIter. Current supported format major versions guarantee that a user key is not split across sstables within a level. This ensures that Get (which only retrieves one individual user key) need only consult 1 sstable per level. This is somewhat motivated by cockroachdb#2863. Removing getIter's dependency on levelIter will make that refactor easier.

This commit refactors the implementation of getIter to be a little more understandable and avoid the unnecessary use of levelIter. Current supported format major versions guarantee that a user key is not split across sstables within a level. This ensures that Get (which only retrieves one individual user key) need only consult 1 sstable per level. This is somewhat motivated by #2863. Removing getIter's dependency on levelIter will make that refactor easier.

Adapt the InterleavingIter to support a mode that interleaves span end keys. Informs cockroachdb#2863.

Adapt the InterleavingIter to support a mode that interleaves span end keys. Informs #2863.

The merging iterator and the level iterator are more tightly coupled than desired. Most of this coupling is a consequence of range deletions. The merging iterator is responsible for applying range deletions across levels and requires that the level iterator not progress to a new file until the file's range deletions are no longer relevant to other levels. It does this by interleaving "synthetic" keys. When these keys reach the top of the heap, it signals that all other levels have progressed to a point where the current file's range deletions are no longer relevant. Previously, knowledge of which keys were interleaved "synthetic" keys was propagated indirectly—outside the internal iterator interface—through a pointer to a levelIterBoundaryContext struct with boolean fields. This commit instead always interleaves synthetic keys as range deletion exclusive sentinels. The merging iterator can infer which keys are synthetic by calling InternalKey.IsExclusiveSentinel. Propagating this information directly through the internal iterator interface is more direct and understandable. Beyond needing to know which keys are synthetic, the merging iterator also must know when user-imposed iteration bounds have been reached. Ordinarily when bounds have been reached, internal iterators become exhausted. Since the levelIter's file's range deletions may still be relevant, it cannot and it interleaves a "synthetic" key at the iteration bound. When this key reaches the top of the merging iterator's heap, there are no more keys within bounds and the merging iterator is exhausted. To make this determination, we add a simple key comparison. The use of a key comparison is expected to be acceptable, because it's only performed when a synthetic key reaches the top of the heap. This additional key comparison should ultimately be eliminated when cockroachdb#2863 is complete. Informs cockroachdb#2863.

The merging iterator and the level iterator are more tightly coupled than desired. Most of this coupling is a consequence of range deletions. The merging iterator is responsible for applying range deletions across levels and requires that the level iterator not progress to a new file until the file's range deletions are no longer relevant to other levels. It does this by interleaving "synthetic" keys. When these keys reach the top of the heap, it signals that all other levels have progressed to a point where the current file's range deletions are no longer relevant. Previously, knowledge of which keys were interleaved "synthetic" keys was propagated indirectly—outside the internal iterator interface—through a pointer to a levelIterBoundaryContext struct with boolean fields. This commit instead always interleaves synthetic keys as range deletion exclusive sentinels. The merging iterator can infer which keys are synthetic by calling InternalKey.IsExclusiveSentinel. Propagating this information directly through the internal iterator interface is more direct and understandable. Beyond needing to know which keys are synthetic, the merging iterator also must know when user-imposed iteration bounds have been reached. Ordinarily when bounds have been reached, internal iterators become exhausted. Since the levelIter's file's range deletions may still be relevant, it cannot and it interleaves a "synthetic" key at the iteration bound. When this key reaches the top of the merging iterator's heap, there are no more keys within bounds and the merging iterator is exhausted. To make this determination, we add a simple key comparison. The use of a key comparison is expected to be acceptable, because it's only performed when a synthetic key reaches the top of the heap. This additional key comparison should ultimately be eliminated when #2863 is complete. Informs #2863.

Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of cockroachdb#2112. Informs cockroachdb#2112. Informs cockroachdb#2863.

Refactor the overlap checking to avoid using the levelIter to populate the range deletion iterator. This was subtle and confusing. This refactor will also benefit the implementation of #2112. Informs #2112. Informs #2863.

Previously when the interleaving iterator interleaved a span boundary, the semantics of calling NextPrefix were indeterminate. Ordinarily NextPrefix guarantees that it advances to a key with a new prefix. The existing interleaving iterator implementation didn't guarantee this when positioned on a span boundary. Guaranteeing it would require otherwise unnecessary key comparisons. In existing use cases (for range keys, which always begin and end at prefix keys), we never call NextPrefix while positioned at a span boundary. With the planned refactor of cockroachdb#2863, we'll begin to use the interleaving iterator to interleave range deletions which may have bounds within the keys of a prefix. This assertion will ensure we don't unintentionally rely on currently undefined behavior of calling NextPrefix on a span boundary.

Previously, levelIter would sometimes interleave fake keys at file boundaries or user-provided iteration boundaries when a file contains range deletions. This commit reworks the levelIter to instead interleave the individual bounds of the range deletions themselves, using an interleaving iterator. This simplifies the levelIter. Because range deletions are now read in two places: once for interleaving bounds and once to expose a rangeDelIter to the mergingIter, this commit requires opening a file's range deletion iterator twice. This is an intermediary state until cockroachdb#2863 is completed when the mergingIter will consult the levelIter's interleaving iterator to determine which range deletions overlap the current iterator position. Informs cockroachdb#2863.

Previously when the interleaving iterator interleaved a span boundary, the semantics of calling NextPrefix were indeterminate. Ordinarily NextPrefix guarantees that it advances to a key with a new prefix. The existing interleaving iterator implementation didn't guarantee this when positioned on a span boundary. Guaranteeing it would require otherwise unnecessary key comparisons. In existing use cases (for range keys, which always begin and end at prefix keys), we never call NextPrefix while positioned at a span boundary. With the planned refactor of #2863, we'll begin to use the interleaving iterator to interleave range deletions which may have bounds within the keys of a prefix. This assertion will ensure we don't unintentionally rely on currently undefined behavior of calling NextPrefix on a span boundary.

Previously, levelIter would sometimes interleave fake keys at file boundaries or user-provided iteration boundaries when a file contains range deletions. This commit reworks the levelIter to instead interleave the individual bounds of the range deletions themselves, using an interleaving iterator. This simplifies the levelIter. Because range deletions are now read in two places: once for interleaving bounds and once to expose a rangeDelIter to the mergingIter, this commit requires opening a file's range deletion iterator twice. This is an intermediary state until cockroachdb#2863 is completed when the mergingIter will consult the levelIter's interleaving iterator to determine which range deletions overlap the current iterator position. Informs cockroachdb#2863.

Previously, levelIter would sometimes interleave fake keys at file boundaries or user-provided iteration boundaries when a file contains range deletions. This commit reworks the levelIter to instead interleave the individual bounds of the range deletions themselves, using an interleaving iterator. This simplifies the levelIter. Because range deletions are now read in two places: once for interleaving bounds and once to expose a rangeDelIter to the mergingIter, this commit requires opening a file's range deletion iterator twice. This is an intermediary state until #2863 is completed when the mergingIter will consult the levelIter's interleaving iterator to determine which range deletions overlap the current iterator position. Informs #2863.

Rework the merging iterator's handling of range deletions to use the interleaved range deletion boundary keys to determine when the iterator is positioned within a level's range deletion. This removes the direct manipulation of a range deletion keyspan.FragmentIterator from the mergingIter, delegating that to the child iterator's keyspan.InterleavingIter. This factoring is a little cleaner and decouples the mergingIter from the details of range deletion iterators, and in particular, the levelIter's individual sstables. It also should reduce key comparisons, especially during scans, by avoiding unnecessary key comparisons with range deletions that are loaded but outside the keyspace being iterated over. Close cockroachdb#2863.

jbowens added T-storage A-storage C-cleanup labels Aug 30, 2023

jbowens added this to Incoming in Storage via automation Aug 30, 2023

jbowens moved this from Incoming to 24.1 candidates in Storage Sep 11, 2023

jbowens mentioned this issue Jan 25, 2024

db: remove handling of untruncated range tombstones #3245

Merged

jbowens mentioned this issue Jan 26, 2024

db: refactor table cache newIters #3247

Merged

jbowens mentioned this issue Mar 29, 2024

db: unconditionally interleave ignorable boundaries #3475

Merged

jbowens mentioned this issue Apr 12, 2024

db: refactor Reader.Get #3507

Merged

jbowens mentioned this issue Apr 22, 2024

internal/keyspan: support interleaving end keys #3536

Merged

jbowens added a commit to jbowens/pebble that referenced this issue Apr 22, 2024

internal/keyspan: support interleaving end keys

69c70f2

Adapt the InterleavingIter to support a mode that interleaves span end keys. Informs cockroachdb#2863.

jbowens added a commit that referenced this issue Apr 22, 2024

internal/keyspan: support interleaving end keys

7704193

Adapt the InterleavingIter to support a mode that interleaves span end keys. Informs #2863.

jbowens mentioned this issue Apr 24, 2024

db: rework mergingIter and levelIter synthetic keys #3553

Merged

jbowens self-assigned this May 2, 2024

jbowens mentioned this issue May 2, 2024

db: refactor ingest overlap checking #3580

Merged

jbowens mentioned this issue May 6, 2024

internal/keyspan: prohibit NextPrefix at span boundary #3588

Merged

jbowens mentioned this issue May 7, 2024

db: interleave range deletion boundaries in levelIter #3596

Merged

jbowens linked a pull request May 10, 2024 that will close this issue

db: rework merging iterator range deletion handling #3600

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: consider reworking rangedel, megingIter and levelIter interactions #2863

db: consider reworking rangedel, megingIter and levelIter interactions #2863

jbowens commented Aug 30, 2023

jbowens commented Aug 30, 2023

db: consider reworking rangedel, megingIter and levelIter interactions #2863

db: consider reworking rangedel, megingIter and levelIter interactions #2863

Comments

jbowens commented Aug 30, 2023

jbowens commented Aug 30, 2023