objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

sumeerbhola · 2024-05-11T02:22:40Z

We consider two use cases for read-before when using a
objstorageprovider.remoteReadable, given the high latency (and cost) of
each read operation:

When a sstable.Reader is opened, it needs to read the footer, metaindex
block and meta properties block. It starts by reading the footer which is
at the end of the table and then proceeds to read the other two. Instead
of doing 3 tiny reads, we would like to do one read.
When a single-level or two-level iterator is opened, it reads the
(top-level) index block first. When the iterator is used, it will
typically follow this by reading the filter block (since SeeKPrefixGE is
common in CockroachDB). For a two-level iterator it will also read the
lower-level index blocks which are after the filter block and before the
top-level index block. It would be ideal if reading the top-level index
block read enough to include the filter block. And for two-level
iterators this would also include the lower-level index blocks.

In both use-cases we want the first read from the remoteReadable to do a
larger read, and read bytes earlier than the requested read, hence
"read-before". Subsequent reads from the remoteReadable can use the usual
readahead logic (for the second use-case above, this can help with
sequential reads of the lower-level index blocks when the read-before was
insufficient to satisfy such reads).

Since remoteReadHandle already has a buffer for read-ahead, we utilize
it for this read-before buffering pattern too.

While here, we bump up the read-ahead size for compactions to 8MB, given
the lower concurrency (and thereby higher tolerance to memory usage).

Informs #2328

cockroach-teamcity · 2024-05-11T02:22:47Z

This change is

sumeerbhola

First commit is from #3607.

I'll add a test if the approach looks ok.

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal)

RaduBerinde

This is a cool idea, I think it's worth doing it.

Reviewable status: 0 of 20 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

objstorage/objstorage.go line 41 at r2 (raw file):

	//
	// Multiple separate ReadHandles can be used.
	NewReadHandle(ctx context.Context, readBeforeSize ReadBeforeSize) ReadHandle

I wonder if this would be cleaner (and less intrusive) as a new method on ReadHandle, where it advises what happens on the next read. It would also be more flexible, where we could use it on more than the firs tread.

objstorage/objstorage.go line 48 at r2 (raw file):

// just a suggestion that the callee can ignore (and does ignore in
// fileReadable).
type ReadBeforeSize int64

The semantics of this seem pretty clear here but in the implementation this is not the number of bytes we read before, but the total size of the extended read. We should document the semantics correctly here. Perhaps also rename to LeftExtendedReadSize.

Or perhaps reframe it as an "area of interest" with an offset and a length (especially if we switch to a method on the ReadHandle).

objstorage/objstorage.go line 56 at r2 (raw file):

	// metaindex, properties. 32KB is unnecessarily large, but it is still small
	// when considering remote object storage.
	ReadBeforeForNewReader = 32 << 10

[nit] I find 32 * 1024 easier to read

objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

	// Also, since this is the first call, the buffer must be empty.
	if readBeforeSize > 0 {
		r.buffered.offset = offset - int64(readBeforeSize-len(p))

Since the exact size is speculative and we have some leeway, perhaps we should align the offset to a 4kb boundary?

sumeerbhola

TFTR!

Reviewable status: 0 of 20 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @RaduBerinde)

objstorage/objstorage.go line 41 at r2 (raw file):

Previously, RaduBerinde wrote…

I wonder if this would be cleaner (and less intrusive) as a new method on ReadHandle, where it advises what happens on the next read. It would also be more flexible, where we could use it on more than the firs tread.

Good point. I considered it, but the difficulty is that the logic of remoteReadHandle.ReadAt becomes more complicated when doing this for a later read. We need to then account for how much of the read is buffered, and if both read-ahead and read-before is active, which one to prefer, or do both. And there is no use case that guides us on what the behavior ought to be. With the first read, we know there is no read-ahead and nothing in the buffer. We can definitely extend this when there is a use-case, but for these reasons I've left this unchanged for now.

objstorage/objstorage.go line 48 at r2 (raw file):

Previously, RaduBerinde wrote…

The semantics of this seem pretty clear here but in the implementation this is not the number of bytes we read before, but the total size of the extended read. We should document the semantics correctly here. Perhaps also rename to LeftExtendedReadSize.

Or perhaps reframe it as an "area of interest" with an offset and a length (especially if we switch to a method on the ReadHandle).

I added the following comment since what is happening here is quite akin to how readahead works, at least in remoteReadHandle.

// When 0, the first read will only read what it is asked to read, say n
// bytes. When it is a value b > 0, if b > n, then the read will be padded by
// an additional b-n bytes to the left, resulting in an overall read size of
// b. This behavior is akin to what the read-ahead implementation does -- when
// the n bytes are not buffered, and there is read-ahead of b > n, the read
// length is b bytes.

I didn't change the name because of symmetry with readahead and some similarity.

objstorage/objstorage.go line 56 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I find 32 * 1024 easier to read

Done

objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

Previously, RaduBerinde wrote…

Since the exact size is speculative and we have some leeway, perhaps we should align the offset to a 4kb boundary?

ext4 block size is 4KB, and also the sector size for typical disks. But this is remote blob storage, which typically packs blobs into the same file, and has unknown alignment.

Our configured readBeforeSizes are multiples of 4KB and effective readBeforeSize will likely be the same (since the offset will typically be large compared to the configured readBeforeSize).

So I don't quite understand the benefit. Could you elaborate?

RaduBerinde

Reviewable status: 0 of 20 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @sumeerbhola)

objstorage/objstorage.go line 41 at r2 (raw file):

Previously, sumeerbhola wrote…

Good point. I considered it, but the difficulty is that the logic of remoteReadHandle.ReadAt becomes more complicated when doing this for a later read. We need to then account for how much of the read is buffered, and if both read-ahead and read-before is active, which one to prefer, or do both. And there is no use case that guides us on what the behavior ought to be. With the first read, we know there is no read-ahead and nothing in the buffer. We can definitely extend this when there is a use-case, but for these reasons I've left this unchanged for now.

Got it. We already have a method SetupForCompaction() that reconfigures the handle. We should be consistent and either 1) also use a method for read-before (with the specification that it can only be called before the handle is used) or 2) add a ReadHandleOptions struct and move "for compaction" to that.

objstorage/objstorageprovider/remote_readable.go line 163 at r2 (raw file):

// ReadAt is part of the objstorage.ReadHandle interface.
func (r *remoteReadHandle) ReadAt(ctx context.Context, p []byte, offset int64) error {
	readBeforeSize := int(r.readBeforeSize)

[nit] It would be easier if we calculated exactly how many extra bytes we're reading:

var extraBytesBefore int64
if r.readBeforeSize > 0 {
  if r.readBeforeSize > len(p) {
    extraBytesBefore = min(int64(r.readBeforeSize - len(p)), offset)
  }
  r.readBeforeSize = 0
}

objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

Previously, sumeerbhola wrote…

ext4 block size is 4KB, and also the sector size for typical disks. But this is remote blob storage, which typically packs blobs into the same file, and has unknown alignment.

Our configured readBeforeSizes are multiples of 4KB and effective readBeforeSize will likely be the same (since the offset will typically be large compared to the configured readBeforeSize).

So I don't quite understand the benefit. Could you elaborate?

Sorry, I guess I spaced out and was thinking of regular storage

objstorage/objstorageprovider/remote_readable.go line 182 at r2 (raw file):

		}
		// TODO(radu): we need to somehow account for this memory.
		if cap(r.buffered.data) >= readBeforeSize {

[nit] I'd move this block to a helper function readToBuffer(offset, length) and use it below as well

sumeerbhola

Test is ready.

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @RaduBerinde)

objstorage/objstorage.go line 41 at r2 (raw file):

Previously, RaduBerinde wrote…

Got it. We already have a method SetupForCompaction() that reconfigures the handle. We should be consistent and either 1) also use a method for read-before (with the specification that it can only be called before the handle is used) or 2) add a ReadHandleOptions struct and move "for compaction" to that.

I've added a TODO

// TODO(sumeer): both readBeforeSize and ReadHandle.SetupForCompaction are
// initial configuration of a ReadHandle. So they should both be passed as
// Options to NewReadHandle. But currently the latter is a separate method.
// This is because of how the sstable.Reader calls setupForCompaction on the
// iterators after they are constructed. Consider fixing this oddity.

objstorage/objstorageprovider/remote_readable.go line 163 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] It would be easier if we calculated exactly how many extra bytes we're reading:
var extraBytesBefore int64
if r.readBeforeSize > 0 {
  if r.readBeforeSize > len(p) {
    extraBytesBefore = min(int64(r.readBeforeSize - len(p)), offset)
  }
  r.readBeforeSize = 0
}

Done

objstorage/objstorageprovider/remote_readable.go line 182 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd move this block to a helper function readToBuffer(offset, length) and use it below as well

Done

RaduBerinde

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)

objstorage/objstorageprovider/remote_readable_test.go line 25 at r4 (raw file):

}

func (r *testObjectReader) init(size int) {

[nit] I think we could have used a logging in mem remote storage for this. No need to change anything heree, just mentioning it for the future.

sumeerbhola

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @RaduBerinde)

objstorage/objstorageprovider/remote_readable_test.go line 25 at r4 (raw file):

Previously, RaduBerinde wrote…

[nit] I think we could have used a logging in mem remote storage for this. No need to change anything heree, just mentioning it for the future.

Thanks for the pointer.

…x/filter blocks We consider two use cases for read-before when using a objstorageprovider.remoteReadable, given the high latency (and cost) of each read operation: - When a sstable.Reader is opened, it needs to read the footer, metaindex block and meta properties block. It starts by reading the footer which is at the end of the table and then proceeds to read the other two. Instead of doing 3 tiny reads, we would like to do one read. - When a single-level or two-level iterator is opened, it reads the (top-level) index block first. When the iterator is used, it will typically follow this by reading the filter block (since SeeKPrefixGE is common in CockroachDB). For a two-level iterator it will also read the lower-level index blocks which are after the filter block and before the top-level index block. It would be ideal if reading the top-level index block read enough to include the filter block. And for two-level iterators this would also include the lower-level index blocks. In both use-cases we want the first read from the remoteReadable to do a larger read, and read bytes earlier than the requested read, hence "read-before". Subsequent reads from the remoteReadable can use the usual readahead logic (for the second use-case above, this can help with sequential reads of the lower-level index blocks when the read-before was insufficient to satisfy such reads). Since remoteReadHandle already has a buffer for read-ahead, we utilize it for this read-before buffering pattern too. While here, we bump up the read-ahead size for compactions to 8MB, given the lower concurrency (and thereby higher tolerance to memory usage). Informs cockroachdb#2328

sumeerbhola requested a review from a team as a code owner May 11, 2024 02:22

sumeerbhola requested a review from itsbilal May 11, 2024 02:22

sumeerbhola commented May 11, 2024

View reviewed changes

sumeerbhola force-pushed the bigger_read branch 2 times, most recently from 1bc0866 to ba1b9d3 Compare May 12, 2024 16:21

RaduBerinde reviewed May 13, 2024

View reviewed changes

sumeerbhola requested a review from RaduBerinde May 14, 2024 16:48

sumeerbhola commented May 14, 2024

View reviewed changes

sumeerbhola force-pushed the bigger_read branch from ba1b9d3 to ca2072d Compare May 14, 2024 16:49

RaduBerinde approved these changes May 14, 2024

View reviewed changes

sumeerbhola force-pushed the bigger_read branch 2 times, most recently from fc743dd to 4c91ad1 Compare May 14, 2024 21:38

sumeerbhola requested a review from RaduBerinde May 14, 2024 21:38

sumeerbhola commented May 14, 2024

View reviewed changes

RaduBerinde approved these changes May 15, 2024

View reviewed changes

sumeerbhola requested a review from RaduBerinde May 15, 2024 16:40

sumeerbhola commented May 15, 2024

View reviewed changes

sumeerbhola force-pushed the bigger_read branch from 4c91ad1 to b98d087 Compare May 15, 2024 16:41

sumeerbhola merged commit c1cfad1 into cockroachdb:master May 15, 2024
11 checks passed

sumeerbhola mentioned this pull request May 16, 2024

sstable,objstorage: seed Reader with pre-stored metadata #2328

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

sumeerbhola commented May 11, 2024

cockroach-teamcity commented May 11, 2024

sumeerbhola left a comment

RaduBerinde left a comment

sumeerbhola left a comment

RaduBerinde left a comment

sumeerbhola left a comment

RaduBerinde left a comment

sumeerbhola left a comment

objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

Conversation

sumeerbhola commented May 11, 2024

cockroach-teamcity commented May 11, 2024

sumeerbhola left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment