Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

objstorage,sstable: add read-before for reader creation and iter index/filter blocks #3608

Merged
merged 1 commit into from May 15, 2024

Conversation

sumeerbhola
Copy link
Collaborator

We consider two use cases for read-before when using a
objstorageprovider.remoteReadable, given the high latency (and cost) of
each read operation:

  • When a sstable.Reader is opened, it needs to read the footer, metaindex
    block and meta properties block. It starts by reading the footer which is
    at the end of the table and then proceeds to read the other two. Instead
    of doing 3 tiny reads, we would like to do one read.

  • When a single-level or two-level iterator is opened, it reads the
    (top-level) index block first. When the iterator is used, it will
    typically follow this by reading the filter block (since SeeKPrefixGE is
    common in CockroachDB). For a two-level iterator it will also read the
    lower-level index blocks which are after the filter block and before the
    top-level index block. It would be ideal if reading the top-level index
    block read enough to include the filter block. And for two-level
    iterators this would also include the lower-level index blocks.

In both use-cases we want the first read from the remoteReadable to do a
larger read, and read bytes earlier than the requested read, hence
"read-before". Subsequent reads from the remoteReadable can use the usual
readahead logic (for the second use-case above, this can help with
sequential reads of the lower-level index blocks when the read-before was
insufficient to satisfy such reads).

Since remoteReadHandle already has a buffer for read-ahead, we utilize
it for this read-before buffering pattern too.

While here, we bump up the read-ahead size for compactions to 8MB, given
the lower concurrency (and thereby higher tolerance to memory usage).

Informs #2328

@sumeerbhola sumeerbhola requested a review from a team as a code owner May 11, 2024 02:22
@sumeerbhola sumeerbhola requested a review from itsbilal May 11, 2024 02:22
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First commit is from #3607.

I'll add a test if the approach looks ok.

Reviewable status: 0 of 19 files reviewed, all discussions resolved (waiting on @itsbilal)

@sumeerbhola sumeerbhola force-pushed the bigger_read branch 2 times, most recently from 1bc0866 to ba1b9d3 Compare May 12, 2024 16:21
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool idea, I think it's worth doing it.

Reviewable status: 0 of 20 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


objstorage/objstorage.go line 41 at r2 (raw file):

	//
	// Multiple separate ReadHandles can be used.
	NewReadHandle(ctx context.Context, readBeforeSize ReadBeforeSize) ReadHandle

I wonder if this would be cleaner (and less intrusive) as a new method on ReadHandle, where it advises what happens on the next read. It would also be more flexible, where we could use it on more than the firs tread.


objstorage/objstorage.go line 48 at r2 (raw file):

// just a suggestion that the callee can ignore (and does ignore in
// fileReadable).
type ReadBeforeSize int64

The semantics of this seem pretty clear here but in the implementation this is not the number of bytes we read before, but the total size of the extended read. We should document the semantics correctly here. Perhaps also rename to LeftExtendedReadSize.

Or perhaps reframe it as an "area of interest" with an offset and a length (especially if we switch to a method on the ReadHandle).


objstorage/objstorage.go line 56 at r2 (raw file):

	// metaindex, properties. 32KB is unnecessarily large, but it is still small
	// when considering remote object storage.
	ReadBeforeForNewReader = 32 << 10

[nit] I find 32 * 1024 easier to read


objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

	// Also, since this is the first call, the buffer must be empty.
	if readBeforeSize > 0 {
		r.buffered.offset = offset - int64(readBeforeSize-len(p))

Since the exact size is speculative and we have some leeway, perhaps we should align the offset to a 4kb boundary?

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 0 of 20 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @RaduBerinde)


objstorage/objstorage.go line 41 at r2 (raw file):

Previously, RaduBerinde wrote…

I wonder if this would be cleaner (and less intrusive) as a new method on ReadHandle, where it advises what happens on the next read. It would also be more flexible, where we could use it on more than the firs tread.

Good point. I considered it, but the difficulty is that the logic of remoteReadHandle.ReadAt becomes more complicated when doing this for a later read. We need to then account for how much of the read is buffered, and if both read-ahead and read-before is active, which one to prefer, or do both. And there is no use case that guides us on what the behavior ought to be. With the first read, we know there is no read-ahead and nothing in the buffer. We can definitely extend this when there is a use-case, but for these reasons I've left this unchanged for now.


objstorage/objstorage.go line 48 at r2 (raw file):

Previously, RaduBerinde wrote…

The semantics of this seem pretty clear here but in the implementation this is not the number of bytes we read before, but the total size of the extended read. We should document the semantics correctly here. Perhaps also rename to LeftExtendedReadSize.

Or perhaps reframe it as an "area of interest" with an offset and a length (especially if we switch to a method on the ReadHandle).

I added the following comment since what is happening here is quite akin to how readahead works, at least in remoteReadHandle.

// When 0, the first read will only read what it is asked to read, say n
// bytes. When it is a value b > 0, if b > n, then the read will be padded by
// an additional b-n bytes to the left, resulting in an overall read size of
// b. This behavior is akin to what the read-ahead implementation does -- when
// the n bytes are not buffered, and there is read-ahead of b > n, the read
// length is b bytes.

I didn't change the name because of symmetry with readahead and some similarity.


objstorage/objstorage.go line 56 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I find 32 * 1024 easier to read

Done


objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

Previously, RaduBerinde wrote…

Since the exact size is speculative and we have some leeway, perhaps we should align the offset to a 4kb boundary?

ext4 block size is 4KB, and also the sector size for typical disks. But this is remote blob storage, which typically packs blobs into the same file, and has unknown alignment.

Our configured readBeforeSizes are multiples of 4KB and effective readBeforeSize will likely be the same (since the offset will typically be large compared to the configured readBeforeSize).

So I don't quite understand the benefit. Could you elaborate?

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 20 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @sumeerbhola)


objstorage/objstorage.go line 41 at r2 (raw file):

Previously, sumeerbhola wrote…

Good point. I considered it, but the difficulty is that the logic of remoteReadHandle.ReadAt becomes more complicated when doing this for a later read. We need to then account for how much of the read is buffered, and if both read-ahead and read-before is active, which one to prefer, or do both. And there is no use case that guides us on what the behavior ought to be. With the first read, we know there is no read-ahead and nothing in the buffer. We can definitely extend this when there is a use-case, but for these reasons I've left this unchanged for now.

Got it. We already have a method SetupForCompaction() that reconfigures the handle. We should be consistent and either 1) also use a method for read-before (with the specification that it can only be called before the handle is used) or 2) add a ReadHandleOptions struct and move "for compaction" to that.


objstorage/objstorageprovider/remote_readable.go line 163 at r2 (raw file):

// ReadAt is part of the objstorage.ReadHandle interface.
func (r *remoteReadHandle) ReadAt(ctx context.Context, p []byte, offset int64) error {
	readBeforeSize := int(r.readBeforeSize)

[nit] It would be easier if we calculated exactly how many extra bytes we're reading:

var extraBytesBefore int64
if r.readBeforeSize > 0 {
  if r.readBeforeSize > len(p) {
    extraBytesBefore = min(int64(r.readBeforeSize - len(p)), offset)
  }
  r.readBeforeSize = 0
}

objstorage/objstorageprovider/remote_readable.go line 176 at r2 (raw file):

Previously, sumeerbhola wrote…

ext4 block size is 4KB, and also the sector size for typical disks. But this is remote blob storage, which typically packs blobs into the same file, and has unknown alignment.

Our configured readBeforeSizes are multiples of 4KB and effective readBeforeSize will likely be the same (since the offset will typically be large compared to the configured readBeforeSize).

So I don't quite understand the benefit. Could you elaborate?

Sorry, I guess I spaced out and was thinking of regular storage


objstorage/objstorageprovider/remote_readable.go line 182 at r2 (raw file):

		}
		// TODO(radu): we need to somehow account for this memory.
		if cap(r.buffered.data) >= readBeforeSize {

[nit] I'd move this block to a helper function readToBuffer(offset, length) and use it below as well

@sumeerbhola sumeerbhola force-pushed the bigger_read branch 2 times, most recently from fc743dd to 4c91ad1 Compare May 14, 2024 21:38
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is ready.

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @RaduBerinde)


objstorage/objstorage.go line 41 at r2 (raw file):

Previously, RaduBerinde wrote…

Got it. We already have a method SetupForCompaction() that reconfigures the handle. We should be consistent and either 1) also use a method for read-before (with the specification that it can only be called before the handle is used) or 2) add a ReadHandleOptions struct and move "for compaction" to that.

I've added a TODO

// TODO(sumeer): both readBeforeSize and ReadHandle.SetupForCompaction are
// initial configuration of a ReadHandle. So they should both be passed as
// Options to NewReadHandle. But currently the latter is a separate method.
// This is because of how the sstable.Reader calls setupForCompaction on the
// iterators after they are constructed. Consider fixing this oddity.

objstorage/objstorageprovider/remote_readable.go line 163 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] It would be easier if we calculated exactly how many extra bytes we're reading:

var extraBytesBefore int64
if r.readBeforeSize > 0 {
  if r.readBeforeSize > len(p) {
    extraBytesBefore = min(int64(r.readBeforeSize - len(p)), offset)
  }
  r.readBeforeSize = 0
}

Done


objstorage/objstorageprovider/remote_readable.go line 182 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd move this block to a helper function readToBuffer(offset, length) and use it below as well

Done

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @sumeerbhola)


objstorage/objstorageprovider/remote_readable_test.go line 25 at r4 (raw file):

}

func (r *testObjectReader) init(size int) {

[nit] I think we could have used a logging in mem remote storage for this. No need to change anything heree, just mentioning it for the future.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 22 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @RaduBerinde)


objstorage/objstorageprovider/remote_readable_test.go line 25 at r4 (raw file):

Previously, RaduBerinde wrote…

[nit] I think we could have used a logging in mem remote storage for this. No need to change anything heree, just mentioning it for the future.

Thanks for the pointer.

…x/filter blocks

We consider two use cases for read-before when using a
objstorageprovider.remoteReadable, given the high latency (and cost) of
each read operation:

- When a sstable.Reader is opened, it needs to read the footer, metaindex
  block and meta properties block. It starts by reading the footer which is
  at the end of the table and then proceeds to read the other two. Instead
  of doing 3 tiny reads, we would like to do one read.

- When a single-level or two-level iterator is opened, it reads the
  (top-level) index block first. When the iterator is used, it will
  typically follow this by reading the filter block (since SeeKPrefixGE is
  common in CockroachDB). For a two-level iterator it will also read the
  lower-level index blocks which are after the filter block and before the
  top-level index block. It would be ideal if reading the top-level index
  block read enough to include the filter block. And for two-level
  iterators this would also include the lower-level index blocks.

In both use-cases we want the first read from the remoteReadable to do a
larger read, and read bytes earlier than the requested read, hence
"read-before". Subsequent reads from the remoteReadable can use the usual
readahead logic (for the second use-case above, this can help with
sequential reads of the lower-level index blocks when the read-before was
insufficient to satisfy such reads).

Since remoteReadHandle already has a buffer for read-ahead, we utilize
it for this read-before buffering pattern too.

While here, we bump up the read-ahead size for compactions to 8MB, given
the lower concurrency (and thereby higher tolerance to memory usage).

Informs cockroachdb#2328
@sumeerbhola sumeerbhola merged commit c1cfad1 into cockroachdb:master May 15, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants