
Add a buffered paginated store implementation #31

Merged (7 commits into master, Jun 15, 2021)

Conversation

@CharlesMasson (Contributor) commented on May 20, 2021

Description

BufferedPaginatedStore allocates storage for counts in aligned fixed-size pages, themselves stored in a dynamically-sized slice. A page encodes the counts for a contiguous range of indexes, and two pages that are contiguous in the slice encode ranges that are contiguous. In addition, input indexes that are added to the store with a count equal to 1 can be stored in a buffer.
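As a rough illustration of the aligned page layout (using a hypothetical pageLen of 32 and simplified logic, not the actual implementation), mapping a bucket index to its page and in-page offset can be sketched as:

```go
package main

import "fmt"

const pageLen = 32 // hypothetical page length, for illustration only

// pageOf returns the page number and the offset within that page for a
// given (possibly negative) bucket index, using floor division so that
// pages stay aligned on multiples of pageLen.
func pageOf(index int) (page, offset int) {
	page = index / pageLen
	offset = index % pageLen
	if offset < 0 { // Go truncates toward zero; adjust for negative indexes
		page--
		offset += pageLen
	}
	return page, offset
}

func main() {
	fmt.Println(pageOf(70)) // 2 6
	fmt.Println(pageOf(-1)) // -1 31
}
```

Because two pages that are adjacent in the slice cover adjacent index ranges, looking up a count is two cheap operations: index into the page slice, then index into the page.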
The store favors the buffer and only creates a page when the memory size of the page is no greater than the buffer space needed to hold the indexes that the page would otherwise encode. This means that some indexes may stay in the buffer indefinitely if removing them would create a nearly empty page. The process that transfers indexes from the buffer to pages is called compaction. This store never collapses or merges bins; therefore, it introduces no error by itself. In particular, MinIndex(), MaxIndex(), Bins() and KeyAtRank() return exact results.
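The compaction trade-off described above can be sketched as follows, with hypothetical sizes (a pageLen of 32, 8-byte counts, and 8-byte buffered indexes); the actual constants in the store may differ:

```go
package main

import "fmt"

// Hypothetical sizes, for illustration only: a page of pageLen counts
// costs pageLen*8 bytes (float64 counts), and each buffered index costs
// 8 bytes (an int).
const (
	pageLen         = 32
	pageMemSize     = pageLen * 8 // 256 bytes
	bufferEntrySize = 8
)

// worthCompacting reports whether moving n buffered indexes that fall
// into the same page range out of the buffer saves memory: the page is
// created only when it costs no more than keeping the indexes buffered.
func worthCompacting(n int) bool {
	return pageMemSize <= n*bufferEntrySize
}

func main() {
	fmt.Println(worthCompacting(10)) // false: 80 buffered bytes < 256-byte page
	fmt.Println(worthCompacting(40)) // true: 320 buffered bytes >= 256-byte page
}
```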
There is no upper bound on the memory size that this store needs to encode input indexes, and some input data distributions may make it grow large. However, thanks to the buffer and the fact that only the required pages are allocated, it can be much more space-efficient than alternative stores, especially dense stores, in various situations: when only a few indexes are added (each with a count of 1), when the input data has a few outliers, or when the input data distribution is multimodal.

Benchmarks

I benchmarked this store against the dense store (backed by a single count slice), the collapsing dense store (which limits the length of the count slice and hence introduces errors), and the sparse store (backed by a map). Given that the buffer can only be used when adding indexes with a count of 1, I also benchmarked the buffered paginated store when adding indexes with non-integer counts (buffered_paginated_non_int).

The input size is the number of times Add or AddWithCount is called. The indexes are generated randomly, following a spread-out normal distribution (simulating a lognormal distribution for the input values of the sketch). The spread of the data may not represent real use cases, so I wouldn't pay too much attention to absolute figures. The main point is to compare the behavior against the other stores.

Memory size

Memory size is estimated in two ways (except for the sparse store), both of which give similar figures. See TestBenchmarkSize in the code.

[plot: memory size vs. input size]

The buffered paginated store is much smaller in memory for small input sizes. Its size is roughly the same as the dense store's size for large cardinalities.

Allocated size

[plot: allocated size vs. input size]

Because it avoids reallocating memory for ranges of indexes that are already in use, the buffered paginated store allocates much less memory than the other stores, most notably the dense store.

Amortized Add duration

Those durations exclude the time necessary to generate input values.

[plot: amortized Add duration vs. input size]

The bump around an input size of 1000 for the buffered paginated store most likely occurs at the point where the density of values becomes high enough that creating pages is more space-efficient than keeping indexes in the buffer. Once pages are created, adding becomes fast again.

Left to do / possible improvements

  • Tune pageLen.
  • Make serialization/deserialization more performant.
  • Benchmark merging, make it more performant if needed.
  • Consider allowing capping the memory size (by collapsing on one side or uniformly, that is, merging contiguous buckets).

Base automatically changed from cmasson/sparse_store to master June 3, 2021 09:31
@heyronhay left a comment


Looks pretty great :) Some comments, nothing significant. I suggested some comments for a couple of functions - I do that to help me understand the code, feel free to ignore or use!

ddsketch/store/buffered_paginated.go (3 resolved comment threads)
```go
	}
}

func (s *BufferedPaginatedStore) Bins() <-chan Bin {
```


Looks like this approach may leak if the calling routine doesn't complete: https://stackoverflow.com/a/12896013
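To illustrate the leak (with simplified types, not the actual ddsketch ones): the goroutine sending on the channel blocks forever if the consumer abandons the channel early, whereas a callback-based iterator runs on the caller's goroutine and can stop safely:

```go
package main

import "fmt"

type bin struct {
	index int
	count float64
}

// bins mimics the channel-based iterator: the sending goroutine leaks
// if the reader stops receiving before the channel is drained.
func bins(data []bin) <-chan bin {
	ch := make(chan bin)
	go func() {
		defer close(ch)
		for _, b := range data {
			ch <- b // blocks forever if the reader goes away
		}
	}()
	return ch
}

// forEach avoids the leak: iteration happens on the caller's goroutine
// and stops as soon as the callback returns true.
func forEach(data []bin, f func(index int, count float64) (stop bool)) {
	for _, b := range data {
		if f(b.index, b.count) {
			return
		}
	}
}

func main() {
	data := []bin{{1, 2}, {3, 4}, {5, 6}}

	// Safe only because we drain the channel completely.
	total := 0.0
	for b := range bins(data) {
		total += b.count
	}
	fmt.Println(total) // 12

	// Stopping early is safe with the callback: no goroutine is left behind.
	var first int
	forEach(data, func(index int, count float64) bool {
		first = index
		return true // stop after the first bin
	})
	fmt.Println(first) // 1
}
```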

@CharlesMasson (Contributor, Author) replied:

Yes indeed. However, this is part of the Store interface, so we cannot remove it easily. If that's fine, I'll see whether we can remove it in a separate PR, and use the newly added ForEach instead. Meanwhile, I added a comment in 86fcaf3.


Yeah, that seems totally reasonable, thanks!

ddsketch/store/buffered_paginated.go (outdated, resolved)
@CharlesMasson force-pushed the cmasson/buffered_paginated_store branch from 86fcaf3 to 8ce015b on June 14, 2021.
@CharlesMasson merged commit 13cd53c into master on Jun 15, 2021.
@CharlesMasson deleted the cmasson/buffered_paginated_store branch on June 15, 2021.