
Add a buffered paginated store implementation #31

Merged (7 commits into master, Jun 15, 2021)

Conversation

@CharlesMasson (Contributor) commented on May 20, 2021

Description

BufferedPaginatedStore allocates storage for counts in aligned fixed-size pages, themselves stored in a dynamically-sized slice. A page encodes the counts for a contiguous range of indexes, and two pages that are contiguous in the slice encode ranges that are contiguous. In addition, input indexes that are added to the store with a count equal to 1 can be stored in a buffer.
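As a rough illustration of the aligned page layout (using a hypothetical pageLen of 32 and simplified logic, not the actual implementation), mapping a bucket index to its page and in-page offset can be sketched as:

```go
package main

import "fmt"

const pageLen = 32 // hypothetical page length, for illustration only

// pageOf returns the page number and the offset within that page for a
// given (possibly negative) bucket index, using floor division so that
// pages stay aligned on multiples of pageLen.
func pageOf(index int) (page, offset int) {
	page = index / pageLen
	offset = index % pageLen
	if offset < 0 { // Go truncates toward zero; adjust for negative indexes
		page--
		offset += pageLen
	}
	return page, offset
}

func main() {
	fmt.Println(pageOf(70)) // 2 6
	fmt.Println(pageOf(-1)) // -1 31
}
```

Because two pages that are adjacent in the slice cover adjacent index ranges, looking up a count is two cheap operations: index into the page slice, then index into the page.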
The store favors the buffer and only creates a page when the memory size of the page is no greater than the buffer space needed to hold the indexes that the page would otherwise encode. This means that some indexes may stay in the buffer indefinitely if removing them would create a nearly empty page. The process that transfers indexes from the buffer to pages is called compaction. This store never collapses or merges bins; therefore, it introduces no error by itself. In particular, MinIndex(), MaxIndex(), Bins() and KeyAtRank() return exact results.
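The compaction trade-off described above can be sketched as follows, with hypothetical sizes (a pageLen of 32, 8-byte counts, and 8-byte buffered indexes); the actual constants in the store may differ:

```go
package main

import "fmt"

// Hypothetical sizes, for illustration only: a page of pageLen counts
// costs pageLen*8 bytes (float64 counts), and each buffered index costs
// 8 bytes (an int).
const (
	pageLen         = 32
	pageMemSize     = pageLen * 8 // 256 bytes
	bufferEntrySize = 8
)

// worthCompacting reports whether moving n buffered indexes that fall
// into the same page range out of the buffer saves memory: the page is
// created only when it costs no more than keeping the indexes buffered.
func worthCompacting(n int) bool {
	return pageMemSize <= n*bufferEntrySize
}

func main() {
	fmt.Println(worthCompacting(10)) // false: 80 buffered bytes < 256-byte page
	fmt.Println(worthCompacting(40)) // true: 320 buffered bytes >= 256-byte page
}
```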
There is no upper bound on the memory size that this store needs to encode input indexes, and some input data distributions may make it grow large. However, thanks to the buffer and the fact that only the required pages are allocated, it can be much more space-efficient than alternative stores, especially dense stores, in various situations: when only a few indexes are added (each with a count of 1), when the input data has a few outliers, or when the input data distribution is multimodal.

Benchmarks

I benchmarked this store against the dense store (backed by a single count slice), the collapsing dense store (which limits the length of the count slice and hence introduces errors), and the sparse store (backed by a map). Given that the buffer can only be used when adding indexes with a count of 1, I also benchmarked the buffered paginated store when adding indexes with non-integer counts (buffered_paginated_non_int).

The input size is the number of times Add or AddWithCount is called. The indexes are generated randomly, following a spread-out normal distribution (simulating a lognormal distribution for the input values of the sketch). The spread of the data may not represent real use cases, so I wouldn't pay too much attention to absolute figures. The main point is to compare the behavior against the other stores.

Memory size

Memory size is estimated in two ways (except for the sparse store), both of which give similar figures. See TestBenchmarkSize in the code.

[plot: memory size vs. input size]

The buffered paginated store is much smaller in memory for small input sizes. Its size is roughly the same as the dense store's size for large cardinalities.

Allocated size

[plot: allocated size vs. input size]

Because it avoids reallocating memory for ranges of indexes that are already in use, the buffered paginated store allocates much less memory than the other stores, most notably the dense store.

Amortized Add duration

Those durations exclude the time necessary to generate input values.

[plot: amortized Add duration vs. input size]

The bump around an input size of 1000 for the buffered paginated store most likely occurs at the point where the density of values becomes high enough that creating pages is more space-efficient than keeping indexes in the buffer. Once pages are created, adding becomes fast again.

Left to do / possible improvements

  • Tune pageLen.
  • Make serialization/deserialization more performant.
  • Benchmark merging, make it more performant if needed.
  • Consider allowing capping the memory size (by collapsing on one side or uniformly, that is, merging contiguous buckets).

Base automatically changed from cmasson/sparse_store to master June 3, 2021 09:31
@heyronhay left a comment


Looks pretty great :) Some comments, nothing significant. I suggested some comments for a couple of functions - I do that to help me understand the code, feel free to ignore or use!

ddsketch/store/buffered_paginated.go (3 resolved comment threads)
```go
	}
}

func (s *BufferedPaginatedStore) Bins() <-chan Bin {
```


Looks like this approach may leak if the calling routine doesn't complete: https://stackoverflow.com/a/12896013
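To illustrate the leak (with simplified types, not the actual ddsketch ones): the goroutine sending on the channel blocks forever if the consumer abandons the channel early, whereas a callback-based iterator runs on the caller's goroutine and can stop safely:

```go
package main

import "fmt"

type bin struct {
	index int
	count float64
}

// bins mimics the channel-based iterator: the sending goroutine leaks
// if the reader stops receiving before the channel is drained.
func bins(data []bin) <-chan bin {
	ch := make(chan bin)
	go func() {
		defer close(ch)
		for _, b := range data {
			ch <- b // blocks forever if the reader goes away
		}
	}()
	return ch
}

// forEach avoids the leak: iteration happens on the caller's goroutine
// and stops as soon as the callback returns true.
func forEach(data []bin, f func(index int, count float64) (stop bool)) {
	for _, b := range data {
		if f(b.index, b.count) {
			return
		}
	}
}

func main() {
	data := []bin{{1, 2}, {3, 4}, {5, 6}}

	// Safe only because we drain the channel completely.
	total := 0.0
	for b := range bins(data) {
		total += b.count
	}
	fmt.Println(total) // 12

	// Stopping early is safe with the callback: no goroutine is left behind.
	var first int
	forEach(data, func(index int, count float64) bool {
		first = index
		return true // stop after the first bin
	})
	fmt.Println(first) // 1
}
```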

@CharlesMasson (Contributor, Author) replied:

Yes indeed. However, this is part of the Store interface, so we cannot remove it easily. If that's fine, I'll see whether we can remove it in a separate PR, and use the newly added ForEach instead. Meanwhile, I added a comment in 86fcaf3.


Yeah, that seems totally reasonable, thanks!

ddsketch/store/buffered_paginated.go (outdated, resolved)
@CharlesMasson force-pushed the cmasson/buffered_paginated_store branch from 86fcaf3 to 8ce015b on June 14, 2021.
@CharlesMasson merged commit 13cd53c into master on Jun 15, 2021.
@CharlesMasson deleted the cmasson/buffered_paginated_store branch on June 15, 2021.