Add a buffered paginated store implementation #31
Conversation
Looks pretty great :) Some comments, nothing significant. I suggested some comments for a couple of functions - I do that to help me understand the code, feel free to ignore or use!
```go
func (s *BufferedPaginatedStore) Bins() <-chan Bin {
```
Looks like this approach may leak if the calling routine doesn't complete: https://stackoverflow.com/a/12896013
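The leak the linked answer describes can be sketched as follows (the `Bin` type and the helper functions below are stand-ins for illustration, not the actual implementation): a goroutine that sends on an unbuffered channel blocks forever if the consumer stops receiving early. A common fix is to let the caller signal cancellation, for example with a done channel.

```go
package main

import "fmt"

type Bin struct{ index, count int }

// binsLeaky streams bins on an unbuffered channel. If the consumer stops
// receiving before the channel is drained, the sending goroutine blocks
// forever on `ch <- b` and leaks.
func binsLeaky(bins []Bin) <-chan Bin {
	ch := make(chan Bin)
	go func() {
		defer close(ch)
		for _, b := range bins {
			ch <- b // blocks forever if the receiver went away
		}
	}()
	return ch
}

// binsWithDone lets the caller abort the stream: the sender selects on the
// done channel, so it always terminates once the caller closes done.
func binsWithDone(done <-chan struct{}, bins []Bin) <-chan Bin {
	ch := make(chan Bin)
	go func() {
		defer close(ch)
		for _, b := range bins {
			select {
			case ch <- b:
			case <-done:
				return
			}
		}
	}()
	return ch
}

func main() {
	bins := []Bin{{0, 1}, {1, 2}, {2, 3}}

	done := make(chan struct{})
	first := <-binsWithDone(done, bins)
	close(done) // consumer stops early; the sender goroutine still exits
	fmt.Println(first.index, first.count) // 0 1
}
```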
Yeah, that seems totally reasonable, thanks!
Force-pushed from 86fcaf3 to 8ce015b
Description
`BufferedPaginatedStore` allocates storage for counts in aligned fixed-size pages, themselves stored in a dynamically sized slice. A page encodes the counts for a contiguous range of indexes, and two pages that are contiguous in the slice encode contiguous ranges. In addition, input indexes that are added to the store with a count equal to 1 can be stored in a buffer.

The store favors using the buffer and only creates a page when the memory size of the page is no greater than the memory space needed to keep in the buffer the indexes that the page could otherwise encode. As a result, some indexes may stay in the buffer indefinitely if removing them from the buffer would create an almost-empty page. The process that transfers indexes from the buffer to pages is called compaction.

This store never collapses or merges bins; therefore, it does not introduce any error by itself. In particular, `MinIndex()`, `MaxIndex()`, `Bins()` and `KeyAtRank()` return exact results.

There is no upper bound on the memory size that this store needs to encode input indexes, and some input data distributions may make it reach large sizes. However, thanks to the buffer and to the fact that only the required pages are allocated, it can be much more space-efficient than alternative stores, especially dense stores, in various situations: when only a few indexes are added (with counts equal to 1), when the input data has a few outliers, or when the input data distribution is multimodal.
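As a rough illustration of the paging scheme described above (the names `pageLen`, `pageIndex`, `offsetInPage` and `worthCompacting`, as well as the byte sizes, are hypothetical and not taken from the actual implementation), each input index maps to a page and an offset within that page, and a page is only worth materializing once it costs no more memory than the buffered indexes it would absorb:

```go
package main

import "fmt"

const pageLen = 32 // hypothetical page size, in counts per page

// pageIndex returns the index of the page that holds the given input index,
// using floor division so that negative indexes map correctly.
func pageIndex(index int) int {
	if index < 0 {
		return (index - pageLen + 1) / pageLen
	}
	return index / pageLen
}

// offsetInPage returns the position of the input index within its page.
func offsetInPage(index int) int {
	return index - pageIndex(index)*pageLen
}

// worthCompacting sketches the compaction criterion: create a page only when
// its memory size is no greater than that of the buffered indexes it would
// absorb. Assume a count takes countSize bytes and a buffered index takes
// indexSize bytes (hypothetical sizes).
func worthCompacting(bufferedInPage int) bool {
	const countSize, indexSize = 8, 4
	return pageLen*countSize <= bufferedInPage*indexSize
}

func main() {
	fmt.Println(pageIndex(70), offsetInPage(70)) // 2 6
	fmt.Println(pageIndex(-1), offsetInPage(-1)) // -1 31
	fmt.Println(worthCompacting(10), worthCompacting(100))
}
```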
Benchmarks
I benchmarked this store against the dense store (backed by a single count slice), the collapsing dense store (which limits the length of the count slice, hence introduces errors) and the sparse store (backed by a map). Given that the buffer can only be used when adding indexes with `count == 1`, I also benchmarked the buffered paginated store when adding indexes with non-integer counts (`buffered_paginated_non_int`).

The input size is the number of times `Add` or `AddWithCount` is called. The indexes are generated randomly, following a spread-out normal distribution (simulating a lognormal distribution for the input values of the sketch). The spread of the data may not represent real use cases, so I wouldn't pay too much attention to the absolute figures. The main point is to see how the behavior differs from the other stores.

Memory size
This is estimated in two ways (except for the sparse store), which both give similar figures. See `TestBenchmarkSize` in the code.

The buffered paginated store uses much less memory for small input sizes. Its size is roughly the same as the dense store's size for large cardinalities.
Allocated size
Because it avoids reallocating memory for already in-use ranges of indexes, it allocates much less memory than other stores, most notably the dense store.
Amortized `Add` duration

These durations exclude the time necessary to generate the input values.

The bump around 1000 for the buffered paginated store is most likely because that is roughly the input size at which the density of values makes it more advantageous (in terms of memory size) to create pages instead of using the buffer. Once pages are created, it gets faster again.
Left to do / possible improvements
- `pageLen`.