src/histogram: Make Histogram::observe atomic across collects #314
Merged
mxinden changed the title from "src/{histogram,atomic64}: Make Histogram::observe atomic across collects" to "src/histogram: Make Histogram::observe atomic across collects" on Apr 5, 2020
If an observe and a collect operation interleave, the latter should not expose a snapshot of the histogram that does not uphold all histogram invariants. For example, take the invariant that the overall observation counter should equal the sum of all bucket counters: an `observe` increases the overall counter, but before it updates a specific bucket counter a collect operation snapshots the histogram. This commit adds a basic unit test to check that the above does not happen. Signed-off-by: Max Inden <mail@max-inden.de>
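The interleaving described in this commit message can be reproduced deterministically on a single thread. The sketch below is illustrative only: `NaiveHistogram`, `snapshot` and `torn_snapshot` are hypothetical names, not the crate's types, and the `observe` steps are written out by hand so that the snapshot lands between them.

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};

/// Hypothetical stand-in for a histogram whose counters are updated one by one.
struct NaiveHistogram {
    count: AtomicU64,
    buckets: [AtomicU64; 2],
}

impl NaiveHistogram {
    fn new() -> Self {
        NaiveHistogram {
            count: AtomicU64::new(0),
            buckets: [AtomicU64::new(0), AtomicU64::new(0)],
        }
    }

    /// Reads the overall counter and the sum of all bucket counters.
    fn snapshot(&self) -> (u64, u64) {
        let count = self.count.load(SeqCst);
        let bucket_sum = self.buckets.iter().map(|b| b.load(SeqCst)).sum::<u64>();
        (count, bucket_sum)
    }
}

/// Plays out the problematic interleaving step by step.
fn torn_snapshot() -> (u64, u64) {
    let h = NaiveHistogram::new();
    h.count.fetch_add(1, SeqCst); // observe, step 1: overall counter
    let snap = h.snapshot(); // a collect lands between the two steps
    h.buckets[0].fetch_add(1, SeqCst); // observe, step 2: bucket counter
    snap
}

fn main() {
    let (count, bucket_sum) = torn_snapshot();
    // The snapshot violates "count == sum of bucket counters".
    assert_eq!((count, bucket_sum), (1, 0));
}
```

With real threads the same window exists, it is just hit non-deterministically.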
A histogram supports two main execution paths:
1. `observe`, which increases the overall observation counter, updates the observation sum and increases a single bucket counter.
2. `proto` (i.e. collecting the metric, from now on referred to as the collect operation), which snapshots the state of the histogram and exposes it as a Protobuf struct.
If an observe and a collect operation interleave, the latter could expose a snapshot of the histogram that does not uphold all histogram invariants. For example, take the invariant that the overall observation counter should equal the sum of all bucket counters: an `observe` increases the overall counter, but before it updates a specific bucket counter a collect operation snapshots the histogram.
This commit adjusts the `HistogramCore` implementation to make such race conditions impossible. It introduces the notion of shards: one hot shard for `observe` operations to record their observation and one cold shard for collect operations to collect a consistent snapshot of the histogram. `observe` operations hit the hot shard and record their observation. Collect operations switch hot and cold, wait for all `observe` calls to finish on the previously hot, now cold shard and then expose the consistent snapshot.
Signed-off-by: Max Inden <mail@max-inden.de>
Add a basic benchmark which spawns 4 background threads, each continuously calling `observe` 1_000 times and then `collect`. At the same time, call `observe` within the `Bencher::iter` closure to measure the impact of the background threads on the `observe` call. Signed-off-by: Max Inden <mail@max-inden.de>
mxinden force-pushed the atomic-histogram branch from df3c64d to 329c5d0 on April 20, 2020 15:28
mxinden force-pushed the atomic-histogram branch from 329c5d0 to 2faf0e0 on April 20, 2020 15:29
Sorry I missed this PR .. 🧐 @BusyJay Do you have suggestions about the Mutex in this PR?
No, the mutex seems simple and reasonable.
Rust's drop semantics can be confusing sometimes. E.g. `let _ = l.lock()` would drop the lock guard immediately, whereas `let _guard = l.lock()` would drop the guard in LIFO order at the end of the current scope. Instead of relying on the above guarantee with `let _guard`, drop the mutex guard explicitly, hopefully making this less error-prone in the future. Signed-off-by: Max Inden <mail@max-inden.de>
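The difference between the three forms can be observed with `Mutex::try_lock`. This is a small illustration using only the standard library; the function names are made up for the example and are not part of the patch. Recent compilers reject `let _ = <lock guard>` by default through the `let_underscore_lock` lint, so the first function opts out of it explicitly.

```rust
use std::sync::Mutex;

/// `let _ = ...` never binds the guard, so it is dropped at the end of the
/// statement and the mutex is free again on the next line.
#[allow(let_underscore_lock)]
fn free_after_underscore(l: &Mutex<u32>) -> bool {
    let _ = l.lock().unwrap();
    let free = l.try_lock().is_ok();
    free
}

/// `let _guard = ...` keeps the guard alive until the end of the scope,
/// so a second lock attempt fails while it is held.
fn held_with_named_guard(l: &Mutex<u32>) -> bool {
    let _guard = l.lock().unwrap();
    let held = l.try_lock().is_err();
    held
}

/// Dropping explicitly makes the end of the critical section visible in the code.
fn free_after_explicit_drop(l: &Mutex<u32>) -> bool {
    let guard = l.lock().unwrap();
    // ... critical section ...
    drop(guard);
    let free = l.try_lock().is_ok();
    free
}

fn main() {
    let l = Mutex::new(0u32);
    assert!(free_after_underscore(&l));
    assert!(held_with_named_guard(&l));
    assert!(free_after_explicit_drop(&l));
}
```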
mxinden force-pushed the atomic-histogram branch from 43a4a86 to 79a499f on June 19, 2020 12:35
lucab reviewed on Jun 19, 2020
lucab reviewed on Jul 9, 2020
Thanks for the review @lucab. Would you mind taking another look?
lucab approved these changes on Jul 14, 2020
Motivation
A histogram supports two main execution paths:
1. `observe` which increases the overall observation counter, updates the observation sum and increases a single bucket counter.
2. `collect` which snapshots the state of the histogram and exposes it as a Protobuf struct.
If an observe and a collect operation interleave, the latter could be exposing a snapshot of the histogram that does not uphold all histogram invariants. For example for the invariant that the overall observation counter should equal the sum of all bucket counters: Say that an `observe` increases the overall counter but before updating a specific bucket counter a `collect` operation snapshots the histogram.
The above race condition has been solved in the Golang Prometheus client with prometheus/client_golang#457 by introducing the notion of shards, one hot shard for `observe` operations to record their observation and one cold shard for collect operations to collect a consistent snapshot of the histogram. `observe` operations hit the hot shard and record their observation. Collect operations switch hot and cold, wait for all `observe` calls to finish on the previously hot now cold shard and then expose the consistent snapshot.
This pull request ports prometheus/client_golang#457 to the Rust Prometheus client.
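The hot/cold shard scheme described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this pull request: `ShardedHistogram`, `Shard` and all field names are hypothetical, the observation sum is left out, buckets are counted non-cumulatively, and every atomic operation uses `SeqCst` for simplicity rather than the orderings the patch actually chooses.

```rust
use std::sync::atomic::{AtomicU64, Ordering::SeqCst};
use std::sync::{Arc, Mutex};
use std::thread;

const HOT_BIT: u64 = 1 << 63; // top bit: index of the hot shard
const COUNT_MASK: u64 = HOT_BIT - 1; // lower 63 bits: observations started

struct Shard {
    count: AtomicU64, // observations fully recorded in this shard
    buckets: Vec<AtomicU64>,
}

impl Shard {
    fn new(n: usize) -> Self {
        Shard { count: AtomicU64::new(0), buckets: (0..n).map(|_| AtomicU64::new(0)).collect() }
    }
}

struct ShardedHistogram {
    count_and_hot_idx: AtomicU64,
    shards: [Shard; 2],
    collect_lock: Mutex<()>,
    upper_bounds: Vec<f64>,
}

impl ShardedHistogram {
    fn new(upper_bounds: Vec<f64>) -> Self {
        let n = upper_bounds.len() + 1; // one extra bucket for values above all bounds
        ShardedHistogram {
            count_and_hot_idx: AtomicU64::new(0),
            shards: [Shard::new(n), Shard::new(n)],
            collect_lock: Mutex::new(()),
            upper_bounds,
        }
    }

    /// Lock-free: one atomic add picks the hot shard and registers the observation.
    fn observe(&self, v: f64) {
        let n = self.count_and_hot_idx.fetch_add(1, SeqCst);
        let hot = &self.shards[(n >> 63) as usize];
        let i = self.upper_bounds.iter().position(|b| v <= *b).unwrap_or(self.upper_bounds.len());
        hot.buckets[i].fetch_add(1, SeqCst);
        hot.count.fetch_add(1, SeqCst); // last step: marks this observation as finished
    }

    /// Returns (overall count, per-bucket counts) as one consistent snapshot.
    fn collect(&self) -> (u64, Vec<u64>) {
        let guard = self.collect_lock.lock().unwrap(); // collects run one at a time
        // Adding 1 << 63 wraps around and thereby flips the hot shard index.
        let n = self.count_and_hot_idx.fetch_add(HOT_BIT, SeqCst);
        let started = n & COUNT_MASK;
        let cold_idx = (n >> 63) as usize; // previously hot, now cold
        let (cold, hot) = (&self.shards[cold_idx], &self.shards[1 - cold_idx]);
        // Wait for in-flight observes on the now cold shard to finish.
        while cold.count.load(SeqCst) != started {
            thread::yield_now();
        }
        let buckets: Vec<u64> = cold.buckets.iter().map(|b| b.load(SeqCst)).collect();
        // Fold the cold shard into the hot one so the hot shard always holds the totals.
        hot.count.fetch_add(started, SeqCst);
        cold.count.store(0, SeqCst);
        for (h, c) in hot.buckets.iter().zip(&cold.buckets) {
            h.fetch_add(c.swap(0, SeqCst), SeqCst);
        }
        drop(guard);
        (started, buckets)
    }
}

fn main() {
    let h = Arc::new(ShardedHistogram::new(vec![1.0, 2.0]));
    let workers: Vec<_> = (0..4)
        .map(|t| {
            let h = Arc::clone(&h);
            thread::spawn(move || {
                for i in 0..10_000 {
                    h.observe(((i + t) % 4) as f64);
                }
            })
        })
        .collect();
    for _ in 0..100 {
        let (count, buckets) = h.collect();
        assert_eq!(count, buckets.iter().sum::<u64>()); // invariant holds mid-flight
    }
    for w in workers {
        w.join().unwrap();
    }
    assert_eq!(h.collect().0, 40_000);
}
```

Incrementing the shard's own `count` as the final step of `observe` is what lets `collect` know when every observation registered before the switch has landed.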
Content of the pull request
The pull request contains three commits:
1. src/histogram: Add test ensuring Histogram::observe is atomic
   Showcasing the above race condition in the current implementation.
2. src/{histogram,atomic64}: Make Histogram::observe atomic across collects
   Porting prometheus/client_golang#457 ("Lock-free atomic observations in Histograms!") to fix the race condition.
3. benches/histogram: Add benchmark for concurrent observe and collect
   Adding a benchmark to show the impact of the patch. While the benchmark does not show a performance impact through the patch on my laptop, I am happy to test this more thoroughly on a larger machine (128 cores) in case there is general interest to accept this patch.
Greater picture
Fixing this race condition is especially attractive now that with Prometheus `v2.17.0` the isolation level has been increased (See changelog entry below and prometheus/prometheus#6841).
Trade-off
While this pull request fixes the race condition described above, it does increase complexity:
- Introduction of the notion of shards.
- The `observe` code path is mostly untouched other than one additional atomic operation and increased `Ordering` levels and thus stays lock-free.
- `collect` operations need to happen sequentially (enforced through a single Mutex). A single `collect` and multiple `observe` operations can still operate concurrently. Given that the `collect` operation should happen rarely (at intervals > 1s) this should not introduce a performance impact.
- In order to coordinate hot and cold shards the 64 bit histogram counter is split into a 1 bit shard index and 63 bit counter. Thus the amount of observations a histogram can record is divided by two. While this might sound like an issue, one could still record one observation per millisecond for 292_277_266 years.
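The counter split and the capacity estimate can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the year length is an assumption (365 days); with that assumption the result lands at roughly 292 million years, the same order as the figure quoted above, with the exact digits depending on the year length chosen.

```rust
const HOT_BIT: u64 = 1 << 63; // 1 bit shard index
const COUNT_MASK: u64 = HOT_BIT - 1; // 63 bit counter

/// Splits the packed 64 bit value into (shard index, observation count).
fn split(n: u64) -> (usize, u64) {
    ((n >> 63) as usize, n & COUNT_MASK)
}

/// Whole years it takes to exhaust the 63 bit counter at one observation
/// per millisecond, assuming 365-day years.
fn years_at_one_observation_per_ms() -> u64 {
    const MS_PER_YEAR: u64 = 1000 * 60 * 60 * 24 * 365;
    COUNT_MASK / MS_PER_YEAR
}

fn main() {
    assert_eq!(split(5), (0, 5));
    assert_eq!(split(HOT_BIT | 5), (1, 5));
    // Giving up one bit halves the number of recordable observations.
    assert_eq!(COUNT_MASK, u64::MAX / 2);
    println!("{} years", years_at_one_observation_per_ms());
}
```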
I hope the above description makes sense. Let me know if this is something you are willing to accept into `master`.
Thanks a bunch for maintaining this library!