
Add lossy aggregator mode to reduce contention #282

Open
wants to merge 10 commits into base: master

Conversation


@rishabh commented Jul 11, 2023

Background:
We make heavy use of extended aggregation to buffer metric samples; however, we see contention from time to time due to the bottleneck created by the shared buffer.

We don't want to use the channel mode, since the channel would have to be quite large to ensure we aren't dropping any samples, and it would also have to be manually configured and tuned for each environment.

This PR introduces a new lossy aggregator mode for distributions, histograms, and timings.


How it works is fairly straightforward:

  1. Use a sync.Pool to quickly grab a lossyBuffer with basically no contention.
  2. Add the metric sample to the lossy buffer, without a lock, since it's guaranteed that the running goroutine has sole access to the buffer.
  3. If the lossy buffer doesn't have enough samples, put the lossy buffer back into the pool.
  4. If the lossy buffer has enough samples, flush the lossy buffer into the primary metric context buffer (the same one used in the mutex aggregator mode). This requires grabbing a lock.

Essentially, we now buffer the writes, then grab the lock, flush quickly, and release it. What makes it lossy is that if a lossy buffer sits in the pool for too long, it may be reaped by the garbage collector, which loses all the samples it holds. However, in testing against real-world usage, we dropped less than 0.001% of metric samples.
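
To make the flow concrete, here is a minimal, self-contained sketch of the four steps above. It is an illustration only: the type and field names (lossyBuffer, aggregator, flushThreshold, and so on) are mine, not necessarily the PR's actual identifiers.

```go
package main

import "sync"

// metricSample stands in for whatever the client buffers per call.
type metricSample struct {
	name  string
	value float64
}

// lossyBuffer accumulates samples with no locking; between Get and Put each
// instance is owned by exactly one goroutine.
type lossyBuffer struct {
	samples []metricSample
}

const flushThreshold = 64 // illustrative batch size

// aggregator stands in for the primary, mutex-protected context buffer.
type aggregator struct {
	mu      sync.Mutex
	samples []metricSample
	pool    sync.Pool
}

func newAggregator() *aggregator {
	a := &aggregator{}
	a.pool.New = func() interface{} {
		return &lossyBuffer{samples: make([]metricSample, 0, flushThreshold)}
	}
	return a
}

// add follows the four steps from the description above.
func (a *aggregator) add(s metricSample) {
	buf := a.pool.Get().(*lossyBuffer)   // 1. cheap, mostly uncontended
	buf.samples = append(buf.samples, s) // 2. no lock: sole owner of buf

	if len(buf.samples) < flushThreshold {
		a.pool.Put(buf) // 3. not full yet, hand it back to the pool
		return
	}

	// 4. flush the whole batch under one short critical section.
	a.mu.Lock()
	a.samples = append(a.samples, buf.samples...)
	a.mu.Unlock()

	buf.samples = buf.samples[:0]
	a.pool.Put(buf)
}

func main() {
	a := newAggregator()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				a.add(metricSample{name: "requests", value: 1})
			}
		}()
	}
	wg.Wait()
	// Samples still sitting in pooled buffers here are the ones that can be
	// lost if the GC reaps those buffers before they fill up and flush.
}
```

Because each goroutine only touches the mutex once every flushThreshold samples, contention on the shared buffer drops roughly by that factor; the trade-off, as described above, is that partially filled buffers sitting in the pool can be collected by the GC before they ever flush.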


I've also changed a few other things:

  1. Use a metricContext type instead of a string for the buffered context maps. This is a nice win because we no longer need to concatenate metric names with tags, and we don't need to concatenate (read: allocate) anything at all when zero or one tag is used (see the sketch after this list).
  2. Use fastrand to avoid any contention on rand. This is not a big deal, since most users probably use a sample rate of 1. It works by calling the runtime's built-in fastrand function (the same one the runtime uses internally for map hashing). On Go <1.19 it requires two calls, combining two random uint32 values into a single random uint64.
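
For item 1, a sketch of what a struct key buys over a concatenated string key; the field layout here is illustrative, not necessarily the PR's:

```go
package main

import "fmt"

// metricContext is comparable, so it can be used directly as a map key;
// no "name|tags" string has to be built (and allocated) for every sample.
type metricContext struct {
	name string
	tags string // a single tag, a pre-joined tag string, or empty
}

func main() {
	counts := map[metricContext]int64{}
	counts[metricContext{name: "http.requests", tags: "env:prod"}]++
	counts[metricContext{name: "http.requests"}]++ // zero-tag case: no key allocation
	fmt.Println(counts)
}
```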
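For item 2, a sketch of the two-uint32-into-one-uint64 combination needed on Go <1.19, plus the sampling decision it would feed into. The PR reaches the runtime's fastrand via go:linkname; here math/rand.Uint32 stands in for it, and shouldSample is a hypothetical helper name used only for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// rand64 builds one random uint64 from two 32-bit draws, the workaround
// needed before Go 1.19 added a 64-bit fast random source to the runtime.
func rand64(rand32 func() uint32) uint64 {
	return uint64(rand32())<<32 | uint64(rand32())
}

// shouldSample turns the top 53 random bits into a float in [0, 1) and keeps
// the sample when that value falls below the configured rate.
func shouldSample(rate float64, rand32 func() uint32) bool {
	if rate >= 1 {
		return true // the common case: no randomness needed at all
	}
	return float64(rand64(rand32)>>11)/(1<<53) < rate
}

func main() {
	kept := 0
	for i := 0; i < 100000; i++ {
		if shouldSample(0.25, rand.Uint32) {
			kept++
		}
	}
	fmt.Println("kept roughly a quarter of the samples:", kept)
}
```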

gotContext, gotTags := getContextAndTags(test.name, test.tags)
assert.Equal(t, test.wantContext, gotContext)
assert.Equal(t, test.wantTags, gotTags)
b.Run(test.testName, func(b *testing.B) {
Contributor

I would be curious to see results for both versions. This is a crucial part of the client's performance, so it might be worth including them in the PR description.
I don't think there will be much difference, but I'm also wondering how the hashing performs once the context is used as a map key.
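
A minimal harness along those lines could look like the following, saved as a _test.go file and run with go test -bench . -benchmem. The struct mirrors the struct-key idea from the PR, but the names and tag values here are made up for illustration.

```go
package statsd_test

import "testing"

// metricContext is an illustrative stand-in for the PR's struct key type.
type metricContext struct {
	name string
	tags string
}

// BenchmarkStringKey measures the old approach: concatenating name and tags
// into a string key on every map access.
func BenchmarkStringKey(b *testing.B) {
	name, tags := "http.requests", "env:prod,region:us"
	m := make(map[string]int64)
	for i := 0; i < b.N; i++ {
		m[name+":"+tags]++ // the concatenation allocates each time
	}
}

// BenchmarkStructKey measures hashing a two-field struct key instead.
func BenchmarkStructKey(b *testing.B) {
	m := make(map[metricContext]int64)
	k := metricContext{name: "http.requests", tags: "env:prod,region:us"}
	for i := 0; i < b.N; i++ {
		m[k]++
	}
}
```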

Comment on lines +113 to +124
// AggregationNbSample is the total number of samples flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSample uint64
// AggregationNbSampleHistogram is the total number of samples for histograms flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleHistogram uint64
// AggregationNbSampleDistribution is the total number of samples for distributions flushed by the aggregator when
// either WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleDistribution uint64
// AggregationNbSampleTiming is the total number of samples for timings flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleTiming uint64
Contributor

It is a nice addition, but it would be lovely if you could send it as part of another PR. Since this is something eventually used by customers (we publicly maintain client-side telemetry documentation, and we'd need a follow-up task to document these new counters if we merge them), we try to keep things consistent between the different clients, and a separate PR might help the other clients' implementations.

Author

Ah, sorry, would you like me to make a separate PR to update the telemetry? I can't really separate the telemetry from the rest of the logic, since it's how we're tracking whether or not samples were dropped for the benchmarks.

Contributor

In all honesty, the best would be a separate PR for the telemetry and a separate one for the "rand" replacement.

The reasoning behind this is that they are the two things actively modifying the current behaviour of the library, while the lossy mode isn't really (it's fairly isolated because of the switch case).

If the telemetry can't be extracted into a different PR, it would still be preferable for the rand change to be extracted into a separate PR.

Author

Understood!

I'll get rid of the rand changes for now and introduce them in a separate, later PR.

return
}

m := pool.Get().(*lossyBuffer)
Contributor

Isn't this creating a real risk of OOMs? The latency win basically comes from here, since nothing blocks or makes the client wait on each metric submission call, but isn't that at the cost of eventually allocating a lot of lossy buffers? Have you had a chance to monitor/compare your app's RAM usage while submitting metrics? (There are different usage scenarios; yours may not be an issue here.)
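
One lightweight way to answer the RAM question, independent of this client's API, is to poll runtime.MemStats while the app submits metrics and compare the numbers with the lossy mode on and off; a sketch:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logHeap prints coarse heap figures at a fixed interval; comparing runs with
// the lossy mode enabled and disabled gives a first read on buffer growth.
func logHeap(interval time.Duration, stop <-chan struct{}) {
	var m runtime.MemStats
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			runtime.ReadMemStats(&m)
			log.Printf("heap_alloc=%d MiB heap_objects=%d num_gc=%d",
				m.HeapAlloc/1024/1024, m.HeapObjects, m.NumGC)
		}
	}
}

func main() {
	stop := make(chan struct{})
	go logHeap(10*time.Second, stop)
	// ... run the usual metric-submission workload here ...
	time.Sleep(time.Minute)
	close(stop)
}
```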
