
Add lossy aggregator mode to reduce contention #282

Open
wants to merge 10 commits into base: master

Conversation


@rishabh commented Jul 11, 2023

Background:
We make heavy use of extended aggregation to buffer metric samples; however, we see contention from time to time due to the bottleneck created by the shared buffer.

We don't want to use the channel mode, since the channel would have to be quite large to ensure we aren't dropping any samples, and it would also have to be manually configured and tuned for each environment.

This PR introduces a new lossy aggregator mode for distributions, histograms, and timings.


How it works is fairly straightforward:

  1. Use a sync.Pool to quickly grab a lossyBuffer with basically no contention.
  2. Add the metric sample to the lossy buffer, without a lock, since it's guaranteed that the running goroutine has sole access to the buffer.
  3. If the lossy buffer doesn't have enough samples, put the lossy buffer back into the pool.
  4. If the lossy buffer has enough samples, flush the lossy buffer into the primary metric context buffer (the same one used in the mutex aggregator mode). This requires grabbing a lock.

Essentially, we now buffer the writes, then grab the lock, flush quickly, and release it. What makes it lossy is that if a lossy buffer sits in the pool for too long, it may be reaped by the garbage collector, which loses all the samples it holds. However, in testing against real-world usage, we dropped less than 0.001% of metric samples.
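
To make the flow concrete, here is a minimal, self-contained sketch of the four steps above. It is an illustration only: the type and field names (lossyBuffer, aggregator, flushThreshold, and so on) are mine, not necessarily the PR's actual identifiers.

```go
package main

import "sync"

// metricSample stands in for whatever the client buffers per call.
type metricSample struct {
	name  string
	value float64
}

// lossyBuffer accumulates samples with no locking; between Get and Put each
// instance is owned by exactly one goroutine.
type lossyBuffer struct {
	samples []metricSample
}

const flushThreshold = 64 // illustrative batch size

// aggregator stands in for the primary, mutex-protected context buffer.
type aggregator struct {
	mu      sync.Mutex
	samples []metricSample
	pool    sync.Pool
}

func newAggregator() *aggregator {
	a := &aggregator{}
	a.pool.New = func() interface{} {
		return &lossyBuffer{samples: make([]metricSample, 0, flushThreshold)}
	}
	return a
}

// add follows the four steps from the description above.
func (a *aggregator) add(s metricSample) {
	buf := a.pool.Get().(*lossyBuffer)   // 1. cheap, mostly uncontended
	buf.samples = append(buf.samples, s) // 2. no lock: sole owner of buf

	if len(buf.samples) < flushThreshold {
		a.pool.Put(buf) // 3. not full yet, hand it back to the pool
		return
	}

	// 4. flush the whole batch under one short critical section.
	a.mu.Lock()
	a.samples = append(a.samples, buf.samples...)
	a.mu.Unlock()

	buf.samples = buf.samples[:0]
	a.pool.Put(buf)
}

func main() {
	a := newAggregator()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				a.add(metricSample{name: "requests", value: 1})
			}
		}()
	}
	wg.Wait()
	// Samples still sitting in pooled buffers here are the ones that can be
	// lost if the GC reaps those buffers before they fill up and flush.
}
```

Because each goroutine only touches the mutex once every flushThreshold samples, contention on the shared buffer drops roughly by that factor; the trade-off, as described above, is that partially filled buffers sitting in the pool can be collected by the GC before they ever flush.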


I've also changed a few other things:

  1. Use a metricContext type instead of a string for the buffered context maps. This is a nice win because we no longer need to concatenate metric names with tags, and we don't need to concatenate (read: allocate) anything at all when zero or one tag is used (see the sketch after this list).
  2. Use fastrand to avoid any contention on rand. This is not a big deal, since most users probably use a sample rate of 1. It works by calling the runtime's built-in fastrand function (the same one the runtime uses internally for map hashing). On Go <1.19 it requires two calls, combining two random uint32 values into a single random uint64.
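
For item 1, a sketch of what a struct key buys over a concatenated string key; the field layout here is illustrative, not necessarily the PR's:

```go
package main

import "fmt"

// metricContext is comparable, so it can be used directly as a map key;
// no "name|tags" string has to be built (and allocated) for every sample.
type metricContext struct {
	name string
	tags string // a single tag, a pre-joined tag string, or empty
}

func main() {
	counts := map[metricContext]int64{}
	counts[metricContext{name: "http.requests", tags: "env:prod"}]++
	counts[metricContext{name: "http.requests"}]++ // zero-tag case: no key allocation
	fmt.Println(counts)
}
```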
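For item 2, a sketch of the two-uint32-into-one-uint64 combination needed on Go <1.19, plus the sampling decision it would feed into. The PR reaches the runtime's fastrand via go:linkname; here math/rand.Uint32 stands in for it, and shouldSample is a hypothetical helper name used only for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// rand64 builds one random uint64 from two 32-bit draws, the workaround
// needed before Go 1.19 added a 64-bit fast random source to the runtime.
func rand64(rand32 func() uint32) uint64 {
	return uint64(rand32())<<32 | uint64(rand32())
}

// shouldSample turns the top 53 random bits into a float in [0, 1) and keeps
// the sample when that value falls below the configured rate.
func shouldSample(rate float64, rand32 func() uint32) bool {
	if rate >= 1 {
		return true // the common case: no randomness needed at all
	}
	return float64(rand64(rand32)>>11)/(1<<53) < rate
}

func main() {
	kept := 0
	for i := 0; i < 100000; i++ {
		if shouldSample(0.25, rand.Uint32) {
			kept++
		}
	}
	fmt.Println("kept roughly a quarter of the samples:", kept)
}
```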

gotContext, gotTags := getContextAndTags(test.name, test.tags)
assert.Equal(t, test.wantContext, gotContext)
assert.Equal(t, test.wantTags, gotTags)
b.Run(test.testName, func(b *testing.B) {
Contributor

I would be curious to see results for both versions. This is a crucial part of the client's performance, so it might be worth including them in the PR description.
I don't think there will be much difference, but I'm also wondering how the hashing performs once the context is used as a map key.
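
A minimal harness along those lines could look like the following, saved as a _test.go file and run with go test -bench . -benchmem. The struct mirrors the struct-key idea from the PR, but the names and tag values here are made up for illustration.

```go
package statsd_test

import "testing"

// metricContext is an illustrative stand-in for the PR's struct key type.
type metricContext struct {
	name string
	tags string
}

// BenchmarkStringKey measures the old approach: concatenating name and tags
// into a string key on every map access.
func BenchmarkStringKey(b *testing.B) {
	name, tags := "http.requests", "env:prod,region:us"
	m := make(map[string]int64)
	for i := 0; i < b.N; i++ {
		m[name+":"+tags]++ // the concatenation allocates each time
	}
}

// BenchmarkStructKey measures hashing a two-field struct key instead.
func BenchmarkStructKey(b *testing.B) {
	m := make(map[metricContext]int64)
	k := metricContext{name: "http.requests", tags: "env:prod,region:us"}
	for i := 0; i < b.N; i++ {
		m[k]++
	}
}
```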

Comment on lines +113 to +124
// AggregationNbSample is the total number of samples flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSample uint64
// AggregationNbSampleHistogram is the total number of samples for histograms flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleHistogram uint64
// AggregationNbSampleDistribution is the total number of samples for distributions flushed by the aggregator when
// either WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleDistribution uint64
// AggregationNbSampleTiming is the total number of samples for timings flushed by the aggregator when either
// WithClientSideAggregation or WithExtendedClientSideAggregation options are enabled.
AggregationNbSampleTiming uint64
Contributor

It is a nice addition, but it would be lovely if you could send it as part of another PR. Since this is something eventually used by customers (we publicly maintain client-side telemetry documentation, and we'd need a follow-up task to document these new counters if we merge them), we try to keep things consistent between the different clients, and a separate PR might help the other clients' implementations.

Author

Ah, sorry, would you like me to make a separate PR to update the telemetry? I can't really separate the telemetry from the rest of the logic, since it's how we're tracking whether or not samples were dropped for the benchmarks.

Contributor

In all honesty, the best would be a separate PR for the telemetry and a separate one for the "rand" replacement.

The reasoning behind this is that they are the two things actively modifying the current behaviour of the library, while the lossy mode isn't really (it's fairly isolated because of the switch case).

If the telemetry can't be extracted into a different PR, it would still be preferable for the rand change to be extracted into a separate PR.

Author

Understood!

I'll get rid of the rand changes for now and introduce them in a separate, later PR.

return
}

m := pool.Get().(*lossyBuffer)
Contributor

Isn't this creating a real risk of OOMs? The latency win basically comes from here, since nothing blocks or makes the client wait on each metric submission call, but isn't that at the cost of eventually allocating a lot of lossy buffers? Have you had a chance to monitor/compare your app's RAM usage while submitting metrics? (There are different usage scenarios; yours may not be an issue here.)
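
One lightweight way to answer the RAM question, independent of this client's API, is to poll runtime.MemStats while the app submits metrics and compare the numbers with the lossy mode on and off; a sketch:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logHeap prints coarse heap figures at a fixed interval; comparing runs with
// the lossy mode enabled and disabled gives a first read on buffer growth.
func logHeap(interval time.Duration, stop <-chan struct{}) {
	var m runtime.MemStats
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			runtime.ReadMemStats(&m)
			log.Printf("heap_alloc=%d MiB heap_objects=%d num_gc=%d",
				m.HeapAlloc/1024/1024, m.HeapObjects, m.NumGC)
		}
	}
}

func main() {
	stop := make(chan struct{})
	go logHeap(10*time.Second, stop)
	// ... run the usual metric-submission workload here ...
	time.Sleep(time.Minute)
	close(stop)
}
```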
