
Statsd meter recordings are dropped when submitted from parallel threads #2880

Closed
hdv opened this issue Dec 1, 2021 · 17 comments · May be fixed by #2892
Labels
bug A general bug registry: statsd A StatsD Registry related issue regression A regression from a previous release
Milestone
1.8.1

Comments

@hdv
Contributor

hdv commented Dec 1, 2021

Describe the bug
The following test fails when added to StatsdMeterRegistryPublishTest:

    @ParameterizedTest
    @EnumSource(StatsdProtocol.class)
    void receiveParallelMetricsSuccessfully(StatsdProtocol protocol) throws InterruptedException {
        final int N = 10;

        skipUdsTestOnWindows(protocol);
        serverLatch = new CountDownLatch(N);
        server = startServer(protocol, 0);
        final int port = getPort(protocol);

        meterRegistry = new StatsdMeterRegistry(getUnbufferedConfig(protocol, port), Clock.SYSTEM);
        startRegistryAndWaitForClient();
        Counter counter = Counter.builder("my.counter").register(meterRegistry);

        IntStream.range(0, N)
                .parallel()
                .forEach(ignored -> counter.increment());

        assertThat(serverLatch.await(3, TimeUnit.SECONDS)).isTrue();
    }

It succeeds when the .parallel() line is commented out.

Environment

  • Micrometer version: latest Micrometer from the main branch
  • Micrometer registry: statsd
  • OS: macOS
  • Java version (output of java --version):
    openjdk 11.0.2 2019-01-15
    OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

To Reproduce
How to reproduce the bug:
See the test above

Expected behavior
Test should succeed

Additional context

  • A similar test succeeds for the 1.7.6 release
  • Looks like this change might be related: Upgrade to Reactor 2020.0 #2588
  • Increasing N to 1000 makes the test fail even without .parallel()
@hdv hdv added the bug A general bug label Dec 1, 2021
@shakuzen shakuzen added the registry: statsd A StatsD Registry related issue label Dec 1, 2021
@shakuzen shakuzen added this to the 1.8.x milestone Dec 1, 2021
@shakuzen shakuzen added the regression A regression from a previous release label Dec 1, 2021
@hdv
Contributor Author

hdv commented Dec 1, 2021

Looks like the following change in StatsdCounter makes things work for N=10:

    @Override
    public void increment(double amount) {
        if (!shutdown && amount > 0) {
            count.add(amount);

            // previously: sink.tryEmitNext(lineBuilder.count((long) amount));

            sink.emitNext(lineBuilder.count((long) amount), (signal, failure) -> {
                System.out.println(failure);
                // keep retrying (busy-loop) only when another thread emitted concurrently
                return failure == Sinks.EmitResult.FAIL_NON_SERIALIZED;
            });
        }
    }

The problem is that we emit events to the sink from different threads, and the current behavior of Sinks.many() sinks is to fail fast when emissions are not serialized. Apparently this was different with Processors before.
The danger of the busy-loop approach above is that it may increase CPU load, since it produces lots of retries.

This is all based on https://stackoverflow.com/questions/65379026/reactor-handle-fail-non-serialized-error-from-a-many-sink?noredirect=1&lq=1
Also see reactor/reactor-core#2382
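For comparison, recent reactor-core releases also ship a bounded retry handler, Sinks.EmitFailureHandler.busyLooping(Duration), which retries only FAIL_NON_SERIALIZED and makes emitNext fail once the deadline passes. A minimal sketch of how it could be applied here (an illustration only, not the change adopted in this issue):

    // Sketch: retry FAIL_NON_SERIALIZED emissions for at most 100 ms, then give up
    // instead of spinning indefinitely. Assumes a reactor-core version that
    // provides Sinks.EmitFailureHandler.busyLooping(Duration).
    sink.emitNext(lineBuilder.count((long) amount),
            Sinks.EmitFailureHandler.busyLooping(Duration.ofMillis(100)));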

@hdv
Contributor Author

hdv commented Dec 1, 2021

This was a little tricky to discover, as LogbackMetricsSuppressingManySink masks the problem :-(

    @Override
    public Sinks.EmitResult tryEmitNext(String s) {
        LogbackMetrics.ignoreMetrics(() -> delegate.tryEmitNext(s));
        // We want to silently drop the element if not emitted for any reason
        // We do not use the returned result
        return Sinks.EmitResult.OK;
    }

It claims that everything is fine when it isn't, and it doesn't even debug-log anything.
It should probably be smarter about that.
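A minimal sketch of what a less silent variant could look like (hypothetical; the logger field and the use of AtomicReference are assumptions, not the actual fix):

    @Override
    public Sinks.EmitResult tryEmitNext(String s) {
        // Capture the delegate's result instead of throwing it away
        // (assumes java.util.concurrent.atomic.AtomicReference and a logger field).
        AtomicReference<Sinks.EmitResult> result = new AtomicReference<>(Sinks.EmitResult.OK);
        LogbackMetrics.ignoreMetrics(() -> result.set(delegate.tryEmitNext(s)));
        if (result.get() != Sinks.EmitResult.OK) {
            logger.debug("StatsD line dropped, emit result: " + result.get());
        }
        // Keep the existing contract: callers still see OK and drop silently
        return Sinks.EmitResult.OK;
    }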

@hdv
Contributor Author

hdv commented Dec 1, 2021

Actually, the aforementioned busy-loop solution doesn't really work if I set N to 1000 in the test. Too much busy work is happening.

@shakuzen
Member

shakuzen commented Dec 2, 2021

Thank you for the detailed investigation. That is very helpful. We will discuss with the Reactor team how we should handle this.

@shakuzen shakuzen changed the title Counter samples are dropped when submitted from parallel threads StatsdCounter increments are dropped when submitted from parallel threads Dec 2, 2021
@hdv
Contributor Author

hdv commented Dec 2, 2021

Noticed the title change.
Just to make sure we're on the same page: the problem doesn't seem to be specific to counters, but rather to the way samples are fed to the sink. All StatsD meters are affected.

@shakuzen shakuzen changed the title StatsdCounter increments are dropped when submitted from parallel threads Statsd meter recordings are dropped when submitted from parallel threads Dec 3, 2021
@simonbasle
Contributor

Can you try this as a workaround: change the underlying Sinks.Many<T> flavor to:

Sinks.unsafe().many().unicast().onBackpressureBuffer(
    Queues.unboundedMultiproducer().get());

If I understand correctly, the sink is already subscribed to only once, so unicast is an option.

The above code creates an "unsafe" implementation of the sink, meaning there is no attempt to detect concurrent usage. Instead, it relies on the MPSC properties of the unboundedMultiproducer queue for serialization.
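Pulled together, a self-contained sketch of that workaround could look like this (variable names are illustrative; assumes java.util.Queue, reactor.core.publisher.Sinks, and reactor.util.concurrent.Queues):

    // "Unsafe" sink: no concurrent-usage detection. Thread safety of emissions
    // comes from the MPSC (multi-producer, single-consumer) queue underneath.
    Queue<String> queue = Queues.<String>unboundedMultiproducer().get();
    Sinks.Many<String> sink = Sinks.unsafe().many().unicast().onBackpressureBuffer(queue);

    // The single subscriber (e.g. the line publisher) drains the buffer...
    sink.asFlux().subscribe(line -> { /* write to the UDP/TCP client */ });

    // ...while producers on any thread can emit without hitting FAIL_NON_SERIALIZED.
    sink.tryEmitNext("my.counter:1|c");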

@jonatan-ivanov
Member

For the sake of documentation:

jonatan-ivanov added a commit to jonatan-ivanov/micrometer that referenced this issue Dec 4, 2021
until we figure out what to do.
fixes micrometer-metrics gh-2880
jonatan-ivanov added a commit to jonatan-ivanov/micrometer that referenced this issue Dec 4, 2021
related: micrometer-metrics gh-2880
Co-Authored-By: Denis Khitrik <1179235+hdv@users.noreply.github.com>
@jonatan-ivanov
Member

jonatan-ivanov commented Dec 4, 2021

I moved the code back to the deprecated DirectProcessor. Merging this until we find out what to do can be an option: main...jonatan-ivanov:revert-to-use-processor
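For context, the general shape of the deprecated-processor approach (a sketch of the pattern only, not necessarily the exact code in the linked branch): FluxProcessor.sink() returns a serialized FluxSink, so concurrent next() calls from multiple threads do not step on each other.

    // Deprecated pre-Sinks API: DirectProcessor plus a serialized FluxSink.
    DirectProcessor<String> processor = DirectProcessor.create();

    // The single consumer subscribes first (DirectProcessor does not buffer).
    processor.subscribe(line -> { /* publish the StatsD line */ });

    // sink() gates multi-threaded producers, so next() is safe from any thread.
    FluxSink<String> fluxSink = processor.sink();
    fluxSink.next("my.counter:1|c");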

@shakuzen
Member

shakuzen commented Dec 6, 2021

I've updated the 1.8.0 release notes calling this out as a known issue at the top. We'll work on getting a release out in the next day or two with the fix going back to the DirectProcessor for now. Once we get a snapshot published, could you please try that out and let us know if it looks fixed on your end @hdv?

@hdv
Contributor Author

hdv commented Dec 6, 2021 via email

@shakuzen
Member

shakuzen commented Dec 7, 2021

The 1.8.1-SNAPSHOT now contains the change reverting to the deprecated API. Could you please let us know if it looks alright before we do the release @hdv? Thanks for reporting the issue and all the investigation including a test case. Much appreciated and sorry for the regression.

@shakuzen shakuzen modified the milestones: 1.8.x, 1.8.1 Dec 7, 2021
@hdv
Contributor Author

hdv commented Dec 7, 2021

@shakuzen I'll report back shortly after I try it in a live test system.

However, I'd like to point one more thing out. If we set N equal to 1000 in the test, it still fails for both parallel and sequential cases. Any idea why that could happen?

@hdv
Contributor Author

hdv commented Dec 7, 2021

Also I noticed that TcpClient batches samples disregarding the buffering setting. Is that intentional?

@hdv
Contributor Author

hdv commented Dec 7, 2021

@shakuzen I can see that the 1.8.1-SNAPSHOT behavior is back on par with 1.7.6! Confirmed in my test environment.

Please note that the questions above still stand.

@jonatan-ivanov
Member

@hdv Thank you very much for confirming it!
I'm going to release a new patch round today to fix the most pressing/apparent issue, and we will look into the other issues around this after that.

@jonatan-ivanov
Member

@hdv New patch releases are out, you can use a release version now: https://github.com/micrometer-metrics/micrometer/releases/tag/v1.8.1

Thank you very much for reporting and investigating this!
Next steps: fixing the issues you mentioned above and migrating back to the Sinks API.

@shakuzen
Member

shakuzen commented Dec 8, 2021

However, I'd like to point one more thing out. If we set N equal to 1000 in the test, it still fails for both parallel and sequential cases. Any idea why that could happen?

Yes, we noticed that and have some ideas on why it is happening but need to investigate further. For what it's worth, it is the same behavior on 1.7.x, so it is pre-existing behavior, which is why we're considering it separate from this regression and did not block the patch release on it.

Also I noticed that TcpClient batches samples disregarding the buffering setting. Is that intentional?

Looking at the packet splitting in the TCP case is one of the things we need to look into more.
