Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with stackdriver in case publish method of StackdriverMeterRegistry is called with too litle time distance #4763

Open
juergencodes opened this issue Feb 20, 2024 · 2 comments
Labels
registry: stackdriver A StackDriver Registry related issue
Milestone

Comments

@juergencodes
Copy link

juergencodes commented Feb 20, 2024

Describe the bug
This is a follow up of #4353. We tried the new version 1.12.3, for which the close now happens in a proper way. However, this revealed a new problem: When StackdriverMeterRegistry is closed, there is an error from GCP that complains about two or more time series with timestamps too close to each other.

com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: global{} timeSeries[49]: custom.googleapis.com/spring/batch/step/active/activeTasks{spring_batch_step_status:EXECUTING,spring_batch_step_name:pubsub,instanceId:20240221074957,application:my-application,environment:develop,spring_batch_step_job_name:my-job}; One or more points were written more frequently than the maximum sampling period configured for the metric.: global{} timeSeries[22]: custom.googleapis.com/spring/batch/step/max{spring_batch_step_name:pubsub,spring_batch_step_status:COMPLETED,instanceId:20240221074957,error:none,application:my-application,environment:develop,spring_batch_step_job_name:my-job};
...

custom.googleapis.com/hikaricp/connections/min{instanceId:20240221074957,application:my-application,environment:develop,pool:HikariPool-1}
	at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92) ~[gax-2.41.0.jar:2.41.0]
	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98) ~[gax-grpc-2.41.0.jar:2.41.0]
	at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66) ~[gax-grpc-2.41.0.jar:2.41.0]
	at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97) ~[gax-grpc-2.41.0.jar:2.41.0]
	at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:84) ~[api-common-2.24.0.jar:2.24.0]
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1127) ~[guava-32.1.3-jre.jar:na]
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31) ~[guava-32.1.3-jre.jar:na]
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1286) ~[guava-32.1.3-jre.jar:na]
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1055) ~[guava-32.1.3-jre.jar:na]
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:807) ~[guava-32.1.3-jre.jar:na]
	at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:568) ~[grpc-stub-1.61.0.jar:1.61.0]
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:538) ~[grpc-stub-1.61.0.jar:1.61.0]
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39) ~[grpc-api-1.61.0.jar:1.61.0]
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23) ~[grpc-api-1.61.0.jar:1.61.0]
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40) ~[grpc-api-1.61.0.jar:1.61.0]
	at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:570) ~[gax-grpc-2.41.0.jar:2.41.0]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:574) ~[grpc-core-1.61.0.jar:1.61.0]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:72) ~[grpc-core-1.61.0.jar:1.61.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742) ~[grpc-core-1.61.0.jar:1.61.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723) ~[grpc-core-1.61.0.jar:1.61.0]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[grpc-core-1.61.0.jar:1.61.0]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[grpc-core-1.61.0.jar:1.61.0]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:1583) ~[na:na]
	Suppressed: com.google.api.gax.rpc.AsyncTaskException: Asynchronous task failed
		at com.google.api.gax.rpc.ApiExceptions.callAndTranslateApiException(ApiExceptions.java:57) ~[gax-2.41.0.jar:2.41.0]
		at com.google.api.gax.rpc.UnaryCallable.call(UnaryCallable.java:112) ~[gax-2.41.0.jar:2.41.0]
		at com.google.cloud.monitoring.v3.MetricServiceClient.createTimeSeries(MetricServiceClient.java:1906) ~[google-cloud-monitoring-3.35.0.jar:3.35.0]
		at io.micrometer.stackdriver.StackdriverMeterRegistry.publish(StackdriverMeterRegistry.java:222) ~[micrometer-registry-stackdriver-1.12.3.jar:1.12.3]
		at io.micrometer.core.instrument.push.PushMeterRegistry.publishSafelyOrSkipIfInProgress(PushMeterRegistry.java:64) ~[micrometer-core-1.12.3.jar:1.12.3]
		at io.micrometer.core.instrument.push.PushMeterRegistry.close(PushMeterRegistry.java:133) ~[micrometer-core-1.12.3.jar:1.12.3]
		at io.micrometer.core.instrument.step.StepMeterRegistry.close(StepMeterRegistry.java:163) ~[micrometer-core-1.12.3.jar:1.12.3]
		at io.micrometer.stackdriver.StackdriverMeterRegistry.close(StackdriverMeterRegistry.java:143) ~[micrometer-registry-stackdriver-1.12.3.jar:1.12.3]
		at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
		at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
		at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:316) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:249) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:587) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:559) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:1202) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:520) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:1195) ~[spring-beans-6.1.3.jar:6.1.3]
		at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1183) ~[spring-context-6.1.3.jar:6.1.3]
		at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1144) ~[spring-context-6.1.3.jar:6.1.3]
		at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1090) ~[spring-context-6.1.3.jar:6.1.3]
		at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145) ~[spring-boot-3.2.2.jar:3.2.2]
		at java.base/java.lang.Iterable.forEach(Iterable.java:75) ~[na:na]
		at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114) ~[spring-boot-3.2.2.jar:3.2.2]
		... 1 common frames omitted

We suspect that this is caused by following reasons:

  • Both overridden close methods of PushMeterRegistry and StepMeterRegistry separately call the publish method with just a few millis in between. While PushMeterRegistry will basically always call publish, StepMeterRegistry will only do in case there are recent poll results (see shouldPublishDataForLastStep). In our setup this was however, almost always the case.
  • Even after there will be a fix to make sure PushMeterRegistry and StepMeterRegistry will call publish only once when their close methods are called, we expect the same issue to happen again, because the call to close may in some situations happen right after the scheduler in PushMeterRegistry fired a publish.

Environment

  • Micrometer version: 1.12.3
  • Micrometer registry: StackdriverMeterRegistry
  • Google Cloud Run
  • Spring Boot 3.2.2

To Reproduce
Simply using StackdriverMeterRegistry will cause the issue.

Expected behavior
Close methods of PushMeterRegistry and StepMeterRegistry should be fixed in a way that the publish method will get called only once. Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish. Probably, this must happen before the local class Batch is instantiated, as this actually sets the timestamp of the time series.

Additional context
This problem only appeared after #4353 was fixed.

@shakuzen shakuzen added registry: stackdriver A StackDriver Registry related issue and removed waiting-for-triage labels Feb 21, 2024
@shakuzen shakuzen added this to the 1.11.x milestone Feb 21, 2024
@shakuzen
Copy link
Member

Thanks for the report. I'm surprised other users didn't report this before when I asked for verification of the fix in #4353. Regardless, this is the kind of thing that could have been caught if we had an integration test setup with Stackdriver. I'll look into the feasibility of getting that setup.

Close methods of PushMeterRegistry and StepMeterRegistry should be fixed in a way that the publish method will get called only once.

We probably would not change those classes to fix a limitation that only affects Stackdriver. The StackdriverMeterRegistry could have its own implementation of close that does not delegate to those parent classes. I'm not immediately sure what it would look like, though.

Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish.

I prefer to not make things configurable unless they need to be and there is enough benefit. It's a global Stackdriver limitation that reported time series can't be within a certain time of each other, right? If so, the minimum time is fixed and doesn't need to be configurable. That said, I don't especially like the idea of sleeping while something is waiting for the registry to close, but if it is short enough and there isn't a better way to solve this, it may be unavoidable. What is the minimum time difference required?

@juergencodes
Copy link
Author

We probably would not change those classes to fix a limitation that only affects Stackdriver. The StackdriverMeterRegistry could have its own implementation of close that does not delegate to those parent classes. I'm not immediately sure what it would look like, though.

That's fair enough. My thought was most targeting towards if this double publish could potentially lead to issues for other technology adapters that I of course do not overview. I'm a bit skeptical with respect to an own implementation of close in StackdriverMeterRegistry, because you would again have to deal with the things that the super classes do, including the stuff in MeterRegistry. To my understanding, it would then make sense to follow only the path that StackdriverMeterRegistry makes sure to avoid time series with too little distance - no matter if it resulted from the two publish while closing, timing issues as described above or any other reason. I will change the title of the issue likewise.

Moreoever, there should be a new config parameter in StackdriverConfig to specify the minimum time between two calls to its publish method. In case there is a call before the configured amount of time has passed, the method should sleep before doing the publish.

I prefer to not make things configurable unless they need to be and there is enough benefit. It's a global Stackdriver limitation that reported time series can't be within a certain time of each other, right? If so, the minimum time is fixed and doesn't need to be configurable. That said, I don't especially like the idea of sleeping while something is waiting for the registry to close, but if it is short enough and there isn't a better way to solve this, it may be unavoidable. What is the minimum time difference required?

I totally second to avoid unnecessary config options. However, as far as I know, the minimum time is configurable in GCP. For our setup I got the value of 5 s. I will talk with the team later to see if we have more meaningful information on this.

PS: While trying workarounds for this issue, we of course thought of subclassing StackdriverMeterRegistry to add the sleep logic in an override implementation. However, this does not work for the fact that StackdriverMeterRegistry most important constructor is private and available to the builder only. Hence, we were not able to instantiate it with MetricServiceSettings. In case you do not see anything that contradicts making the second constructor protected - or provide another one with the MetricServiceSettings, this would simplify working around future issues.

@juergencodes juergencodes changed the title Close methods of PushMeterRegistry and StepMeterRegistry are causing multiple calls to publish Issue with stackdriver in case publish method of StackdriverMeterRegistry is called with too litle time distance Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
registry: stackdriver A StackDriver Registry related issue
Projects
None yet
Development

No branches or pull requests

2 participants