
Logrus hook causing performance issues #1422

Open
renta opened this issue May 2, 2023 · 3 comments

@renta

renta commented May 2, 2023

Describe the bug
This is not so much a bug report as an attempt to warn about the harm the logrus logger hook can cause: https://github.com/elastic/apm-agent-go/blob/main/module/apmlogrus/hook.go. In our case, this code produced a 5-second performance penalty on API endpoints that return errors (such as HTTP 404). Moreover, since the hook was part of a custom legacy library, it was hard to find the reason for this behavior. Switching off the tracer integration removed the lag, so the APM agent and tracing were the first suspects.

To Reproduce
Steps to reproduce the behavior:

  1. Switch on the hook.
  2. Add an HTTP middleware or gRPC interceptor that creates a critical (fatal-level) log entry from an error returned by the main server function. I know this is a silly decision, but please keep in mind that I'm talking about legacy code and a legacy library. (See the sketch after these steps.)
  3. Point the agent at a broken APM Server hostname, or at a heavily loaded APM Server, on a busy production service.
  4. Requests that produce errors incur a 5-second performance penalty.
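
For illustration only, here is a minimal sketch (not the actual legacy code) of the kind of middleware that triggers this, assuming the v2 module path go.elastic.co/apm/module/apmlogrus/v2 and a plain net/http server; the route and handler are hypothetical:

```go
package main

import (
	"net/http"

	"github.com/sirupsen/logrus"
	"go.elastic.co/apm/module/apmlogrus/v2"
)

func main() {
	logger := logrus.New()
	// The legacy library logged request errors at fatal level without
	// exiting the process; overriding ExitFunc reproduces that here.
	logger.ExitFunc = func(int) {}
	logger.AddHook(&apmlogrus.Hook{}) // flushes the tracer on fatal entries

	mux := http.NewServeMux()
	mux.HandleFunc("/missing", func(w http.ResponseWriter, r *http.Request) {
		http.NotFound(w, r)
		// This fatal-level entry fires the apmlogrus hook, which blocks the
		// request for up to 5 seconds (FatalFlushTimeout) while it tries to
		// flush to an unreachable or overloaded APM Server.
		logger.WithField("path", r.URL.Path).Fatal("handler returned 404")
	})

	http.ListenAndServe(":8080", mux)
}
```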

Expected behavior
I think this hook should be deprecated and eventually abandoned. My reasons for this:

  • the hook works synchronously, so using it in server middlewares and interceptors slows down responses when the APM Server is busy;
  • because calls to this hook are unpredictable, the flush load it puts on the APM Server is also unpredictable. This can nullify the Elastic team's efforts to optimise communication between the APM Server and the Go agent;
  • the 5-second flush timeout is too long;
  • the hook breaks the isolation between the service's infrastructure subsystems: the logger, for some reason, decides to flush calls to the APM Server, which is quite unexpected behavior;
  • if you need to flush the tracer for some reason, it's better to use the agent's tracer directly, or to handle errors and panics via https://www.elastic.co/guide/en/apm/agent/go/current/custom-instrumentation.html#custom-instrumentation-errors, instead of relying on hooks (see the sketch below).
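
For comparison, a minimal sketch of the approach the last point refers to, assuming the v2 module path go.elastic.co/apm/v2; the reportError helper and the /broken route are made up for the example. The error is enqueued and shipped by the agent's background sender instead of being flushed synchronously from the logger:

```go
package main

import (
	"errors"
	"net/http"

	"go.elastic.co/apm/v2"
)

// reportError records the error against the transaction in the request
// context (populated by the apmhttp middleware, if any) and lets the
// tracer send it asynchronously, keeping the logger out of the picture.
func reportError(w http.ResponseWriter, r *http.Request, err error) {
	apm.CaptureError(r.Context(), err).Send()
	http.Error(w, err.Error(), http.StatusInternalServerError)
}

func main() {
	http.HandleFunc("/broken", func(w http.ResponseWriter, r *http.Request) {
		reportError(w, r, errors.New("something went wrong"))
	})
	http.ListenAndServe(":8080", nil)
}
```
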
@axw
Member

axw commented May 3, 2023

apmlogrus attempts to flush for up to 5 seconds on fatal level log messages:

```go
if entry.Level == logrus.FatalLevel {
	// In its default configuration, logrus will exit the process
	// following a fatal log message, so we flush the tracer.
```
Given that the default behaviour for fatal level logging is for logrus to exit the process, it seems like a reasonable behaviour. You can override the timeout by setting Hook.FatalFlushTimeout:

FatalFlushTimeout time.Duration
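
For example (a sketch, assuming the global logrus logger and the v2 module path go.elastic.co/apm/module/apmlogrus/v2; the 500ms value is just an illustration), the timeout can be shortened when registering the hook:

```go
package main

import (
	"time"

	"github.com/sirupsen/logrus"
	"go.elastic.co/apm/module/apmlogrus/v2"
)

func main() {
	// Cap how long a fatal-level entry may block while the hook flushes
	// the tracer; the default FatalFlushTimeout is 5 seconds.
	logrus.AddHook(&apmlogrus.Hook{
		FatalFlushTimeout: 500 * time.Millisecond,
	})

	logrus.Info("hook registered with a shorter flush timeout")
}
```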

@dmathieu changed the title from "Logrus hook considered harmful" to "Logrus hook causing performance issues" on May 3, 2023
@dmathieu
Member

dmathieu commented May 3, 2023

Thank you for this issue. I just renamed it to have a more meaningful (and less clickbait-y) title.

@renta
Author

renta commented May 3, 2023

I admit that we faced this performance issue because of some silly decisions in an old library and a misconfigured logger. But I also want to point out that it's bad system design to control one infrastructure subsystem from another through hook code like this. For people debugging the same problem, it is not obvious that the root of such a lag is the logger rather than the tracer or the APM agent itself.
