
Logrus hook causing performance issues #1422

Open
renta opened this issue May 2, 2023 · 3 comments

@renta

renta commented May 2, 2023

Describe the bug
This is not so much a bug report as an attempt to warn about the harm the logrus logger hook can cause: https://github.com/elastic/apm-agent-go/blob/main/module/apmlogrus/hook.go. In our case, this code produced a 5-second performance penalty on API endpoints that return errors (such as HTTP 404). Moreover, since the hook was part of a custom legacy library, it was hard to find the reason for this behavior. Switching off the tracer integration removed the lag, so the APM agent and tracing were the first suspects.

To Reproduce
Steps to reproduce the behavior:

  1. Switch on the hook.
  2. Add an HTTP middleware or gRPC interceptor that creates a critical (fatal-level) log entry from an error returned by the main server function. I know this is a silly decision, but please keep in mind that I'm talking about legacy code and a legacy library. (See the sketch after these steps.)
  3. Point the agent at a broken APM Server hostname, or at a heavily loaded APM Server, on a busy production service.
  4. Requests that produce errors incur a 5-second performance penalty.
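
For illustration only, here is a minimal sketch (not the actual legacy code) of the kind of middleware that triggers this, assuming the v2 module path go.elastic.co/apm/module/apmlogrus/v2 and a plain net/http server; the route and handler are hypothetical:

```go
package main

import (
	"net/http"

	"github.com/sirupsen/logrus"
	"go.elastic.co/apm/module/apmlogrus/v2"
)

func main() {
	logger := logrus.New()
	// The legacy library logged request errors at fatal level without
	// exiting the process; overriding ExitFunc reproduces that here.
	logger.ExitFunc = func(int) {}
	logger.AddHook(&apmlogrus.Hook{}) // flushes the tracer on fatal entries

	mux := http.NewServeMux()
	mux.HandleFunc("/missing", func(w http.ResponseWriter, r *http.Request) {
		http.NotFound(w, r)
		// This fatal-level entry fires the apmlogrus hook, which blocks the
		// request for up to 5 seconds (FatalFlushTimeout) while it tries to
		// flush to an unreachable or overloaded APM Server.
		logger.WithField("path", r.URL.Path).Fatal("handler returned 404")
	})

	http.ListenAndServe(":8080", mux)
}
```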

Expected behavior
I think this hook should be deprecated and eventually abandoned. My reasons for this:

  • the hook works synchronously, so using it in server middlewares and interceptors slows down responses when the APM Server is busy;
  • because calls to this hook are unpredictable, the flush load it puts on the APM Server is also unpredictable. This can nullify the Elastic team's efforts to optimise communication between the APM Server and the Go agent;
  • the 5-second flush timeout is too long;
  • the hook breaks the isolation between the service's infrastructure subsystems: the logger, for some reason, decides to flush calls to the APM Server, which is quite unexpected behavior;
  • if you need to flush the tracer for some reason, it's better to use the agent's tracer directly, or to handle errors and panics via https://www.elastic.co/guide/en/apm/agent/go/current/custom-instrumentation.html#custom-instrumentation-errors, instead of relying on hooks (see the sketch below).
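
For comparison, a minimal sketch of the approach the last point refers to, assuming the v2 module path go.elastic.co/apm/v2; the reportError helper and the /broken route are made up for the example. The error is enqueued and shipped by the agent's background sender instead of being flushed synchronously from the logger:

```go
package main

import (
	"errors"
	"net/http"

	"go.elastic.co/apm/v2"
)

// reportError records the error against the transaction in the request
// context (populated by the apmhttp middleware, if any) and lets the
// tracer send it asynchronously, keeping the logger out of the picture.
func reportError(w http.ResponseWriter, r *http.Request, err error) {
	apm.CaptureError(r.Context(), err).Send()
	http.Error(w, err.Error(), http.StatusInternalServerError)
}

func main() {
	http.HandleFunc("/broken", func(w http.ResponseWriter, r *http.Request) {
		reportError(w, r, errors.New("something went wrong"))
	})
	http.ListenAndServe(":8080", nil)
}
```
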
@axw
Member

axw commented May 3, 2023

apmlogrus attempts to flush for up to 5 seconds on fatal level log messages:

```go
if entry.Level == logrus.FatalLevel {
	// In its default configuration, logrus will exit the process
	// following a fatal log message, so we flush the tracer.
```
Given that the default behaviour for fatal level logging is for logrus to exit the process, it seems like a reasonable behaviour. You can override the timeout by setting Hook.FatalFlushTimeout:

FatalFlushTimeout time.Duration
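
For example (a sketch, assuming the global logrus logger and the v2 module path go.elastic.co/apm/module/apmlogrus/v2; the 500ms value is just an illustration), the timeout can be shortened when registering the hook:

```go
package main

import (
	"time"

	"github.com/sirupsen/logrus"
	"go.elastic.co/apm/module/apmlogrus/v2"
)

func main() {
	// Cap how long a fatal-level entry may block while the hook flushes
	// the tracer; the default FatalFlushTimeout is 5 seconds.
	logrus.AddHook(&apmlogrus.Hook{
		FatalFlushTimeout: 500 * time.Millisecond,
	})

	logrus.Info("hook registered with a shorter flush timeout")
}
```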

@dmathieu changed the title from "Logrus hook considered harmful" to "Logrus hook causing performance issues" on May 3, 2023
@dmathieu
Member

dmathieu commented May 3, 2023

Thank you for this issue. I just renamed it to have a more meaningful (and less clickbait-y) title.

@renta
Author

renta commented May 3, 2023

I admit that we faced this performance issue because of some silly decisions in an old library and a misconfigured logger. But I also want to point out that it's bad system design to control one infrastructure subsystem from another through hook code like this. For people debugging the same problem, it is not obvious that the root of such a lag is the logger rather than the tracer or the APM agent itself.
