Linkerd CPU Hotspots and Thread Usage #2382

Open
j0sh3rs opened this issue Mar 26, 2020 · 6 comments

Comments

j0sh3rs (Contributor) commented Mar 26, 2020

Issue Type:

  • Bug report

What happened:
After roughly a week of running performance-testing load through Linkerd (via JMeter), we hit a case where Linkerd sees a sharp increase in CPU usage and a jump in thread count, primarily related to netty UnboundedFuturePool usage:

$ grep "UnboundedFuturePool" *json
threads.json:      "thread" : "UnboundedFuturePool-419",
threads.json:      "thread" : "UnboundedFuturePool-421",
threads.json:      "thread" : "UnboundedFuturePool-420",
threads.json:      "thread" : "UnboundedFuturePool-418",
threads.json:      "thread" : "UnboundedFuturePool-416",
threads.json:      "thread" : "UnboundedFuturePool-422",
threads_10-77-89-88.json:      "thread" : "UnboundedFuturePool-20",
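
For reference, a rough sketch of how such a snapshot can be pulled and counted; it assumes the standard TwitterServer admin interface on the admin port (9990 in the config below) exposes a /admin/threads endpoint, so adjust the path if your build differs:

$ curl -s http://localhost:9990/admin/threads > threads.json   # endpoint path is an assumption
$ grep -c '"thread" : "UnboundedFuturePool' threads.json       # count pool threads in the snapshot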

When sampling the CPU profiles, the hotspots look like this:

Total: 25266 samples
    7771  30.8%  30.8%     7771  30.8% io.netty.channel.epoll.Native.epollWait0
    1971   7.8%  38.6%     1971   7.8% sun.misc.Unsafe.getInt
    1654   6.5%  45.1%     1654   6.5% java.lang.Thread.currentThread
     884   3.5%  48.6%     1007   4.0% com.twitter.finagle.http.DefaultHeaderMap$Headers.elemHashCode
     856   3.4%  52.0%      856   3.4% java.lang.String.charAt
     811   3.2%  55.2%     1326   5.2% com.twitter.finagle.http.Rfc7230HeaderValidation$.validateValue
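
A sketch of one way to collect such a profile, assuming the standard TwitterServer /admin/pprof/profile endpoint; the seconds parameter and the exact pprof invocation are illustrative and depend on which pprof tool you have installed:

$ curl -s "http://localhost:9990/admin/pprof/profile?seconds=60" -o linkerd.prof   # CPU samples in pprof format
$ pprof -text linkerd.prof | head                                                  # flat hotspot table as above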

Only after restarting Linkerd (by patching the DaemonSet pods) does the issue resolve, only for it to reappear later.
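
The restart is roughly the following (a sketch; the DaemonSet name and namespace are placeholders, and the annotation patch is just one way to force a rollout):

$ kubectl -n ping-services patch daemonset l5d \
    -p '{"spec":{"template":{"metadata":{"annotations":{"restartedAt":"'"$(date +%s)"'"}}}}}'   # bumping an annotation recreates the pods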

What you expected to happen:
Linkerd's thread and CPU usage remain appropriate for the load it is receiving.

How to reproduce it (as minimally and precisely as possible):
Run a nightly JMeter load test for 7-10 days. Note: the issue is also observed in an environment where no JMeter test runs, which suggests it is not specifically tied to JMeter usage.

Anything else we need to know?:
We attempted to work around the issue, suspecting it could be related to #2268, but still saw the same behavior while running with BiasedLocking enabled.
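
For reference, a quick way to confirm which way the flag is actually set on the running linkerd JVM (a sketch; it assumes a JDK with jinfo available inside the pod):

$ PID=$(pgrep -f linkerd | head -n 1)
$ jinfo -flag UseBiasedLocking "$PID"   # prints -XX:+UseBiasedLocking or -XX:-UseBiasedLocking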

Some core configuration of our JMeter setup includes:

httpclient.reset_state_on_thread_group_iteration=false
httpclient4.validate_after_inactivity=66600
httpclient4.time_to_live=70000

with "Use KeepAlive" checked on the jobs.
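
These are plain JMeter properties, so a nightly non-GUI run looks roughly like this (the test plan and output file names are placeholders, not our actual plan):

$ jmeter -n -t loadtest.jmx -l results.jtl \
    -Jhttpclient.reset_state_on_thread_group_iteration=false \
    -Jhttpclient4.validate_after_inactivity=66600 \
    -Jhttpclient4.time_to_live=70000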

Environment:

  • linkerd/namerd version, config files:
    Linkerd 1.7.1 (running on the default Java 8)
    Config (see the metrics sketch after the environment list):
    admin:
      port: 9990
      ip: 0.0.0.0
      socketOptions:
        reusePort: true

    namers:
    - kind: io.l5d.k8s
      host: localhost
      port: 8001

    telemetry:
    - kind: io.l5d.prometheus

    routers:
    - protocol: http
      label: path
      streamAfterContentLengthKB: 1024
      streamingEnabled: true
      client:
        failureAccrual:
          kind: none
        hostConnectionPool:
          minSize: 5
        requeueBudget:
          percentCanRetry: 20.0
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: linkerd-dtabs
        namespace: ping-services
        filename: dtab
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true
      servers:
      - port: 4140
        ip: 0.0.0.0
        clearContext: true
        socketOptions:
          reusePort: true
      service:
        responseClassifier:
          kind: io.l5d.http.retryableRead5XX

    - protocol: http
      label: path-tls
      streamAfterContentLengthKB: 1024
      streamingEnabled: true
      client:
        failureAccrual:
          kind: none
        hostConnectionPool:
          minSize: 5
        requeueBudget:
          percentCanRetry: 20.0
      interpreter:
        kind: io.l5d.k8s.configMap
        experimental: true
        name: linkerd-dtabs
        namespace: ping-services
        filename: dtab
      identifier:
        kind: io.l5d.path
        segments: 1
        consume: true
      servers:
      - port: 4141
        ip: 0.0.0.0
        clearContext: true
        socketOptions:
          reusePort: true
        tls:
          certPath: /certificates/tls.crt
          keyPath: /certificates/tls.key
  • Platform, version, and config files (Kubernetes, DC/OS, etc):
    Kubernetes 1.15.9 running on Ubuntu 16.04
  • Cloud provider or hardware configuration:
    AWS m5.4xlarge instance type with EBS optimizations
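
Since the io.l5d.prometheus telemeter is enabled in the config above, the thread growth is also visible without a full thread dump; a minimal sketch, assuming the standard /admin/metrics/prometheus path and the jvm:thread:count gauge that Finagle exports (exact metric spelling may differ by version):

$ watch -n 60 'curl -s http://localhost:9990/admin/metrics/prometheus | grep "^jvm:thread"'   # thread-count gauges over time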

j0sh3rs (Contributor, Author) commented Mar 26, 2020

@cpretzer since you were helping us earlier this year :)

cpretzer (Contributor) commented:

Thanks @j0sh3rs, I'll have a look!

j0sh3rs (Contributor, Author) commented Jul 23, 2020

@adleong @cpretzer any chance this would've been solved by the recent netty, node and finagle updates from 1.7.3?

cpretzer (Contributor) commented:

@j0sh3rs we've been looking into whether the recent netty and finagle updates would address this issue.

So far, I haven't been able to get a test environment running to reproduce. Can you tell me more about the jmeter tests? Are they hitting your application in a scripted way? Or do they just throw load at the application?

j0sh3rs (Contributor, Author) commented Aug 28, 2020

@cpretzer unfortunately, I've changed roles and am no longer with Ping Identity, so I no longer have the context to troubleshoot the JMeter behaviors. I'm not sure who, if anyone, has taken this over from me, so this may go stale and should probably be closed.

cpretzer (Contributor) commented:

@j0sh3rs thanks for the update! I hope your new role is going well
