proxy: Apparent connection leak #5334

Closed
olix0r opened this issue Dec 4, 2020 · 2 comments · Fixed by linkerd/linkerd2-proxy#768
olix0r commented Dec 4, 2020

I've been running a burn-in/load test that highlights what appears to be a memory leak in recent proxies.

The load generators are configured with:

  • 5 threads
  • A maximum of 5000 requests per second
  • A maximum of 20 requests in-flight at a time
  • Targeting a service with 3 server instances

Over 8 hours, we observe sustained RSS growth for both the HTTP/1 load generator and the target servers (which serve both HTTP/1 and gRPC traffic). The gRPC load generator does not exhibit this behavior. The HTTP/1 load generator's rate of growth appears to slow over time, but it doesn't appear to fully stabilize.

[image: RSS growth for the HTTP/1 load generator and target servers over the 8-hour run]

olix0r added the area/proxy and priority/P0 Release Blocker labels on Dec 4, 2020
olix0r added this to To do in 2.10 - backlog via automation on Dec 4, 2020

olix0r commented Dec 4, 2020

This issue persists with linkerd/linkerd2-proxy#732 and may even be slightly worse (though this may just be variance between test runs).

As suspected, this appears to correlate directly with the number of connections the proxy maintains:

[image: proxy memory usage tracking the number of open connections]

This basically confirms that #4900 is still a legitimate issue.


olix0r commented Dec 7, 2020

It turns out that this is actually a connection leak, which manifests as increased memory usage.

olix0r changed the title from "proxy: Apparent memory leak" to "proxy: Apparent connection leak" on Dec 7, 2020
olix0r added this to To do in 2.9.1 via automation on Dec 7, 2020
olix0r self-assigned this on Dec 8, 2020
olix0r removed this from To do in 2.9.1 on Dec 10, 2020
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 10, 2020
Middlewares, especially the cache, may want to use RAII to detect when a
service is idle, but the TCP server and HTTP routers drop services as
soon as the request is dispatched.

This change modifies the HTTP server to hold the TCP stack until all
work on that connection is complete. It also introduces a new
`http::Retain` middleware that clones its inner service into response
bodies.

This is necessary for upcoming cache eviction changes to address
linkerd/linkerd2#5334.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Middlewares, especially the cache, may want to use RAII to detect when a
service is idle, but the TCP server and HTTP routers drop services as
soon as the request is dispatched.

This change modifies the TCP server to hold the TCP stack until all
work on that connection is complete. It also introduces a new
`http::Retain` middleware that clones its inner service into response
bodies.

This is necessary for upcoming cache eviction changes to address
linkerd/linkerd2#5334.
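
A minimal, framework-free sketch of the RAII pattern this commit describes (the `RetainBody` and `IdleGuard` types below are illustrative stand-ins, not linkerd2-proxy's actual `http::Retain` implementation): the response body carries a clone of the inner service, so the service, and any idle-tracking handle it owns, is released only once all work on the response is complete.

```rust
/// Pairs a response body with a clone of the service that produced it.
/// The clone is never called; it exists purely so that dropping the body
/// is what finally drops the service (RAII).
struct RetainBody<B, S> {
    body: B,
    _retained: S, // held only for its Drop side effects
}

/// Stand-in for a service whose Drop is how a cache would notice idleness.
struct IdleGuard(&'static str);

impl Drop for IdleGuard {
    fn drop(&mut self) {
        println!("service {} released; idle tracking may begin", self.0);
    }
}

fn main() {
    let svc = IdleGuard("outbound-stack");
    // Without retention, `svc` would be dropped as soon as the request was
    // dispatched; here it lives exactly as long as the response body.
    let rsp = RetainBody { body: b"hello".to_vec(), _retained: svc };
    println!("streaming {} bytes", rsp.body.len());
    // `rsp` is dropped here: only now is the retained service released.
}
```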
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests.

Fixes linkerd/linkerd2#5334
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests.

Fixes linkerd/linkerd2#5334
2.10 - backlog automation moved this from To do to Done Dec 12, 2020
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 12, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests. Furthermore, it ultimately eliminates the need for a specialized
buffer implementation.

Fixes linkerd/linkerd2#5334
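
A hedged sketch of this eviction strategy, assuming the `tokio` crate; the `Cache`/`Cached` types, the `Arc<()>` handle, and the strong-count check below are illustrative stand-ins for the proxy's actual tracking machinery. Each cached value is handed out in a wrapper that shares a handle with its cache entry; dropping the wrapper spawns a task that retains the handle for the idle timeout and evicts the entry only if no new wrapper acquired the handle in the meantime.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

type Entries<V> = Arc<Mutex<HashMap<String, (V, Arc<()>)>>>;

/// A cached value plus the bookkeeping needed to evict it once it goes idle.
struct Cached<V: Clone + Send + 'static> {
    value: V,
    key: String,
    handle: Arc<()>, // shared with the cache entry (and any other wrappers)
    entries: Entries<V>,
    idle: Duration,
}

impl<V: Clone + Send + 'static> Drop for Cached<V> {
    fn drop(&mut self) {
        let (key, handle, entries, idle) =
            (self.key.clone(), self.handle.clone(), self.entries.clone(), self.idle);
        // Retain the handle for the idle timeout; evict only if nothing else
        // re-acquired it while we waited.
        tokio::spawn(async move {
            tokio::time::sleep(idle).await;
            // Strong count 2 == only the cache entry and this task hold the
            // handle, i.e. no new `Cached` wrapper was handed out meanwhile.
            if Arc::strong_count(&handle) == 2 {
                entries.lock().unwrap().remove(&key);
                println!("evicted {}", key);
            }
        });
    }
}

struct Cache<V: Clone + Send + 'static> {
    entries: Entries<V>,
    idle: Duration,
}

impl<V: Clone + Send + 'static> Cache<V> {
    fn new(idle: Duration) -> Self {
        Self { entries: Arc::new(Mutex::new(HashMap::new())), idle }
    }

    /// Get-or-build an entry, returning it wrapped in a tracking handle.
    fn get_or_insert_with(&self, key: &str, mk: impl FnOnce() -> V) -> Cached<V> {
        let mut entries = self.entries.lock().unwrap();
        let (value, handle) = entries
            .entry(key.to_string())
            .or_insert_with(|| (mk(), Arc::new(())))
            .clone();
        Cached {
            value,
            key: key.to_string(),
            handle,
            entries: self.entries.clone(),
            idle: self.idle,
        }
    }
}

#[tokio::main]
async fn main() {
    let cache = Cache::new(Duration::from_millis(100));
    {
        let svc = cache.get_or_insert_with("web.svc.cluster.local", || "stack");
        println!("using {}", svc.value);
    } // last wrapper dropped: the idle timer starts now
    tokio::time::sleep(Duration::from_millis(200)).await;
    assert!(cache.entries.lock().unwrap().is_empty());
}
```

The strong-count check is just the simplest way to express "no new instances acquired the handle" in a sketch; a generation counter or a `Weak` upgrade would express the same condition.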
olix0r added this to the stable-2.10 milestone on Jan 11, 2021
github-actions bot locked as resolved and limited conversation to collaborators on Jul 16, 2021