proxy: Apparent connection leak #5334

Closed
olix0r opened this issue Dec 4, 2020 · 2 comments · Fixed by linkerd/linkerd2-proxy#768
olix0r commented Dec 4, 2020

I've been running a burn-in/load test that highlights what appears to be a memory leak in recent proxies.

The load generators are configured with:

  • 5 threads
  • A maximum of 5000 requests per second
  • A maximum of 20 requests in-flight at a time
  • Targeting a service with 3 server instances

Over 8 hours, we observe sustained RSS growth for both the HTTP/1 load generator and the target servers (which serve both HTTP/1 and gRPC traffic). The gRPC load generator does not exhibit this behavior. The HTTP/1 load generator's rate of growth appears to slow over time, but it doesn't appear to fully stabilize.

[image: RSS growth for the HTTP/1 load generator and target servers over the 8-hour run]

olix0r added the area/proxy and priority/P0 Release Blocker labels on Dec 4, 2020
olix0r added this to To do in 2.10 - backlog via automation on Dec 4, 2020

olix0r commented Dec 4, 2020

This issue persists with linkerd/linkerd2-proxy#732 and may even be slightly worse (though this may just be variance between test runs).

As suspected, this appears to correlate directly with the number of connections the proxy maintains:

[image: proxy memory usage tracking the number of open connections]

This basically confirms that #4900 is still a legitimate issue.


olix0r commented Dec 7, 2020

It turns out that this is actually a connection leak, which manifests as increased memory usage.

olix0r changed the title from "proxy: Apparent memory leak" to "proxy: Apparent connection leak" on Dec 7, 2020
olix0r added this to To do in 2.9.1 via automation on Dec 7, 2020
olix0r self-assigned this on Dec 8, 2020
olix0r removed this from To do in 2.9.1 on Dec 10, 2020
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 10, 2020
Middlewares, especially the cache, may want to use RAII to detect when a
service is idle, but the TCP server and HTTP routers drop services as
soon as the request is dispatched.

This change modifies the HTTP server to hold the TCP stack until all
work on that connection is complete. It also introduces a new
`http::Retain` middleware that clones its inner service into response
bodies.

This is necessary for upcoming cache eviction changes to address
linkerd/linkerd2#5334.
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Middlewares, especially the cache, may want to use RAII to detect when a
service is idle, but the TCP server and HTTP routers drop services as
soon as the request is dispatched.

This change modifies the TCP server to hold the TCP stack until all
work on that connection is complete. It also introduces a new
`http::Retain` middleware that clones its inner service into response
bodies.

This is necessary for upcoming cache eviction changes to address
linkerd/linkerd2#5334.
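
A minimal, framework-free sketch of the RAII pattern this commit describes (the `RetainBody` and `IdleGuard` types below are illustrative stand-ins, not linkerd2-proxy's actual `http::Retain` implementation): the response body carries a clone of the inner service, so the service, and any idle-tracking handle it owns, is released only once all work on the response is complete.

```rust
/// Pairs a response body with a clone of the service that produced it.
/// The clone is never called; it exists purely so that dropping the body
/// is what finally drops the service (RAII).
struct RetainBody<B, S> {
    body: B,
    _retained: S, // held only for its Drop side effects
}

/// Stand-in for a service whose Drop is how a cache would notice idleness.
struct IdleGuard(&'static str);

impl Drop for IdleGuard {
    fn drop(&mut self) {
        println!("service {} released; idle tracking may begin", self.0);
    }
}

fn main() {
    let svc = IdleGuard("outbound-stack");
    // Without retention, `svc` would be dropped as soon as the request was
    // dispatched; here it lives exactly as long as the response body.
    let rsp = RetainBody { body: b"hello".to_vec(), _retained: svc };
    println!("streaming {} bytes", rsp.body.len());
    // `rsp` is dropped here: only now is the retained service released.
}
```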
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests.

Fixes linkerd/linkerd2#5334
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 11, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests.

Fixes linkerd/linkerd2#5334
2.10 - backlog automation moved this from To do to Done Dec 12, 2020
olix0r added a commit to linkerd/linkerd2-proxy that referenced this issue Dec 12, 2020
Cache eviction is currently triggered when a service has not processed
new requests for some idle timeout. This is fragile for caches that may
process a few long-lived requests. In practice, we would prefer to only
start tracking idleness when there are no *active* requests on a
service.

This change restructures the cache to return services that are wrapped
with tracking handles, rather than passing a tracking handle into the
inner service. When a returned service is dropped, it spawns a
background task that retains the handle for an idle timeout and, if no
new instances have acquired the handle within that timeout, removes the
service from the cache.

This change reduces latency as well as CPU and memory utilization in
load tests. Furthermore, it ultimately eliminates the need for a specialized
buffer implementation.

Fixes linkerd/linkerd2#5334
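
A hedged sketch of this eviction strategy, assuming the `tokio` crate; the `Cache`/`Cached` types, the `Arc<()>` handle, and the strong-count check below are illustrative stand-ins for the proxy's actual tracking machinery. Each cached value is handed out in a wrapper that shares a handle with its cache entry; dropping the wrapper spawns a task that retains the handle for the idle timeout and evicts the entry only if no new wrapper acquired the handle in the meantime.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

type Entries<V> = Arc<Mutex<HashMap<String, (V, Arc<()>)>>>;

/// A cached value plus the bookkeeping needed to evict it once it goes idle.
struct Cached<V: Clone + Send + 'static> {
    value: V,
    key: String,
    handle: Arc<()>, // shared with the cache entry (and any other wrappers)
    entries: Entries<V>,
    idle: Duration,
}

impl<V: Clone + Send + 'static> Drop for Cached<V> {
    fn drop(&mut self) {
        let (key, handle, entries, idle) =
            (self.key.clone(), self.handle.clone(), self.entries.clone(), self.idle);
        // Retain the handle for the idle timeout; evict only if nothing else
        // re-acquired it while we waited.
        tokio::spawn(async move {
            tokio::time::sleep(idle).await;
            // Strong count 2 == only the cache entry and this task hold the
            // handle, i.e. no new `Cached` wrapper was handed out meanwhile.
            if Arc::strong_count(&handle) == 2 {
                entries.lock().unwrap().remove(&key);
                println!("evicted {}", key);
            }
        });
    }
}

struct Cache<V: Clone + Send + 'static> {
    entries: Entries<V>,
    idle: Duration,
}

impl<V: Clone + Send + 'static> Cache<V> {
    fn new(idle: Duration) -> Self {
        Self { entries: Arc::new(Mutex::new(HashMap::new())), idle }
    }

    /// Get-or-build an entry, returning it wrapped in a tracking handle.
    fn get_or_insert_with(&self, key: &str, mk: impl FnOnce() -> V) -> Cached<V> {
        let mut entries = self.entries.lock().unwrap();
        let (value, handle) = entries
            .entry(key.to_string())
            .or_insert_with(|| (mk(), Arc::new(())))
            .clone();
        Cached {
            value,
            key: key.to_string(),
            handle,
            entries: self.entries.clone(),
            idle: self.idle,
        }
    }
}

#[tokio::main]
async fn main() {
    let cache = Cache::new(Duration::from_millis(100));
    {
        let svc = cache.get_or_insert_with("web.svc.cluster.local", || "stack");
        println!("using {}", svc.value);
    } // last wrapper dropped: the idle timer starts now
    tokio::time::sleep(Duration::from_millis(200)).await;
    assert!(cache.entries.lock().unwrap().is_empty());
}
```

The strong-count check is just the simplest way to express "no new instances acquired the handle" in a sketch; a generation counter or a `Weak` upgrade would express the same condition.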
olix0r added this to the stable-2.10 milestone on Jan 11, 2021
github-actions bot locked as resolved and limited conversation to collaborators on Jul 16, 2021