
Figure out why LockFreePool appears to cause unintended object retention (~= memory leak) #1260

Closed
cowtowncoder opened this issue Apr 15, 2024 · 13 comments


@cowtowncoder (Member) commented Apr 15, 2024

Based on reported problems like:

it looks as if the new recycler pool default (see #1117) -- Lock-free Pool -- results, at least in some cases, in unbounded memory retention that grows over time. From the symptoms it would appear that actual releases to the pool happen more often than successful reuse from the pool; otherwise the two operations should generally balance out over time (resulting in a maximum pool size relative to maximum concurrent use).

We need to figure out what could cause such behavior, or, failing that, what the root cause of the observed problems is.

In the meantime we also will need to do #1256 to allow use of 2.17(.1) without this issue.

@cowtowncoder (Member Author)

/cc @mariofusco @pjfanning -- I hope to learn whether there is some specific usage scenario, and to create a multi-threaded reproduction. I suspect that reuse from the pool is somehow failing, but perhaps not on every call. jackson-benchmarks, for example, does not exhibit this problem: but it also does not use heavy concurrency by default, and its tests do not run long enough to necessarily accumulate humongous buffers.

Also: it is interesting that the one concern I did have wrt unbounded pools seems to have ended up happening.
So it feels like my insistence on ideally having a maximum size to avoid this problem (although doing so might also hide an issue) seems warranted. For whatever that is worth.

@mariofusco (Contributor)

Hi @cowtowncoder, I would like to investigate this memory issue, but at the moment I have no clue how to reproduce it on my side. I also asked here and I'm waiting for an answer. I'm reviewing the implementation again, and in all honesty it seems quite straightforward, so at the moment I don't know what to do about it. As for making the LockFree pool bounded, that is of course feasible, but I'm against it for a few reasons:

  1. In general I don't like blind fixes
  2. In this specific case, making the pool bounded won't help us investigate the problem; it will only sweep it under the carpet
  3. Making that pool bounded won't be free, however: it both adds further complexity and introduces a performance cost
  4. We already have a bounded pool implementation, with performance very similar to the lock-free one, so if we want a bounded pool as the default implementation, let's just use it

@mariofusco (Contributor)

On second thought, I now see an admittedly weird situation where the LockFree pool can grow indefinitely. That pool has a heuristic to limit the number of retries under heavy contention: when contention is detected, acquire gives up attempting to retrieve a resource from the pool and creates a new one instead. However, this heuristic is implemented symmetrically on both acquire and release, so on release it likewise gives up trying to put a resource back into the pool after a few failed attempts. In essence the pool will grow if there is a large asymmetry in its usage, i.e. when acquire is always under heavy contention but release isn't. I'm not sure how to deal with this situation, to be honest, but I would still like to see it happen in the wild before deciding on any action.
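
To make the asymmetry concrete, here is a minimal sketch of that kind of retry heuristic (hypothetical names and constants; the real LockFreePool in jackson-core differs in details). Every acquire that gives up allocates a fresh object; if the matching release then succeeds, the pool ends up holding one more object than before:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.Supplier;

    // Hypothetical sketch of a lock-free stack pool with a bounded-retry heuristic.
    final class RetryLimitedPool<T> {
        private static final int MAX_RETRIES = 3;
        private final AtomicReference<Node<T>> head = new AtomicReference<>();
        private final Supplier<T> factory;

        RetryLimitedPool(Supplier<T> factory) { this.factory = factory; }

        T acquire() {
            for (int i = 0; i < MAX_RETRIES; i++) {
                Node<T> h = head.get();
                if (h == null) {
                    return factory.get();        // pool empty: allocate a new object
                }
                if (head.compareAndSet(h, h.next)) {
                    return h.value;              // reused from the pool
                }
                // lost the CAS race: contention, retry
            }
            return factory.get();                // gave up under contention: allocate instead of reusing
        }

        void release(T value) {
            Node<T> node = new Node<>(value);
            for (int i = 0; i < MAX_RETRIES; i++) {
                Node<T> h = head.get();
                node.next = h;
                if (head.compareAndSet(h, node)) {
                    return;                      // returned to the pool
                }
            }
            // gave up on release: the object is simply dropped
        }

        private static final class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }
    }

If acquire is contended (so it allocates) while the matching release succeeds, each such round trip leaves one extra object in the pool; only when release also gives up does the imbalance stop.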

@cowtowncoder (Member Author)

@mariofusco Your thoughts align remarkably closely with my thinking :)
(not that this guarantees we are right, but it is a good sanity check, I think)

As to reproduction, what I was thinking was indeed running a multi-threaded scenario with N threads, reading and writing, against a modified copy of the pool that allows accessing its size, printing the size periodically and checking whether it keeps growing.
One unfortunate aspect is that any "lost" acquire adds to the size; and although in some use cases the opposite could also occur (a release that never happens on some code path -- I assume that is possible too), there are probably cases where release works reliably, so even a small incremental imbalance will lead to increasing retention over time.
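
A rough sketch of that kind of harness (assumptions: the JsonRecyclerPools factory and the recyclerPool() builder method as in 2.16+, plus a pooledCount() accessor which, per the discussion below, only exists from 2.18 or a patched 2.17; with older versions a modified pool copy exposing its size would be needed):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.util.BufferRecycler;
    import com.fasterxml.jackson.core.util.JsonRecyclerPools;
    import com.fasterxml.jackson.core.util.RecyclerPool;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical stress test: N threads read and write concurrently while the main
    // thread samples the pool size once per second to see whether it keeps growing.
    public class PoolGrowthRepro {
        public static void main(String[] args) throws Exception {
            final int threads = 100;
            // Configure the pool explicitly so we keep a reference we can inspect.
            RecyclerPool<BufferRecycler> pool = JsonRecyclerPools.newLockFreePool();
            ObjectMapper mapper = new ObjectMapper(
                    JsonFactory.builder().recyclerPool(pool).build());

            ExecutorService exec = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                exec.submit(() -> {
                    try {
                        while (!Thread.currentThread().isInterrupted()) {
                            String json = mapper.writeValueAsString(Map.of("key", "value", "n", 42));
                            mapper.readValue(json, Map.class);
                        }
                    } catch (Exception e) {
                        // ignore; this is only a sketch
                    }
                });
            }
            for (int i = 0; i < 60; i++) {
                Thread.sleep(1000L);
                // With an instrumented pool copy, print its size here instead of pooledCount().
                System.out.println("pooled count: " + pool.pooledCount());
            }
            exec.shutdownNow();
        }
    }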

As to adding a bounded size: I 100% agree that we need to understand the problem first and not add something that would likely just hide it.
Once we know what is going on we can reconsider.
(I have some ideas on how it could work, but no actual plan; the goal would be to avoid the need for a single synchronized counter.)

@franz1981 commented Apr 16, 2024

I am now reading through the past issues, and giving up on contention is probably not what we want.
Contention is an inherent problem of compareAndSet-like concurrent primitives, but allocating a new object while still being able to release it back to the pool can cause unbounded growth, which eventually consumes the CPU we tried so hard to save (spending it on GC marking and, eventually, moving -- given that once the pool grows it doesn't shrink).
Right now the easier choice, IMO, is to use a counter on acquire/release and keep trying, spinning with Thread::onSpinWait or yielding if there are too many attempts, but without ever giving up.

Another solution could be to spread the contention, or to use better primitives which "cannot fail", e.g. getAndSet.

But that would mean redesigning what the pool does; just serializing while too contended seems the easier choice to me.
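
A minimal sketch of the "never give up, just back off" idea (hypothetical, not jackson-core code): bounded spinning with Thread.onSpinWait(), falling back to Thread.yield() so that heavily contended callers effectively serialize instead of allocating:

    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch: keep retrying under contention, backing off instead of giving up.
    final class SpinWaitPool<T> {
        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        void release(T value) {
            Node<T> node = new Node<>(value);
            int attempts = 0;
            while (true) {
                Node<T> h = head.get();
                node.next = h;
                if (head.compareAndSet(h, node)) {
                    return;                       // success: the object is never dropped
                }
                backOff(++attempts);
            }
        }

        T acquireOrNull() {
            int attempts = 0;
            while (true) {
                Node<T> h = head.get();
                if (h == null) {
                    return null;                  // genuinely empty: caller allocates
                }
                if (head.compareAndSet(h, h.next)) {
                    return h.value;
                }
                backOff(++attempts);
            }
        }

        private static void backOff(int attempts) {
            if (attempts < 16) {
                Thread.onSpinWait();              // cheap busy-wait hint
            } else {
                Thread.yield();                   // serialize progress under heavy contention
            }
        }

        private static final class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }
    }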

@cowtowncoder (Member Author)

I may be wrong, but to me it seems like ConcurrentDequePool might be the safer but similar implementation to use -- it uses a Deque for storing pooled objects.
And compared to BoundedPool it has the benefit of lower overhead for cases where the pool itself is not used (there is no up-front allocation with a Deque, it being a linked list).
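
For illustration, the Deque-based approach has roughly this shape (a sketch, not the actual ConcurrentDequePool implementation): pollFirst()/offerFirst() never need a retry-and-give-up path, and the linked structure only allocates nodes when objects are actually pooled:

    import java.util.Deque;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.function.Supplier;

    // Sketch of a Deque-backed object pool.
    final class DequeBackedPool<T> {
        private final Deque<T> pool = new ConcurrentLinkedDeque<>();
        private final Supplier<T> factory;

        DequeBackedPool(Supplier<T> factory) { this.factory = factory; }

        T acquire() {
            T pooled = pool.pollFirst();          // null if the pool is empty
            return (pooled != null) ? pooled : factory.get();
        }

        void release(T value) {
            pool.offerFirst(value);               // always succeeds; no give-up path
        }
    }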

cowtowncoder added a commit that referenced this issue Apr 17, 2024
@cowtowncoder (Member Author)

@mariofusco @franz1981: I created #1265, which sort of hints at a possible problem, but isn't necessarily a smoking gun. When running through the 3 main implementations, LockFreePool occasionally gets a REALLY high pooled count -- I have seen 18,000 once (!) -- but it seems to stabilize over time. The other implementations typically remain under the thread count (going over isn't unexpected, since the test runs read and write in loops, so getting to 2x is theoretically possible).

So no, this does not show monotonically increasing behavior, but it suggests that spikes are possible.
It is of course possible (likely?) that contention behavior is heavily platform-dependent, and maybe running in Eclipse on a Mac mini just doesn't exhibit extreme contention.
I'll see if increasing the thread count from 100 (or going down to, say, 20) can trigger other behavior.

But if anyone else has time to run the test, and/or read the code and suggest changes, that would be useful.

@kkolyan commented Apr 17, 2024

Yet another report with a "memory leak" reproduction: https://github.com/kkolyan/jackson_leak.
It uses an ObjectMapper from 1000 threads and, over the course of an hour, fills up an 8 GB heap (with an initial occupancy of 600 MB of test data), leading to a permanent freeze.

@cowtowncoder (Member Author) commented Apr 17, 2024

@kkolyan Thank you very much for creating this! I hope to try it out locally and see whether the other 2 implementations (and in particular the Deque-backed one) fare better.

EDIT: looks like it's not trivial to switch implementations for measurement, due to RecyclerPoolSizeCalculator (I used 2.18.0-SNAPSHOT, where RecyclerPool.pooledCount() was added).

cowtowncoder added a commit that referenced this issue Apr 18, 2024
cowtowncoder added a commit that referenced this issue Apr 18, 2024
@cowtowncoder (Member Author)

@kkolyan Trying to run the reproduction, but since I am not a Gradle user, I'm not sure what invocation is expected. Guessing from build.gradle, I tried

./gradlew test
./gradlew application

neither of which works (the build succeeds with ./gradlew build).
It would be good to add the invocation to the repo's README.md?

I also had to add the mavenLocal() repository to use a locally built jackson-core 2.17.1-SNAPSHOT (or I could have added the Sonatype OSS repository, which has those builds too).

@cowtowncoder (Member Author)

I'm able to run the test in the IDE just fine. Not yet seeing much of a trend, assuming I read the output correctly.

But one related thing is this: I backported the addition of

    default int pooledCount() {
        return -1;
    }

to RecyclerPool on the 2.17 branch, so a locally built jackson-core (or one from the Sonatype OSS repo) has it available. This is useful to:

  1. Remove the need for RecyclerPoolSizeCalculator, since RecyclerPool.pooledCount() can be used instead
  2. Allow use of all RecyclerPool implementations, to see whether they do (or don't) exhibit similar leakage (see the sketch below)
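
For illustration, with pooledCount() available, comparing implementations becomes roughly this (a sketch; the JsonRecyclerPools factory-method names are as I recall them from 2.17 and may differ slightly):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.util.BufferRecycler;
    import com.fasterxml.jackson.core.util.JsonRecyclerPools;
    import com.fasterxml.jackson.core.util.RecyclerPool;

    // Hypothetical: pick the implementation to measure, keep a reference to it,
    // and read its size directly instead of needing RecyclerPoolSizeCalculator.
    public class PooledCountDemo {
        public static void main(String[] args) {
            RecyclerPool<BufferRecycler> pool = JsonRecyclerPools.newConcurrentDequePool();
            // ... or JsonRecyclerPools.newLockFreePool(), JsonRecyclerPools.newBoundedPool(100), etc.
            JsonFactory factory = JsonFactory.builder().recyclerPool(pool).build();
            // use `factory` (directly or via ObjectMapper) from test threads, then periodically:
            System.out.println("pooled: " + pool.pooledCount()); // -1 if the pool does not report its size
        }
    }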

@cowtowncoder (Member Author) commented Apr 20, 2024

Ok: created and resolved the follow-up work (#1256 for the 2.17.1 revert, #1266 for the 2.18 switch to ConcurrentDequePool).

One last thing: should we try to fix LockFreePool, or deprecate it (and remove it from 3.0)?

Created a placeholder ticket: #1271.

@cowtowncoder (Member Author)

Ok, I think we have a reasonable idea as to how the retention may happen: imbalanced failure rates between acquiring a reusable buffer from the pool (fails more often) and releasing a buffer back to the pool (fails less often).
Given this, closing the issue; will proceed with #1271.
