Fixed goroutine leak in reminders and timers #6523
Conversation
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Can we have a test to repro this issue?
A test would be nice. I am not sure how to approach writing a test for this, do you have any suggestions?
@@ -979,9 +979,6 @@ func (a *actorsRuntime) startReminder(reminder *reminders.Reminder, stopChannel
        break L
    }

if nextTimer.Stop() {
Reviewers: this was deleted because when we hit this stage, the reminder has already fired. Trying to drain it again would cause a deadlock.
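For readers following along, here is a minimal standalone sketch (not the Dapr code) of the standard time.Timer Stop-and-drain pattern, and of why draining the channel again after the value was already received would block forever:

package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.NewTimer(10 * time.Millisecond)

	// Receive the value: the timer has now fired and its channel is empty.
	<-t.C
	fmt.Println("timer fired")

	// After the value was received, Stop() returns false (the timer already
	// expired). A bare `<-t.C` here would block forever, because the single
	// value was consumed above and nothing will ever be sent again.
	if !t.Stop() {
		// A non-blocking drain is safe either way: it consumes a buffered
		// value if one is still pending, and falls through otherwise.
		select {
		case <-t.C:
		default:
		}
	}
	fmt.Println("cleaned up without blocking")
}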
Codecov Report
@@ Coverage Diff @@
## master #6523 +/- ##
==========================================
- Coverage 65.89% 65.86% -0.03%
==========================================
Files 199 199
Lines 19224 19229 +5
==========================================
- Hits 12668 12666 -2
- Misses 5549 5552 +3
- Partials 1007 1011 +4
I have a manual setup that can reproduce the memory leak in reminders consistently. I'll take these changes and test it out.
@ItalyPaleAle We can have an integration test that does something similar to what Artur did here: #6517 (comment)
My biggest problem is not how to run the test, but how to write a test that reliably determines the absence of the effect. I can't just have a reminder or timer execute and count the number of goroutines, because that can be affected by other things (there are background tasks that create goroutines randomly).
That is OK. Tests can fail due to another root cause later on. The test checks for the symptom of goroutine leaks by re-registering the same timer over and over again. If a later failure comes from somewhere else, the test can fail and that is OK. Every end-to-end test works that way: it checks for correct behavior, not one particular root cause.
Also, in this case the number of goroutines grows linearly with the number of requests, so it is hard to miss that in the test. Keep running the requests for some time and see if there are 1K goroutines: it is a leak any time of the day when that happens.
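A minimal sketch of that kind of check (the helper name and thresholds below are hypothetical, and this is not the test that was eventually added to the PR): register and fire the same timer many times, then fail if the goroutine count keeps growing instead of returning near the baseline.

package actors_test

import (
	"runtime"
	"testing"
	"time"
)

// registerAndFireTimer is a hypothetical stand-in for whatever the real test
// would do to register a timer and let it fire once.
func registerAndFireTimer(t *testing.T) {
	t.Helper()
	timer := time.NewTimer(time.Millisecond)
	<-timer.C
}

func TestTimerGoroutineLeak(t *testing.T) {
	baseline := runtime.NumGoroutine()

	// Re-register the same timer many times; a leak grows linearly with this.
	for i := 0; i < 1000; i++ {
		registerAndFireTimer(t)
	}

	// Give any finished goroutines a moment to exit.
	time.Sleep(500 * time.Millisecond)

	// Allow some slack for unrelated background goroutines, but a linear leak
	// (hundreds or thousands of extra goroutines) is impossible to miss.
	if got := runtime.NumGoroutine(); got > baseline+50 {
		t.Fatalf("goroutine leak suspected: baseline %d, now %d", baseline, got)
	}
}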
I can confirm this PR does not fix the memory leak issue. Running the steps outlined here (#6399 (comment)), which helped drive this memory leak fix, shows the Dapr process starting at ~30 MB when idle and staying at ~85-95 MB after the execution is completed.
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Just pushed some more changes:
Hopefully the test doesn't turn out to be flaky... I wrote it as a unit test because that allows more control over which goroutines are created (E2E tests, besides being a lot more complex due to the need to parse Prometheus output, have a lot more going on, and the number of goroutines there is less stable), and because integration tests don't have the placement service or apps running.
@yaron2 please try with the latest fix; I don't think it was complete. However, please note that measuring memory that way (from Docker) isn't accurate because of the GC:
See: https://povilasv.me/prometheus-go-metrics/

The best approach would be to just check the number of active goroutines, since memory can't really be leaked in Go anyway.
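To illustrate the GC point with a minimal standalone sketch (not Dapr code): the Go runtime can keep freed heap pages reserved instead of returning them to the OS right away, so the resident memory Docker reports can stay well above what the program actually uses.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Allocate ~100 MiB, drop the references, and force a GC.
	buf := make([][]byte, 0, 100)
	for i := 0; i < 100; i++ {
		buf = append(buf, make([]byte, 1<<20))
	}
	buf = nil
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc is what the program still uses; HeapIdle minus HeapReleased is
	// memory the runtime holds on to but has not yet returned to the OS, and it
	// still counts toward the RSS that Docker reports.
	fmt.Printf("HeapAlloc:    %d MiB\n", m.HeapAlloc>>20)
	fmt.Printf("HeapIdle:     %d MiB\n", m.HeapIdle>>20)
	fmt.Printf("HeapReleased: %d MiB\n", m.HeapReleased>>20)
}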
I will check again with the changes you're making to this PR, but I surely can't agree with the statement that Go can't have memory leaks, and I can certainly see memory being returned. With the fix mentioned earlier, the memory reported by Docker hung around ~300 MB after the same test run, and after the durabletask-go fix it consistently dropped to ~100 MB. Furthermore, memory snapshots showed that objects (timer Tickers) were indeed released from memory after the durabletask-go fix was applied, as opposed to snapshots before the fix that showed them still in use.
Memory leaks in Go are only possible in very limited cases, including incorrect use of global variables or goroutine leaks. The issue with durabletask was a goroutine leak too. As for how and when memory is returned, I recommend reading these:
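As a minimal illustration (not from this PR) of the kind of goroutine leak being discussed: a goroutine blocked forever on a channel that nobody will ever send on can never exit, and everything it references stays reachable.

package main

import (
	"fmt"
	"runtime"
	"time"
)

// leak starts a goroutine that waits on a channel no one ever sends to.
// The goroutine can never exit, and the buffer it captures is never freed.
func leak() {
	ch := make(chan struct{})
	buf := make([]byte, 1<<20) // 1 MiB pinned by the blocked goroutine
	go func() {
		<-ch // blocks forever: this is the leak
		_ = buf
	}()
}

func main() {
	for i := 0; i < 100; i++ {
		leak()
	}
	time.Sleep(100 * time.Millisecond)
	// The count grows by one per call to leak and never goes back down.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}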
I read these resources while fixing the durabletask-go issue. Whether this memory leak is caused by a goroutine leak or something else, it is clear based on the tests that this PR doesn't fix the issue, so we need to continue looking. Here are some good reads on a memory leak example with
A leak does not mean memory that is not used by the program; it means there is a reference holding the memory that stops the GC from reclaiming it. This is a similar problem to Java, where "Java does not have memory leaks", yet it does if your program keeps references to objects that will never be accessed again.
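A minimal Go illustration of that kind of leak (hypothetical names, not from this codebase): a global map that is only ever appended to keeps every value reachable, so the GC can never reclaim them even though the program will never read them again.

package main

import (
	"fmt"
	"runtime"
)

// cache grows forever: entries are added but never deleted, so every value
// stays reachable and the GC cannot reclaim it.
var cache = map[int][]byte{}

func handleRequest(id int) {
	cache[id] = make([]byte, 1<<20) // 1 MiB retained per request, forever
}

func main() {
	for i := 0; i < 200; i++ {
		handleRequest(i)
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc stays high because the map still references every buffer.
	fmt.Printf("heap in use: %d MiB\n", m.HeapAlloc>>20)
}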
Please test with the latest changes. I reckon the fix wasn't complete when you posted your comment above (it only fixed goroutine leaks when reminders were deleted, but not when they fired). It should be fine now, as validated by the tests.
@ItalyPaleAle can you add an E2E test to repro the mem leak? This way, it is easy to test with and without this change.
Tested with the latest changes; the memory leak still exists to the same effect.
Could you share the code used for testing, please? This way I can look into what's still going on.
Sure, I already did here.
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Ok, it's fixed now. Timers were good, but reminders needed an additional change. I used the sample above, but also enabled Prometheus and pprof. I can see now that the number of goroutines goes back to where it was before the test started. Here's how I checked the number of goroutines:

bash -c "while true; do curl localhost:9091 -s | grep 'go_goroutines' | grep -Fv '#'; sleep 8; done"

Output
Note that it takes a while for this to go back to the "base" level because of TCP connections that are still open. pprof shows the goroutines blocked for a while on some netpoll work. After the connections are closed (I would assume that's when the TCP keepalives time out), it goes back to the "base" level.
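For context, that goroutine view comes from the standard net/http/pprof package. A minimal way to expose it in a standalone Go program (the port is an arbitrary choice here, and this is not how Dapr wires it up) is:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
	// The goroutine profile at /debug/pprof/goroutine?debug=1 then lists every
	// goroutine with the stack it is currently blocked on.
	log.Println(http.ListenAndServe("localhost:6060", nil))
}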
Awesome. Is it possible to have an E2E test for this?
After the latest changes, the memory is still held at > 105 MB after all reminders finish firing. This may have reduced the number of goroutines, but there are still unreclaimed objects in memory.
* Fixed goroutine leak in reminders and timers
* Added unit tests + some more tweaks
* Fixed last goroutine leaks
* Comments

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
Fixes #6517
Fixes #6503 (assuming there aren't other leaks)
Fixes the goroutine leak that was also causing memory to leak when using actor reminders and timers