
Fixed goroutine leak in reminders and timers #6523

Merged
merged 13 commits into dapr:master on Jun 20, 2023

Conversation

@ItalyPaleAle (Contributor)

Fixes #6517
Fixes #6503 (assuming there aren't other leaks)

Fixes the goroutine leak that was also causing memory to leak when using actor reminders and timers

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@artursouza (Member) left a comment


Can we have a test to repro this issue?

@ItalyPaleAle (Contributor, Author)

A test would be nice. I am not sure how to approach writing a test for this; do you have any suggestions?

@@ -979,9 +979,6 @@ func (a *actorsRuntime) startReminder(reminder *reminders.Reminder, stopChannel
break L
}

if nextTimer.Stop() {
@ItalyPaleAle (Contributor, Author)

Reviewers: this was deleted because, by the time we reach this point, the reminder has already fired. Trying to drain the timer's channel again would cause a deadlock.
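To illustrate the time.Timer semantics at play here (a standalone sketch, not dapr code): the usual Stop-then-drain idiom is only safe when the timer's channel has not been received from yet.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.NewTimer(10 * time.Millisecond)
	<-t.C // the reminder "fires": the single value in t.C is consumed here

	// The usual idiom for cancelling a timer that has NOT been received from:
	//
	//	if !t.Stop() {
	//		<-t.C // drain the value the timer already sent
	//	}
	//
	// After the receive above, Stop() still returns false (the timer already
	// fired), but t.C is empty and will never be written to again, so an
	// unconditional drain would block forever. A select with a default
	// shows the state without deadlocking:
	if !t.Stop() {
		select {
		case <-t.C:
			fmt.Println("drained a pending value") // unreachable here
		default:
			fmt.Println("channel already empty: an unconditional drain would deadlock")
		}
	}
}
```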

@codecov (bot) commented Jun 15, 2023

Codecov Report

Merging #6523 (ac8dbf2) into master (e7641ff) will decrease coverage by 0.03%.
The diff coverage is 40.00%.

❗ Current head ac8dbf2 differs from pull request most recent head 4b4ac3d. Consider uploading reports for the commit 4b4ac3d to get more accurate results

@@            Coverage Diff             @@
##           master    #6523      +/-   ##
==========================================
- Coverage   65.89%   65.86%   -0.03%     
==========================================
  Files         199      199              
  Lines       19224    19229       +5     
==========================================
- Hits        12668    12666       -2     
- Misses       5549     5552       +3     
- Partials     1007     1011       +4     
Impacted Files Coverage Δ
pkg/actors/actors.go 66.79% <40.00%> (+0.25%) ⬆️

... and 6 files with indirect coverage changes

@JoshVanL mentioned this pull request Jun 15, 2023 (17 tasks)
@yaron2 (Member) commented Jun 15, 2023

I have a manual setup that can reproduce the memory leak in reminders consistently. I'll take these changes and test it out.

@JoshVanL (Contributor)

> A test would be nice. I am not sure how to approach writing a test for this; do you have any suggestions?

@ItalyPaleAle We can have an integration test that does something similar to what Artur did here: #6517 (comment)

@ItalyPaleAle (Contributor, Author)

> A test would be nice. I am not sure how to approach writing a test for this; do you have any suggestions?
>
> @ItalyPaleAle We can have an integration test that does something similar to what Artur did here: #6517 (comment)

My biggest problem is not how to run the test, but how to write a test that reliably verifies the absence of the leak. I can't just have a reminder or timer execute and then count the number of goroutines, because that count can be affected by other things (there are background tasks that create goroutines at unpredictable times).

@ItalyPaleAle added the `autoupdate` label (DaprBot will keep the Pull Request up to date with master branch) Jun 16, 2023
@artursouza (Member)

> A test would be nice. I am not sure how to approach writing a test for this; do you have any suggestions?
>
> @ItalyPaleAle We can have an integration test that does something similar to what Artur did here: #6517 (comment)
>
> My biggest problem is not how to run the test, but how to write a test that reliably verifies the absence of the leak. I can't just have a reminder or timer execute and then count the number of goroutines, because that count can be affected by other things (there are background tasks that create goroutines at unpredictable times).

That is OK. Tests can fail due to another root cause later on. This one tests for the symptom of goroutine leaks by re-registering the same timer over and over again. If a failure later comes from somewhere else, the test can fail and that is OK. Every end-to-end test works that way: it checks for correct behavior, not one particular root cause.

@artursouza (Member)

> A test would be nice. I am not sure how to approach writing a test for this; do you have any suggestions?
>
> @ItalyPaleAle We can have an integration test that does something similar to what Artur did here: #6517 (comment)
>
> My biggest problem is not how to run the test, but how to write a test that reliably verifies the absence of the leak. I can't just have a reminder or timer execute and then count the number of goroutines, because that count can be affected by other things (there are background tasks that create goroutines at unpredictable times).
>
> That is OK. Tests can fail due to another root cause later on. This one tests for the symptom of goroutine leaks by re-registering the same timer over and over again. If a failure later comes from somewhere else, the test can fail and that is OK. Every end-to-end test works that way: it checks for correct behavior, not one particular root cause.

Also, in this case the number of goroutines grows linearly with the number of requests; that is hard to miss in a test. Keep running requests for some time and see if there are 1K goroutines: that is a leak, any time of the day.
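Sketched as a test, that symptom-based check could look like this (registerTimer is a hypothetical stand-in for the actor runtime API, not the actual dapr call; the iteration count and slack are arbitrary):

```go
package actors_test

import (
	"runtime"
	"testing"
	"time"
)

// registerTimer is a hypothetical stand-in for the runtime call that
// (re-)registers an actor timer; it is not the actual dapr API.
func registerTimer(actorID, timerName string) { /* ... */ }

func TestTimerGoroutineLeak(t *testing.T) {
	before := runtime.NumGoroutine()

	// Re-register the same timer many times: with the leak, each call
	// left a goroutine behind, so growth is linear and hard to miss.
	for i := 0; i < 1000; i++ {
		registerTimer("myactor", "mytimer")
	}

	// Give background work a moment to settle before counting again.
	time.Sleep(2 * time.Second)

	after := runtime.NumGoroutine()
	if after-before > 50 { // generous slack for unrelated background goroutines
		t.Fatalf("goroutine count grew from %d to %d: likely a leak", before, after)
	}
}
```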

@yaron2 (Member) commented Jun 16, 2023

I can confirm this PR does not fix the memory leak issue. Running the steps outlined in #6399 (comment), which helped drive this memory leak fix, shows the Dapr process starting at ~30MB when idle and staying at ~85-95MB after the execution is completed.

@ItalyPaleAle (Contributor, Author)

Just pushed some more changes:

  1. There was still something leaking. The original patch fixed leaks when a reminder/timer was deleted, but not when it fired. Now all situations are fixed, including reminders/timers firing or reaching their TTL.
  2. Added a unit test that checks goroutines before and after, validating reminders that are deleted, that fire (with no more repeats), and that reach their TTL.

Hopefully the test doesn't turn out to be flaky... I wrote it as a unit test because that allows more control over which goroutines are created (E2E tests, besides being a lot more complex due to the need to parse Prometheus output, have a lot more going on, so the number of goroutines there is less stable) and because integration tests don't have the placement service and don't have apps running.
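One way to keep such a before/after check from being flaky, sketched here as a hypothetical helper rather than the exact test added in this PR: poll until the goroutine count settles back to the baseline instead of asserting a single instantaneous reading.

```go
package actors_test

import (
	"runtime"
	"testing"
	"time"
)

// assertGoroutinesReturn polls until the goroutine count drops back to
// (roughly) the recorded baseline, failing only if it never settles.
// Hypothetical helper; not the exact test added in this PR.
func assertGoroutinesReturn(t *testing.T, baseline int) {
	t.Helper()
	deadline := time.Now().Add(5 * time.Second)
	for time.Now().Before(deadline) {
		// Allow a little slack for goroutines the runtime itself owns.
		if runtime.NumGoroutine() <= baseline+2 {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	t.Fatalf("goroutines did not return to baseline %d (currently %d)",
		baseline, runtime.NumGoroutine())
}
```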

@ItalyPaleAle (Contributor, Author)

> I can confirm this PR does not fix the memory leak issue. Running the steps outlined in #6399 (comment), which helped drive this memory leak fix, shows the Dapr process starting at ~30MB when idle and staying at ~85-95MB after the execution is completed.

@yaron2 please try with the latest fix; I don't think it was complete.

However, please note that measuring memory that way (from Docker) isn't accurate, because of the GC:

  1. Memory can remain allocated even if the objects are no longer in use; those objects stick around until the next time the GC runs
  2. The GC doesn't return all memory to the OS after it runs

See: https://povilasv.me/prometheus-go-metrics/

Best would be to just check the number of active goroutines, since memory can't really be leaked in Go anyway.
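To make points 1 and 2 concrete, here is a small standalone program (standard library only, nothing dapr-specific) showing that live heap and OS-visible heap are different numbers:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

var sink []byte // package-level reference keeps the allocation alive

func printHeap(label string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: HeapAlloc=%d MiB, HeapReleased=%d MiB\n",
		label, m.HeapAlloc>>20, m.HeapReleased>>20)
}

func main() {
	sink = make([]byte, 64<<20) // allocate 64 MiB on the heap
	printHeap("allocated")

	sink = nil // drop the reference: the bytes are garbage, but not yet collected
	printHeap("unreferenced")

	// Force a collection and ask the runtime to return freed pages to the
	// OS. Without this, the OS-visible footprint (what Docker reports) can
	// stay high long after the objects became unreachable.
	runtime.GC()
	debug.FreeOSMemory()
	printHeap("after FreeOSMemory")
}
```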

@yaron2 (Member) commented Jun 16, 2023

> I can confirm this PR does not fix the memory leak issue. Running the steps outlined in #6399 (comment), which helped drive this memory leak fix, shows the Dapr process starting at ~30MB when idle and staying at ~85-95MB after the execution is completed.
>
> @yaron2 please try with the latest fix; I don't think it was complete.
>
> However, please note that measuring memory that way (from Docker) isn't accurate, because of the GC:
>
>   1. Memory can remain allocated even if the objects are no longer in use; those objects stick around until the next time the GC runs
>   2. The GC doesn't return all memory to the OS after it runs
>
> See: https://povilasv.me/prometheus-go-metrics/
>
> Best would be to just check the number of active goroutines, since memory can't really be leaked in Go anyway.

I will check again with the changes you're making to this PR, but surely I can't agree with the statement that Go can't have memory leaks, and I can certainly see memory being returned. With the fix mentioned earlier, the memory reported by Docker hung around at ~300MB after the same test run, and after the durabletask-go fix it consistently dropped to ~100MB. Furthermore, memory snapshots showed that objects (timer Tickers) were indeed released from memory after the durabletask-go fix was applied, as opposed to snapshots before the fix, which showed them still in use.

@ItalyPaleAle (Contributor, Author)

Memory leaks in Go are only possible in a few limited cases, including incorrect use of global variables or goroutine leaks. The issue with durabletask was a goroutine leak too.
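To illustrate the global-variable case (hypothetical names, a minimal sketch): a package-level map that only ever grows keeps every value reachable, so the GC can never free them.

```go
package main

import "fmt"

// cache is package-level, so every entry stays reachable for the whole
// process lifetime: the GC is working correctly, yet memory still grows
// without bound if keys are unbounded and entries are never deleted.
var cache = map[string][]byte{}

func handle(id string, payload []byte) {
	cache[id] = payload // stored forever: this is the leak
}

func main() {
	for i := 0; i < 3; i++ {
		handle(fmt.Sprintf("req-%d", i), make([]byte, 1<<20))
	}
	fmt.Println("entries held:", len(cache)) // never shrinks
}
```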

As for how and when memory is returned, I recommend reading these:

@yaron2 (Member) commented Jun 16, 2023

> Memory leaks in Go are only possible in a few limited cases, including incorrect use of global variables or goroutine leaks. The issue with durabletask was a goroutine leak too.
>
> As for how and when memory is returned, I recommend reading these:

I read these resources while fixing the durabletask-go issue. Whether this memory leak is caused by a goroutine leak or something else, it is clear based on the tests that this PR doesn't fix the issue, so we need to continue looking.

Here are some good reads on memory leak examples with time:

@artursouza (Member)

> Memory leaks in Go are only possible in a few limited cases, including incorrect use of global variables or goroutine leaks. The issue with durabletask was a goroutine leak too.
>
> As for how and when memory is returned, I recommend reading these:

Leak does not mean memory not used by the program; it means that there is a reference keeping the memory in use, which stops the GC from returning it. The same problem exists in Java: "Java does not have memory leaks", yet it does if your program keeps references to objects that will never be accessed again.

@ItalyPaleAle (Contributor, Author)

> I read these resources while fixing the durabletask-go issue. Whether this memory leak is caused by a goroutine leak or something else, it is clear based on the tests that this PR doesn't fix the issue, so we need to continue looking.

Please test with the latest changes. I reckon the fix wasn't complete when you posted your comment above (it only covered goroutine leaks when reminders were deleted, not when they fired). It should be fine now, as validated by the tests.

@artursouza (Member)

@ItalyPaleAle can you add an E2E test to repro the mem leak? This way, it is easy to test with and without this change.

@yaron2 (Member) commented Jun 19, 2023

> I read these resources while fixing the durabletask-go issue. Whether this memory leak is caused by a goroutine leak or something else, it is clear based on the tests that this PR doesn't fix the issue, so we need to continue looking.
>
> Please test with the latest changes. I reckon the fix wasn't complete when you posted your comment above (it only covered goroutine leaks when reminders were deleted, not when they fired). It should be fine now, as validated by the tests.

Tested with the latest changes; the memory leak still exists, to the same effect.

@ItalyPaleAle (Contributor, Author)

Could you share the code used for testing please? This way I can look into what's still going on.

@yaron2 (Member) commented Jun 19, 2023

> Could you share the code used for testing please? This way I can look into what's still going on.

Sure, already did here

@ItalyPaleAle (Contributor, Author) commented Jun 19, 2023

OK, it's fixed now. Timers were good, but reminders needed a nextTimer = nil in a few more places. Calling Stop() on a timer that has already fired returns false, which normally signals that the channel needs to be drained; but because the value had already been received, draining blocks forever, causing a goroutine leak. So the variable has to be set to nil instead.

I used the sample above, but also enabled Prometheus and pprof. I can see now that the number of goroutines goes back to where it was before the test started.
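To illustrate the pattern (a minimal sketch with simplified names, not the actual code in pkg/actors/actors.go): the loop waits on either the reminder's timer or a stop channel, and once a value has been received from the timer, the variable is set to nil so the deferred cleanup knows there is nothing left to stop or drain.

```go
package main

import "time"

// reminderLoop sketches the select loop: once a value has been received
// from the timer, the timer must not be stopped-and-drained again, so
// the variable is set to nil to tell the deferred cleanup to skip it.
func reminderLoop(period time.Duration, repeats bool, stopCh <-chan struct{}) {
	nextTimer := time.NewTimer(period)
	defer func() {
		// Only stop-and-drain a timer whose value was never received;
		// draining an already-received timer blocks forever.
		if nextTimer != nil && !nextTimer.Stop() {
			<-nextTimer.C
		}
	}()

L:
	for {
		select {
		case <-nextTimer.C:
			nextTimer = nil // fired and received: nothing left to drain
			// ... execute the reminder here ...
			if !repeats {
				break L
			}
			nextTimer = time.NewTimer(period) // re-arm for the next run
		case <-stopCh:
			break L
		}
	}
}

func main() {
	stop := make(chan struct{})
	go reminderLoop(50*time.Millisecond, false, stop)
	time.Sleep(100 * time.Millisecond) // let the one-shot reminder fire and exit
}
```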

Here's the output of checking the number of goroutines:

bash -c "while true; do curl localhost:9091 -s | grep 'go_goroutines' | grep -Fv '#'; sleep 8; done" 
Output
go_goroutines 58
go_goroutines 59
go_goroutines 247
go_goroutines 229
go_goroutines 190
go_goroutines 249
go_goroutines 229
go_goroutines 190
go_goroutines 211
go_goroutines 90
go_goroutines 90
go_goroutines 89
go_goroutines 90
go_goroutines 89
go_goroutines 90
go_goroutines 90
go_goroutines 89
go_goroutines 76
go_goroutines 76
go_goroutines 77
go_goroutines 77
go_goroutines 76
go_goroutines 77
go_goroutines 76
go_goroutines 77
go_goroutines 77
go_goroutines 76
go_goroutines 76
go_goroutines 57
go_goroutines 57
go_goroutines 56
go_goroutines 56
go_goroutines 57
go_goroutines 57
go_goroutines 57
go_goroutines 56
go_goroutines 56
go_goroutines 57
go_goroutines 57
go_goroutines 57
go_goroutines 57
go_goroutines 57
go_goroutines 56
go_goroutines 57
go_goroutines 56
go_goroutines 55

Note it takes a while for this to go back to the "base" level because of TCP connections that are still open: pprof shows those goroutines blocked for a while on netpoll. After the connections are closed (I would assume that's when the TCP keepalives time out), it goes back to the "base" level.

artursouza and others added 2 commits June 18, 2023 23:12
@artursouza (Member)

Awesome. Is it possible to have an E2E test for this?

@yaron2 (Member) commented Jun 19, 2023

After the latest changes, the memory is still held at >105MB when all reminders finish firing. This may have reduced the number of goroutines, but there are still unclaimed objects in memory.

@artursouza merged commit ca08aa9 into dapr:master Jun 20, 2023
14 of 15 checks passed
ItalyPaleAle added a commit to ItalyPaleAle/dapr that referenced this pull request Jun 20, 2023
* Fixed goroutine leak in reminders and timers

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Added unit tests + some more tweaks

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Fixed last goroutine leaks

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Comments

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
artursouza added a commit that referenced this pull request Jun 20, 2023
* Fixed goroutine leak in reminders and timers

* Added unit tests + some more tweaks

* Fixed last goroutine leaks

* Comments

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Artur Souza <asouza.pro@gmail.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
@ItalyPaleAle added this to the v1.12 milestone Jul 17, 2023
Labels
autoupdate (DaprBot will keep the Pull Request up to date with master branch)
Development

Successfully merging this pull request may close these issues.

Runtime memory leak using Actor timers
[Actor Reminders] Potential memory leak
5 participants