

Update docs and mention explicit close required change in v1.8.11 #445

Open
marekmartins opened this issue Apr 10, 2024 · 16 comments

@marekmartins

Problem

In our production systems we upgraded to the latest patch version and noticed excessive memory usage; as far as I checked, it seemed to be a goroutine leak. Memory went up, but when the connection count went down, memory stayed at roughly the same level.

Something is probably not being cleaned up properly on close, but I did not dig any further. For now we rolled back to v1.8.10 and everything is back to normal.

Our use case:
We have roughly 50k websockets open at peak time during the day, 500 messages pushed per second, 30 new websockets opened per second, and 30 old websockets closed per second.

For each websocket connection we also run a PING-PONG keepalive in a separate goroutine, to prevent the connection from being closed for some clients under certain circumstances.
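A simplified sketch of that keepalive goroutine (assuming nhooyr.io/websocket's Conn.Ping; the interval and names are illustrative, not our exact code):

```go
import (
	"context"
	"time"

	"nhooyr.io/websocket"
)

// keepAlive pings the peer on a fixed interval so idle connections are not
// dropped. It returns when the context is cancelled or a ping fails.
func keepAlive(ctx context.Context, c *websocket.Conn, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			// Ping waits for the corresponding pong, so bound it with a deadline.
			pingCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
			err := c.Ping(pingCtx)
			cancel()
			if err != nil {
				return
			}
		}
	}
}
```

We start it with `go keepAlive(ctx, c, 30*time.Second)` right after the connection is established.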

Maybe it's an edge case in our usage, but to us it looks like a regression was introduced.
If I can help in any other way, let me know 🙏

@nhooyr
Owner

nhooyr commented Apr 10, 2024

Oh no :(

Can you post a goroutine trace of the leaked goroutines?

@marekmartins
Author

Maybe you can give me some pointers on the easiest way to do that, and in a form most useful for you?
Then I'll try to look into it more over the next few days.

@nhooyr
Owner

nhooyr commented Apr 10, 2024

If you send your program a SIGQUIT it should dump the stacks of all active goroutines. That would help a lot.

If you cannot do that, you can use delve to attach to a running Go program and get a trace of all goroutines.
See go-delve/delve#1475

I think net/http/pprof also supports a debug endpoint that prints a trace of all goroutines.

All are equally useful options, so please pick whichever is most convenient for you and get back to me as soon as possible.
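For the net/http/pprof route, if you're not already exposing it, the setup is roughly just this (the port is only an example):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Keep this listener internal only; it exposes runtime internals.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Then /debug/pprof/ in a browser lists the available profiles.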

@nhooyr
Owner

nhooyr commented Apr 11, 2024

Also, just to note: we do ensure there are no hanging goroutines after running all the tests. So whatever this is, it's something that doesn't come up in the tests.

How do you know it's actually a goroutine leak? It could just be some memory being held somewhere indefinitely.

@nhooyr
Owner

nhooyr commented Apr 11, 2024

Actually, I don't see how that could happen either. I definitely need a trace or further info to debug this.

@marekmartins
Author

marekmartins commented Apr 11, 2024

Thanks. I'll try to look into this.

I know this from the Prometheus/Grafana metrics we are tracking.
Each night the connection count drops from 50k to roughly 2k, and the connection-related goroutine count and memory drop along with it.

With the newest patch version that drop was not there like it should be; the graph pretty much flatlined.
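For reference, the goroutine count on those dashboards comes from the standard Go runtime collector, roughly like this (a sketch, our real setup differs):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	reg := prometheus.NewRegistry()
	// NewGoCollector exposes go_goroutines, go_memstats_*, etc.
	reg.MustRegister(collectors.NewGoCollector())
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil)) // example port
}
```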

EDIT
I deployed it to our test environment, but the load there is so low that it will take a little while to get anything meaningful out of it.
I will profile the application there in the next few days, take a goroutine profile using pprof, and share the file with you.

@marekmartins
Author

marekmartins commented Apr 12, 2024

I let it run in the test environment for a little while and I already see a pattern of the goroutine count going up. I took a profile at a moment when there were 8 connections open. We track this via Prometheus metrics and it's very precise.

I attached part of the goroutine profile here. I think for some reason the timeoutLoop goroutine does not always exit when the connection is actually closed, or it somehow gets stuck. The profile shows 102 of them, which seems way off compared to the 8 connections reported by the other metrics. This number also did not decrease overnight.

I'll let it run longer and see if that number keeps going up. On Monday I'll have newer numbers.

[screenshot: goroutine profile showing 102 timeoutLoop goroutines]

@nhooyr
Owner

nhooyr commented Apr 12, 2024

What kind of goroutine profile is that? What do the percentages mean?

It would help tons if you could get the stacks of those timeoutLoop goroutines. I looked over the code and it isn't clear how it could be getting stuck.

@marekmartins
Author

marekmartins commented Apr 13, 2024

I'm using the net/http/pprof package for that; it registers HTTP handlers that return the different profiles.
This particular one is the /debug/pprof/goroutine profile.

https://pkg.go.dev/net/http/pprof

I took a new profile and now it shows 123 with no actual connections open.
[screenshot: goroutine profile showing 123 timeoutLoop goroutines]

So it's steadily growing in the test environment.

I attached the latest profile file as a zip. Unzip it and inspect it with:
go tool pprof -http=:8080 test.cmh1-3.goroutine.prof

Then you can check it out in the browser. I can also send snapshots of the other profile types such as heap, mutex, etc., anything the pprof package supports.

Let me know 🙏

Also, next week I will debug timeoutLoop and try to see if I can reproduce the situation where those goroutines do not exit.

test.cmh1-3.goroutine.zip

@nhooyr
Owner

nhooyr commented Apr 13, 2024

It's confusing, but there are multiple goroutine profiles. Can you go to /debug/pprof/ in a browser and then click "full goroutine stack dump"? That's the one I need, and it's just plaintext. The link should be /debug/pprof/goroutine?debug=2.

@nhooyr
Owner

nhooyr commented Apr 13, 2024

One possible scenario in which I can see this behavior arising is if you're not calling c.Close/c.CloseNow on every connection after you're done with it. Previously, if an error occurred on the connection, such as on a read/write, it would be automatically closed and cleaned up for you, but this latest version removed that in favor of always requiring c.Close/c.CloseNow.

Though the docs have always read:

Be sure to call Close on the connection when you are finished with it to release associated resources.

So this behavior was never to be relied on. Though I can see how that's confusing with the next line:

On any error from any method, the connection is closed with an appropriate reason.

I'll remove this line from the docs.
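In other words, as of v1.8.11 every handler needs to close the connection itself, along these lines (an illustrative echo-style handler, not your code):

```go
import (
	"net/http"

	"nhooyr.io/websocket"
)

func handler(w http.ResponseWriter, r *http.Request) {
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		return
	}
	// Guarantees the connection's resources are released on every exit path.
	defer c.CloseNow()

	ctx := r.Context()
	for {
		typ, msg, err := c.Read(ctx)
		if err != nil {
			// A read/write error no longer cleans up the connection for you;
			// the deferred CloseNow above does it.
			return
		}
		if err := c.Write(ctx, typ, msg); err != nil {
			return
		}
	}
}
```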

@nhooyr
Owner

nhooyr commented Apr 13, 2024

The quoted docs are from: https://godocs.io/nhooyr.io/websocket#Conn

@marekmartins
Author

marekmartins commented Apr 14, 2024

You're right. When a connection was closed from the client side and the main read loop ended because of that, we did not explicitly close it on the server side, since it previously released its resources on its own without that.
If that's the cause, then sorry for the trouble 😞. I will add Close for this case as well next week and report back here whether it resolved the issue.

Also, since I'm probably not the only one, and the previous behavior, even if unintentional, was still there, it would be wise to mention it in the release notes as well.
To me it seems an important behavioral change to mention 😅

@nhooyr
Owner

nhooyr commented Apr 14, 2024

Awesome! Yup, I will add it to the release notes. Thanks @marekmartins.

@nhooyr changed the title from "memory leak on latest patch v1.8.11" to "Update docs and mention behaviour change in v1.8.11" on Apr 14, 2024
@nhooyr changed the title from "Update docs and mention behaviour change in v1.8.11" to "Update docs and mention explicit close required change in v1.8.11" on Apr 14, 2024
@marekmartins
Author

Letting you know that adding an explicit Close seems to have solved the problem for us! Thank you :) Feel free to close this issue.

@nhooyr
Owner

nhooyr commented Apr 17, 2024

Awesome, good to hear!
