Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection/Channel Deadlock #32

Closed
melardev opened this issue Jan 3, 2022 · 9 comments
Closed

Connection/Channel Deadlock #32

melardev opened this issue Jan 3, 2022 · 9 comments

Comments

@melardev
Copy link

melardev commented Jan 3, 2022

Hi, I already covered created an issue for this on streadway/amqp#518 but then it was for streadway/amqp the issue still reproduces with this library so I decided to add the issue here too

I have created a sample project with few files that you have to download and run the main to reproduce the issue, the files are on:
https://gist.github.com/melardev/1b9c7e1b1a4ac37cb31e57dc6cde99c7

The code is bad, it is not thread-safe, but It was easier to reproduce my issue with that code rather than my real project that was doing effort to keep things thread-safe, the idea is to show the framework hanging yet the connection is already closed, it may get stuck in any function call, mainly QueueDeclare and channel.Close() but most of the times it gets stuck on channel.Close() the issue is for closing a channel it has to acquire a Lock https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L640, which is already held here https://github.com/rabbitmq/amqp091-go/blob/main/connection.go#L408, basically the connection is waiting the channel to be closed but the channel::shutdown() never exits its main loop, even if the connection is closed ...

Selection_389

@melardev
Copy link
Author

melardev commented Jan 3, 2022

The two danger zones:

I replaced both with a select statement to see if it runs great, I assumed it would break something else, I don't know if it did, but at least it does not hang anymore ....

select {
  case <-time.After(time.Second * 5):
    break
  case c <- err:
    break
  }

@melardev
Copy link
Author

melardev commented Jan 3, 2022

I think that the issue is the channel that is notified about channel close is no longer being used to catch the amqpError in my code, at that time the channel has been initiated ... and so notifying a channel that is not listening in Golang hangs ... very hard problem in Golang, I am not sure if you would ever fix this, or tag it as working as expected, but well ... IMO frameworks should never hang forever even if they are misused.

@Zerpet
Copy link
Contributor

Zerpet commented Jan 4, 2022

Thank you for reporting this issue. Deadlocks are something worth looking into. I'll keep this issue around to have a look at some point in the coming days.

@andygrunwald
Copy link
Contributor

This issue is very close to #18.
Both describe the same behavior.

@remster
Copy link

remster commented Feb 8, 2022

Up-voting - burnt 2hrs yesterday on this. Might give client ability to unsubscribe, but best relieve them and avoid the deadlock by non-blocking write as suggested above.

@melardev
Copy link
Author

melardev commented Feb 8, 2022

@remster You can not unsubscribe, as soon as you call NotifyClose, the framework will take that chan, save it in its internal slice, and notify it when the RabbitMQ channel is closed, no matter what, you better have a listener on that chan otherwise the framework will be signaling an "orphan" chan leading to a deadlock.

@malc0mn
Copy link

malc0mn commented Feb 19, 2022

Also running into this one and it really is the same as #18.
As already suggested in #18, I would also make the library notification code non blocking by using the idiomatic non blocking select:

select {
case c <- err:
default:
}

I think the amqp091-go library should not deadlock at all in this case. Not causing a deadlock is definitely preferred over caring if the user of the library receives proper alerts on shutdown, that really should be the users care but the documentation/examples should explain that.

If the idea is that the user of the library must properly subscribe to the connClose and chanClose notifications, then we should somehow implement these checks in the library so that the user will be made aware of this expectation.
But not deadlocking still seems preferable to me.

Edit: beginning to get a better understanding of the issue. With this new insight, what I said above is not the right solution. The examples should be better on how to handle things. I'll make sure to post some code when I'm done testing.

@DanielePalaia
Copy link
Contributor

As for comment in #18, I'll cose this one as the issue seems the same.

@DanielePalaia
Copy link
Contributor

As for comment in #18, I'll cose this one as the issue seems the same.

Zerpet added a commit that referenced this issue Aug 18, 2022
The example function Example_consume() shows how to write a consumer
with reconnection support.

This commit also changes the notification channels to be buffered with
capacity of 1. This is important to avoid potential deadlocks during an
abnormal disconnection. See #32 and #18 for more details.

Both example functions now have a context with timeout, so that they
don't run forever. The QoS of 1 is set to slowdown the consumption; this
is helpful to test the reconnection capabilities, by giving enough time
to close the connection via the Management UI (or any other means).

Signed-off-by: Aitor Perez Cedres <acedres@vmware.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants