Investigate other patterns for resilient consumers/connections #62

mmmries · 2017-07-18T16:00:55Z

The current implementations of ConnectionSupervisor and ConsumerSupervisor work, but I'm not sure they are idiomatic or ideal.

I had a conversation on the Elixir Forum where I got some feedback about other possible approaches to this problem.


brandynbennett · 2022-07-18T16:42:36Z

From NATS docs

All the client libraries maintained on the nats.io GitHub page will automatically attempt to re-connect if their current server connection gets disconnected for any reason. Upon re-connection the client library will automatically re-establish all the subscriptions, there is nothing for the application programmer to do.

It seems like the Elixir library should handle reconnects to match what the client libraries in other languages do. This is, however, a difficult problem with many edge cases.

brandynbennett · 2022-07-20T17:28:37Z

Resilient Connection Challenges

https://docs.nats.io/using-nats/developer/connecting/reconnect

1. Discover a disconnection and reconnect automatically
2. Reconnect to any available server in the cluster
3. Connection wait time
   - Avoid thrashing by trying to reconnect over and over really fast
4. Max connection attempts
   - This includes the server provided to the client's connect call, as well as the servers the client discovered through another server.
5. Size of reconnection buffer
6. Allow reconnect boolean
7. Reconnect jitter
8. TLS Reconnect
9. Discovering all available servers
10. Randomize which server is connected to, to avoid the thundering herd
    - Allow randomization to be turned off
11. Notify application of reconnection
12. A buffer to hold published messages during a disconnection
    - Automatically publish those messages when the connection is reestablished
    - Configure the size of the buffer
    - Error when the buffer size is reached
    - Acknowledgements to ensure a message was actually delivered and not just buffered
13. Slow Consumers
    - Limit incoming messages by count or bytes
    - Detect slow consumer
    - Notify of dropped messages
14. No responders
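
To make the wait time, max attempts, jitter, and server randomization items above concrete, here is a minimal sketch of how they could fit together. `ReconnectSketch` and its option names (`:reconnect_wait`, `:reconnect_jitter`, `:max_attempts`) are hypothetical and not part of Gnat, and the default values are placeholders.

```elixir
defmodule ReconnectSketch do
  @moduledoc "Hypothetical helper for illustration only; not part of Gnat."

  # Decide what to do before reconnect attempt number `attempt` (1-based).
  # Returns {:retry, delay_ms} or :give_up.
  def next_delay(attempt, opts \\ []) do
    wait = Keyword.get(opts, :reconnect_wait, 2_000)
    jitter = Keyword.get(opts, :reconnect_jitter, 100)
    max = Keyword.get(opts, :max_attempts, 60)

    cond do
      # Stop once the configured number of attempts is used up.
      max != :infinity and attempt > max -> :give_up
      # No jitter requested: always wait the fixed period.
      jitter <= 0 -> {:retry, wait}
      # Add 0..jitter ms so many clients do not reconnect in lockstep.
      true -> {:retry, wait + :rand.uniform(jitter + 1) - 1}
    end
  end

  # Order in which servers are tried; shuffling can be turned off.
  def server_order(servers, randomize?) do
    if randomize?, do: Enum.shuffle(servers), else: servers
  end
end
```

A reconnect loop could call `next_delay/2` before each attempt and walk the result of `server_order/2`, which keeps the thrashing and thundering-herd concerns in one small, testable place.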

@mmmries Do you know of any others, or have comments on any of these?

mmmries · 2023-10-22T07:57:00Z

I had a conversation recently with @autodidaddict about the patterns in the current client, and we both agreed that our experience using the ConnectionSupervisor and ConsumerSupervisor patterns in several different production systems has resulted in very understandable tradeoffs.

The ConnectionSupervisor handles 1, 2, 3, 4, 5, 7, 8, and 10. The ConsumerSupervisor handles 11, and the client itself supports 14.

6 Allow reconnect boolean

This is essentially already supported. If you use Gnat.start_link you get a connection with no restarts enabled. If you use ConnectionSupervisor you can configure a list of servers and get automatic restarts. It's not a boolean flag, but I think it's well supported here.
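
For illustration, the two options look roughly like this. This is a sketch that assumes the gnat dependency is installed; the hosts, the `:gnat` name, and the backoff value are placeholders.

```elixir
# Option A: a bare connection. If this process dies, nothing restarts it.
{:ok, gnat} = Gnat.start_link(%{host: "127.0.0.1", port: 4222})
:ok = Gnat.pub(gnat, "greetings", "hello")

# Option B: a supervised connection. The supervisor works through the
# listed servers and re-establishes the connection registered as :gnat.
settings = %{
  name: :gnat,
  # milliseconds to wait between consecutive reconnect attempts
  backoff_period: 4_000,
  connection_settings: [
    %{host: "10.0.0.100", port: 4222},
    %{host: "10.0.0.101", port: 4222}
  ]
}

{:ok, _supervisor} = Gnat.ConnectionSupervisor.start_link(settings)
```

So "allow reconnect" is effectively chosen by which entry point you start, rather than by a flag on the connection.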

9 Discovering all available servers

I took a stab at this myself (in #54) and just couldn't figure out how to get the server to push cluster membership information back down to the client. I think doing this well would require some change to the Gnat connection and then when new cluster membership data is available, it would pass it back to the ConnectionSupervisor (if applicable) so the supervisor can dynamically keep track of the possible servers it might want to attempt in the future.

12 Buffer to hold published messages during a disconnection

I would love to add this. I think it should be a separate supervised process that acts as a publishing buffer and handles things like acknowledgements. You could start up the publisher a lot like the ConsumerSupervisor, where you tell it the name of the supervised connection and then give it some parameters like the maximum number of messages to buffer.
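
A minimal sketch of what such a buffer process might look like. Everything here is hypothetical: the module name, the `:publish_fun` and `:max_messages` options, and the `connection_down/1` and `connection_up/1` notifications are assumptions, and acknowledgements are left out entirely. The publish function is injected so the sketch runs without a NATS server; in practice it could be something like `&Gnat.pub(:gnat, &1, &2)`.

```elixir
defmodule PublishBufferSketch do
  @moduledoc "Hypothetical publish buffer for illustration; not part of Gnat."
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def publish(buffer, topic, payload),
    do: GenServer.call(buffer, {:publish, topic, payload})

  # Something watching the connection would call these two.
  def connection_down(buffer), do: GenServer.cast(buffer, :down)
  def connection_up(buffer), do: GenServer.cast(buffer, :up)

  @impl true
  def init(opts) do
    {:ok,
     %{
       publish_fun: Keyword.fetch!(opts, :publish_fun),
       max: Keyword.get(opts, :max_messages, 1_000),
       connected?: true,
       queue: :queue.new(),
       size: 0
     }}
  end

  @impl true
  # Connected: publish straight through.
  def handle_call({:publish, topic, payload}, _from, %{connected?: true} = state) do
    {:reply, state.publish_fun.(topic, payload), state}
  end

  # Disconnected and the buffer is full: tell the caller instead of growing.
  def handle_call({:publish, _topic, _payload}, _from, %{size: size, max: max} = state)
      when size >= max do
    {:reply, {:error, :buffer_full}, state}
  end

  # Disconnected with room left: hold the message until the connection is back.
  def handle_call({:publish, topic, payload}, _from, state) do
    queue = :queue.in({topic, payload}, state.queue)
    {:reply, {:ok, :buffered}, %{state | queue: queue, size: state.size + 1}}
  end

  @impl true
  def handle_cast(:down, state), do: {:noreply, %{state | connected?: false}}

  def handle_cast(:up, state) do
    # Flush buffered messages in the order they were accepted.
    for {topic, payload} <- :queue.to_list(state.queue) do
      state.publish_fun.(topic, payload)
    end

    {:noreply, %{state | connected?: true, queue: :queue.new(), size: 0}}
  end
end
```

Returning `{:ok, :buffered}` rather than `:ok` lets callers tell "sent" apart from "held for later", which is the distinction the acknowledgement item is asking for.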

13 Slow Consumers

This is a very interesting idea to me, but I have no good ideas of how to approach it. I think it could be handled in the ConsumerSupervisor where we could have an incoming queue with a limited number of parallel tasks. Currently we spawn a task for every message as soon as it shows up in our inbox. We could provide a limited size to this queue which would be a good way to detect that we're not keeping up with the demand of our subscription topics. Lots of good ideas here https://docs.nats.io/running-a-nats-service/nats_admin/slow_consumers#slow-consumers-identified-in-the-client
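
One way to picture the limited queue is as a small piece of bookkeeping in front of the task spawning. The sketch below is hypothetical (`BoundedInboxSketch` and its fields are not part of Gnat) and only tracks the counts; the caller would spawn a task whenever it gets `{:start, msg, state}` and call `done/1` when a task finishes.

```elixir
defmodule BoundedInboxSketch do
  @moduledoc "Hypothetical bookkeeping for a bounded inbox; not part of Gnat."

  defstruct max_in_flight: 10,
            max_pending: 100,
            in_flight: 0,
            pending: :queue.new(),
            pending_size: 0,
            dropped: 0

  def new(opts \\ []), do: struct!(__MODULE__, opts)

  # A message arrived. Start it now if a task slot is free.
  def offer(%__MODULE__{in_flight: n, max_in_flight: max} = s, msg) when n < max do
    {:start, msg, %{s | in_flight: n + 1}}
  end

  # All slots busy: queue the message while there is room.
  def offer(%__MODULE__{pending_size: n, max_pending: max} = s, msg) when n < max do
    {:queued, %{s | pending: :queue.in(msg, s.pending), pending_size: n + 1}}
  end

  # Queue full: we are a slow consumer. Count the drop so it can be reported.
  def offer(%__MODULE__{} = s, _msg) do
    {:dropped, %{s | dropped: s.dropped + 1}}
  end

  # A task finished. Hand its slot to the oldest queued message, if any.
  def done(%__MODULE__{} = s) do
    case :queue.out(s.pending) do
      {{:value, msg}, rest} ->
        {:start, msg, %{s | pending: rest, pending_size: s.pending_size - 1}}

      {:empty, _} ->
        {:idle, %{s | in_flight: s.in_flight - 1}}
    end
  end
end
```

A non-zero `dropped` count, or a `pending_size` that stays near `max_pending`, would be the signal that the consumer is not keeping up with its subscription topics.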

A unified / simplified client

You can probably tell from the comments above that we've been designing the Elixir client as a sort of collection of tools rather than a single interface. The user has to know the difference between calling Gnat.pub directly and calling the buffered publisher, for example.

I think it could be really valuable to have an approach similar to Ecto.Repo where the client application defines a module that calls use Gnat.Connection. This would be an all-in-one interface: it acts as a supervisor that starts a ConnectionSupervisor, ConsumerSupervisor, buffered publisher, etc. and provides a single unified interface. This would give our community an easy place to start while also making it possible to assemble these tools in other ways that meet their needs.

autodidaddict · 2023-10-22T13:15:47Z

I definitely like the idea of having a nice clean, unified interface. I think this might be a good opportunity to look at how we can get JetStream, KV, and object store all into this client as a cohesive client library like the other NATS clients.

We'd want to make sure that whatever we end up with is still something that can be easily declared in the children list of a supervisor/application, as well as started manually with start_link.

Given the potentially expanding responsibilities of this core module, maybe we'd use Gnat.Client rather than Gnat.Connection, where a Gnat.Connection is just one aspect of the overall client?

mmmries added the enhancement label Aug 23, 2019

QuinnWilton mentioned this issue Jun 22, 2021

Allow host to start before NATS wasmCloud/wasmcloud-otp#22

Closed

mmmries mentioned this issue Oct 23, 2023

Jetstream Support and Broadway Integration Officially #139

Closed
