Investigate other patterns for resilient consumers/connections #62

Open
mmmries opened this issue Jul 18, 2017 · 4 comments

@mmmries
Collaborator

mmmries commented Jul 18, 2017

The current implementations of ConnectionSupervisor and ConsumerSupervisor work, but I'm not sure they are idiomatic or ideal.

I had a conversation on the Elixir Forum where I got some feedback about other possible approaches to this problem.

@brandynbennett
Contributor

From the NATS docs:

All the client libraries maintained on the nats.io GitHub page will automatically attempt to re-connect if their current server connection gets disconnected for any reason. Upon re-connection the client library will automatically re-establish all the subscriptions, there is nothing for the application programmer to do.

It seems the Elixir library should handle reconnection to match what the NATS clients in other languages do. This is, however, a difficult problem with many edge cases.

@brandynbennett
Contributor

brandynbennett commented Jul 20, 2022

Resilient Connection Challenges

https://docs.nats.io/using-nats/developer/connecting/reconnect

  1. Discover a disconnection and reconnect automatically
  2. Reconnect to any available server in the cluster
  3. Connection wait time
  • Avoid thrashing by trying to reconnect over and over really fast
  4. Max connection attempts
  • This includes the server provided to the client's connect call, as well as the server the client discovered through another server.
  5. Size of reconnection buffer
  6. Allow reconnect boolean
  7. Reconnect jitter
  8. TLS Reconnect
  9. Discovering all available servers
  10. Randomize which server is connected to, to avoid the thundering herd
  • Allow randomization to be turned off
  11. Notify application of reconnection
  12. A buffer to hold published messages during a disconnection
  • Automatically publish those messages when the connection is reestablished
  • Configure the size of the buffer
  • Error when the buffer size is reached
  • Acknowledgements to ensure a message was actually delivered and not just buffered
  13. Slow Consumers
  • Limit incoming messages by count or bytes
  • Detect slow consumer
  • Notify of dropped messages
  14. No responders

@mmmries Do you know of any others, or have comments on any of these?
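
To make the wait time, max attempts, jitter and server randomization items from the list concrete, here is a minimal sketch of how the reconnect decision could be expressed as pure functions. The module name `Sketch.Reconnect` and the option names (`:reconnect_wait_ms`, `:jitter_ms`, `:max_attempts`) are hypothetical and not part of Gnat; the defaults are placeholders, not values taken from any NATS client.

```elixir
defmodule Sketch.Reconnect do
  # Hypothetical helper, not part of Gnat. It only decides *when* to retry
  # and *in which order* to try servers; it does not open any connection.

  # Returns {:retry_after, milliseconds} or :give_up for the given attempt number.
  def next(attempt, opts \\ []) when is_integer(attempt) and attempt >= 1 do
    wait = Keyword.get(opts, :reconnect_wait_ms, 2_000)
    jitter = Keyword.get(opts, :jitter_ms, 100)
    max_attempts = Keyword.get(opts, :max_attempts, 60)

    if attempt > max_attempts do
      :give_up
    else
      # A fixed wait avoids thrashing; the random jitter spreads out clients
      # that all lost the same server at the same moment.
      {:retry_after, wait + random_jitter(jitter)}
    end
  end

  # Randomizes the server order unless the caller turns randomization off.
  def server_order(servers, randomize? \\ true)
  def server_order(servers, true), do: Enum.shuffle(servers)
  def server_order(servers, false), do: servers

  defp random_jitter(0), do: 0
  # :rand.uniform(n) returns 1..n, so this yields 0..jitter inclusive.
  defp random_jitter(jitter) when jitter > 0, do: :rand.uniform(jitter + 1) - 1
end
```

A supervising process could call `Sketch.Reconnect.next/2` after each failed attempt and schedule the retry with `Process.send_after/3`, stopping once it gets `:give_up`.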

@mmmries
Collaborator Author

mmmries commented Oct 22, 2023

I had a conversation recently with @autodidaddict about the patterns in the current client, and we both agreed that our experiences of using the ConnectionSupervisor and ConsumerSupervisor patterns in several different production systems have resulted in very understandable tradeoffs.

The ConnectionSupervisor handles items 1, 2, 3, 4, 5, 7, 8 and 10. The ConsumerSupervisor handles 11, and the client itself supports 14.

6 Allow reconnect boolean

This is essentially already supported. If you use Gnat.start_link you get a connection with no restarts enabled. If you use ConnectionSupervisor you can configure a list of servers and get automatic restarts. It's not a boolean flag, but I think it's well supported here.

9 Discovering all available servers

I took a stab at this myself (in #54) and just couldn't figure out how to get the server to push cluster membership information back down to the client. I think doing this well would require some change to the Gnat connection and then when new cluster membership data is available, it would pass it back to the ConnectionSupervisor (if applicable) so the supervisor can dynamically keep track of the possible servers it might want to attempt in the future.

12 Buffer to hold published messages during a disconnection

I would love to add this. I think it should be a separate supervised process that acts as a publishing buffer and handles things like acknowledgements. You could start up the publisher a lot like the ConsumerSupervisor, where you tell it the name of the supervised connection and then give it some parameters like the maximum number of messages to buffer, etc.
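
As a rough sketch of the data structure such a publisher process might hold, here is a bounded buffer that rejects new messages once it is full and replays them in order on reconnect. `Sketch.PublishBuffer` and its functions are hypothetical names, not an existing Gnat module, and acknowledgements are deliberately left out.

```elixir
defmodule Sketch.PublishBuffer do
  # Hypothetical sketch, not part of Gnat: a size-limited FIFO of
  # {topic, payload} pairs held while the connection is down.
  defstruct queue: :queue.new(), size: 0, max_size: 1_000

  def new(max_size) when is_integer(max_size) and max_size > 0 do
    %__MODULE__{max_size: max_size}
  end

  # Refuse the message instead of growing without bound.
  def push(%__MODULE__{size: size, max_size: max}, _msg) when size >= max do
    {:error, :buffer_full}
  end

  def push(%__MODULE__{} = buf, {topic, payload} = msg)
      when is_binary(topic) and is_binary(payload) do
    {:ok, %{buf | queue: :queue.in(msg, buf.queue), size: buf.size + 1}}
  end

  # Calls publish_fun.(topic, payload) for each buffered message, oldest first,
  # and returns an empty buffer. Error handling during the replay is omitted.
  def flush(%__MODULE__{} = buf, publish_fun) when is_function(publish_fun, 2) do
    buf.queue
    |> :queue.to_list()
    |> Enum.each(fn {topic, payload} -> publish_fun.(topic, payload) end)

    %{buf | queue: :queue.new(), size: 0}
  end
end
```

In a real process the `publish_fun` would presumably wrap `Gnat.pub/3` on the named connection, and the flush would be triggered when the connection comes back.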

13 Slow Consumers

This is a very interesting idea to me, but I have no good ideas of how to approach it. I think it could be handled in the ConsumerSupervisor, where we could have an incoming queue with a limited number of parallel tasks. Currently we spawn a task for every message as soon as it shows up in our inbox. We could give this queue a limited size, which would be a good way to detect that we're not keeping up with the demand of our subscription topics. Lots of good ideas here: https://docs.nats.io/running-a-nats-service/nats_admin/slow_consumers#slow-consumers-identified-in-the-client
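
One way the limited queue could look, purely as a sketch: a small piece of state that admits up to `max_in_flight` messages, queues up to `max_pending` more, and counts everything beyond that as dropped. `Sketch.Inbox` and its option names are hypothetical and not how ConsumerSupervisor works today.

```elixir
defmodule Sketch.Inbox do
  # Hypothetical sketch, not part of Gnat: admission control for incoming
  # messages. The caller is expected to spawn a task on {:run, ...} results.
  defstruct pending: :queue.new(),
            pending_count: 0,
            in_flight: 0,
            dropped: 0,
            max_in_flight: 10,
            max_pending: 100

  def new(opts \\ []), do: struct!(__MODULE__, opts)

  # There is a free task slot: run the message right away.
  def offer(%__MODULE__{in_flight: n, max_in_flight: max} = inbox, _msg) when n < max do
    {:run, %{inbox | in_flight: n + 1}}
  end

  # All slots are busy, but there is room in the queue.
  def offer(%__MODULE__{pending_count: n, max_pending: max} = inbox, msg) when n < max do
    {:queued, %{inbox | pending: :queue.in(msg, inbox.pending), pending_count: n + 1}}
  end

  # The queue is full: we are a slow consumer, so count the drop.
  def offer(%__MODULE__{} = inbox, _msg) do
    {:dropped, %{inbox | dropped: inbox.dropped + 1}}
  end

  # Call when a task finishes. The freed slot goes to the oldest queued message,
  # so in_flight only decreases when nothing is waiting.
  def done(%__MODULE__{} = inbox) do
    case :queue.out(inbox.pending) do
      {{:value, msg}, rest} ->
        {{:run, msg}, %{inbox | pending: rest, pending_count: inbox.pending_count - 1}}

      {:empty, _} ->
        {:idle, %{inbox | in_flight: inbox.in_flight - 1}}
    end
  end
end
```

A non-zero `dropped` count, or a `pending_count` that stays near `max_pending`, would be the signal to notify the application that it is falling behind.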

A unified / simplified client

You can probably tell from the comments above that we've been designing the Elixir client as a sort of collection of tools rather than a single interface. The user has to know the difference between calling Gnat.pub directly and calling the buffered publisher, for example.

I think it could be really valuable to have an approach similar to Ecto.Repo, where the client application defines a module that calls use Gnat.Connection. This would be an all-in-one interface that acts as a supervisor, starts a ConnectionSupervisor, ConsumerSupervisor, buffered publisher, etc., and provides a single unified interface. This would give our community an easy place to start while also making it possible to assemble these tools in other ways that meet their needs.
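
A very rough sketch of the shape this could take, using a hypothetical `Sketch.Client` macro in place of the proposed Gnat.Connection. Nothing here exists in Gnat; the option names (`:name`, `:servers`, `:consumers`) are made up for illustration, the settings keys passed to ConnectionSupervisor and ConsumerSupervisor reflect my understanding of their current settings maps, and the `:rest_for_one` strategy is only one plausible choice.

```elixir
defmodule Sketch.Client do
  # Hypothetical macro, not part of Gnat. It only assembles child specs;
  # the Gnat modules are referenced by name and not called at compile time.
  defmacro __using__(opts) do
    quote bind_quoted: [opts: opts] do
      @client_opts opts

      # Child specs for the processes this client module would supervise.
      def children do
        name = Keyword.fetch!(@client_opts, :name)

        connection_settings = %{
          name: name,
          backoff_period: Keyword.get(@client_opts, :backoff_period, 4_000),
          connection_settings: Keyword.fetch!(@client_opts, :servers)
        }

        connection = %{
          id: {__MODULE__, :connection},
          start: {Gnat.ConnectionSupervisor, :start_link, [connection_settings, []]}
        }

        consumers =
          @client_opts
          |> Keyword.get(:consumers, [])
          |> Enum.with_index()
          |> Enum.map(fn {settings, index} ->
            # Every consumer is pointed at the same named connection.
            settings = Map.put(settings, :connection_name, name)

            %{
              id: {__MODULE__, :consumer, index},
              start: {Gnat.ConsumerSupervisor, :start_link, [settings, []]}
            }
          end)

        [connection | consumers]
      end

      # Lets the module be listed directly in a supervisor's children.
      def child_spec(_arg) do
        %{id: __MODULE__, type: :supervisor, start: {__MODULE__, :start_link, []}}
      end

      def start_link do
        Supervisor.start_link(children(), strategy: :rest_for_one, name: __MODULE__)
      end
    end
  end
end

defmodule MyApp.Nats do
  # MyApp.Handler is a placeholder for an application-defined consumer module.
  use Sketch.Client,
    name: :gnat,
    servers: [%{host: "127.0.0.1", port: 4222}],
    consumers: [
      %{module: MyApp.Handler, subscription_topics: [%{topic: "rpc.demo", queue_group: "demo"}]}
    ]
end
```

An application would then add `MyApp.Nats` to its supervision tree, and a buffered publisher could become one more entry in `children/0`.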

@autodidaddict
Collaborator

autodidaddict commented Oct 22, 2023

I definitely like the idea of having a nice clean, unified interface. I think this might be a good opportunity to look at how we can get JetStream, KV, and object store all into this client as a cohesive client library like the other NATS clients.

We'd want to make sure that whatever we end up with is still something that can be easily declared in the children list of a supervisor/application, as well as started manually with start_link.

Given the potentially expanding responsibilities of this core module, maybe we'd use Gnat.Client rather than Gnat.Connection, where a Gnat.Connection is just one aspect of the overall client?
