Set a better default for idle_connection_timeout #4912

Open
thomaseizinger opened this issue Nov 22, 2023 · 12 comments · May be fixed by #4967

@thomaseizinger
Contributor

As discussed yesterday in #4905.

Needs:

  • Decide on a better default (!= 0 seconds)
  • Update all examples (see the sketch below)
  • Add docs on the method explaining what it does and why
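
For illustration only, a minimal sketch of what the updated examples could look like, assuming the SwarmBuilder API and the with_idle_connection_timeout config method mentioned later in this thread; the ping behaviour and TCP transport here are placeholders:

```rust
use std::time::Duration;

use libp2p::{noise, ping, tcp, yamux, SwarmBuilder};

// Sketch: build a Swarm with a non-zero idle connection timeout.
// The builder chain assumes the libp2p 0.53-style SwarmBuilder.
fn build_swarm() -> Result<libp2p::Swarm<ping::Behaviour>, Box<dyn std::error::Error>> {
    let swarm = SwarmBuilder::with_new_identity()
        .with_tokio()
        .with_tcp(
            tcp::Config::default(),
            noise::Config::new,
            yamux::Config::default,
        )?
        .with_behaviour(|_keypair| ping::Behaviour::default())?
        // Override the 0-second default discussed in this issue,
        // e.g. with the 5 seconds proposed below.
        .with_swarm_config(|cfg| cfg.with_idle_connection_timeout(Duration::from_secs(5)))
        .build();
    Ok(swarm)
}
```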
@dariusc93
Contributor

A default of 10 seconds may be enough to accommodate the majority of protocols, but setting it to 30 seconds may be the best balance, in my opinion.

@thomaseizinger
Contributor Author

30 seconds seems like a lot to me. The main use case we talked about yesterday was being able to initiate protocols with peers from outside the Swarm based on events like Kademlia queries. Those would usually happen very fast, and a default timeout of 5 seconds would easily accommodate that.

Anything longer feels too application-specific but I don't have any data to back that up.

@thomaseizinger
Contributor Author

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

@thomaseizinger linked a pull request Dec 1, 2023 that will close this issue
@dariusc93
Contributor

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

That actually doesn't sound like a bad idea. Kind of curious to see how that would play out long term.

@thomaseizinger
Contributor Author

thomaseizinger commented Dec 1, 2023

An alternative would be to entirely remove auto-closing and instead emit an event notifying the user that the connection is idle. Would that perhaps be useful?

That actually doesn't sound like a bad idea. Kind of curious to see how that would play out long term.

Typically, the problem with these kinds of approaches is that, by default, users only handle the events they care about and _ => {} match the rest. So an event like ConnectionIsIdle will likely go unnoticed by many of them.

In this particular case though, that doesn't actually do any harm (until you run into connection limits, for example)! At that point, however, you are probably familiar enough with rust-libp2p to learn that this event exists, and you can take whatever action you prefer.

To make the simple case of a timeout easier, we could re-emit this event every X seconds and include how long the connection has already been idle. Then users wouldn't need to keep any state outside of the Swarm if they e.g. wanted to close connections after they have been idle for, say, 10 seconds. A rough sketch of this is below.
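
Purely hypothetical sketch of that idea; the ConnectionIsIdle variant, its idle_for field, and the periodic re-emission do not exist in rust-libp2p today, and the surrounding event loop is the usual StreamExt-based one:

```rust
use std::time::Duration;

use futures::StreamExt; // for swarm.next()
use libp2p::swarm::SwarmEvent;

loop {
    match swarm.next().await.expect("the swarm stream is infinite") {
        // Hypothetical: `ConnectionIsIdle` and `idle_for` are part of the
        // proposal, NOT of today's rust-libp2p API.
        SwarmEvent::ConnectionIsIdle { connection_id, idle_for } => {
            // Because the event would be re-emitted periodically with the
            // accumulated idle duration, no bookkeeping outside the Swarm
            // is needed to implement a simple timeout policy.
            if idle_for >= Duration::from_secs(10) {
                swarm.close_connection(connection_id);
            }
        }
        // The catch-all arm most users write; an idle connection silently
        // falls through here unless the event is handled explicitly.
        _ => {}
    }
}
```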

Curious to hear what @mxinden thinks.

@thomaseizinger
Contributor Author

I find it quite difficult to gauge what would be considered a bigger annoyance / "rust-libp2p"-ism:

  • Auto-closing connections when they are idle (regardless how long the timeout is)
  • Having to close idle connections yourself

From a library PoV, the latter is "better" because it gives the user more power over the policy on when to close connections. At the same time, not closing them at all might be a weird default?

We could add ConnectionEvent::StateChange { idle: bool }, which would allow us to add a module that can auto-close connections once they have been idle for some duration X.
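
Purely as a sketch of the shape of that variant (nothing like it exists in rust-libp2p today):

```rust
// Hypothetical: the proposed variant, not part of today's rust-libp2p API.
pub enum ConnectionEvent {
    // ... existing variants ...
    /// The connection transitioned between busy and idle. A separate module
    /// could track `idle: true` transitions and close the connection once
    /// it has been idle for a configured duration.
    StateChange { idle: bool },
}
```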

@DougAnderson444
Contributor

I'm curious how many resources open connections use. Are they cheap or expensive? I would think that would influence the decision a bit.

@thomaseizinger
Contributor Author

On @mxinden's kademlia-exporter node, a single connection uses about 300 kB. I'd expect an idle connection, even with more protocols enabled, not to consume much more than that, because the presence of buffered data would mean that the connection is not idle. Hence, all buffers should be empty on an idle connection, unless we have a memory leak somewhere.

In addition to that, an open connection consumes a file descriptor (or the equivalent, depending on OS).

@maqi
Contributor

maqi commented Dec 12, 2023

A couple of things might relate to this idle connection handling, which we observed on our large-scale network (over 4k nodes) using kad + gossipsub:
1. A connection that was opened (via get_closest_peer, put/get record ops, or by gossipsub) remains alive and does not get auto-closed, even with the idle connection timeout set via with_idle_connection_timeout.
My guess is: the connection was opened for a network query like get_closest, but since gossipsub periodically contacts connected peers, those open connections remain non-idle even after the original network query has completed.
2. Keeping a connection alive doesn't use much memory (around 50-150 KB per connection).
But combined with point 1, on a large-scale testnet we observed nodes holding over 3k connections and consuming 400 MB.
3. We had to deploy a mechanism that manually closes outdated connections (ones that have been established for a while) to peers not in the k-buckets; see the sketch after this list.
This reduces memory usage a lot, as the number of live connections is capped by the number of peers in the routing table.
4. Gossipsub turns out to rely on live connections (instead of local peer knowledge plus establishing a connection on request) to ensure a message reaches all subscribers.
We noticed no loss of topic messages when nodes keep all connections alive, but some topic messages are lost when closing non-RT connections.
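
For reference, a rough sketch of such a close-non-routing-table-peers mechanism, assuming a behaviour struct with a kademlia field (MyBehaviour and the field name are illustrative; kbuckets, connected_peers, and disconnect_peer_id are existing rust-libp2p APIs):

```rust
use std::collections::HashSet;

use libp2p::{PeerId, Swarm};

// Assumed: #[derive(NetworkBehaviour)] struct MyBehaviour { kademlia: kad::Behaviour<...>, ... }
fn close_non_rt_connections(swarm: &mut Swarm<MyBehaviour>) {
    // Collect every peer currently in the Kademlia routing table.
    let in_routing_table: HashSet<PeerId> = swarm
        .behaviour_mut()
        .kademlia
        .kbuckets()
        .flat_map(|bucket| {
            bucket
                .iter()
                .map(|entry| entry.node.key.preimage().clone())
                .collect::<Vec<_>>()
        })
        .collect();

    // Disconnect every connected peer that is not in the routing table.
    let connected: Vec<PeerId> = swarm.connected_peers().cloned().collect();
    for peer in connected {
        if !in_routing_table.contains(&peer) {
            let _ = swarm.disconnect_peer_id(peer);
        }
    }
}
```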

@dariusc93
Contributor

1. A connection that was opened (via get_closest_peer, put/get record ops, or by gossipsub) remains alive and does not get auto-closed, even with the idle connection timeout set via with_idle_connection_timeout.
My guess is: the connection was opened for a network query like get_closest, but since gossipsub periodically contacts connected peers, those open connections remain non-idle even after the original network query has completed.

To my knowledge, if the connected peers are subscribed to the same gossipsub topic as the node, their connection will be kept alive until they are no longer a part of the mesh (e.g., unsubscribed from the topic). Not sure if this is applicable to your use case though.

2. Keeping a connection alive doesn't use much memory (around 50-150 KB per connection).
But combined with point 1, on a large-scale testnet we observed nodes holding over 3k connections and consuming 400 MB.

Just curious: is this with a TCP-only or a TCP + QUIC setup?

@maqi
Contributor

maqi commented Dec 12, 2023

if the connected peers are subscribed to the same gossipsub topic as the node

No, most of the nodes are not subscribed to any topic; only 2% of the nodes are subscribed to the same topic,
i.e. with 4k nodes, only 80 nodes take part in gossipsub.
However, as I understand it, the gossipsub protocol does mention that it will broadcast to non-subscribed connected peers.

is this with just TCP or TCP + Quic setup?

just TCP

@thomaseizinger
Contributor Author

An active gossipsub stream will keep the connection alive even if the nodes aren't subscribed to the same topics. There isn't really much we can do about that. The best solution is to close those connections manually, as you are already doing.
