Version 2.4.0 upgraded to version 4.3.0 has a communication anomaly problem in 4G, but the old version has normal communication in 4G. #3015

Open
wulinnan opened this issue Nov 22, 2023 · 11 comments

Comments

@wulinnan

After upgrading from version 2.4.0 to 4.3.0, communication over 4G misbehaves. The terminal prints:

[2023/11/22 03:11:04:3514] N: rops_handle_POLLIN_netlink: DELADDR
[2023/11/22 03:11:04:8905] N: [wsicli|0|WS/h1/default/server_ip]: _lws_route_check_wsi: source client_ip gone

I'm not sure exactly what the problem is. At first I thought it was a 4G network issue, but then why don't the old libraries show this message in the terminal?
I checked the function rops_handle_POLLIN_netlink and it looks like this is where the problem appears. The old library does not seem to have it. Is my reasoning correct?
After this message appeared, I found that the event LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL was triggered, and immediately after that the ws context was destroyed. What exactly does this event mean, and what should I do when I encounter it? Should I re-create the context and create the connection again?

@lws-team
Member

This is to do with the new Netlink support. It's a good idea to use either the latest stable branch, v4.3-stable, or main to get bugfixes (since different kinds of network interface do different things on Netlink).

@wulinnan
Author

I've upgraded to the latest library, but this still occurs: the messages described above are printed every four or five minutes.
When this happens, I check the 4G signal strength and it has not changed, and I check the network port address and it has not changed either, so why are these two messages printed?
Can you give me a few scenarios in which these two functions print these two messages, so that I can troubleshoot the problem?
I presume this doesn't happen with older versions of the library because they lack the checks done by these two functions. However, I would still like a clear explanation of why these two functions print these two messages.
I implore you to answer my query!

@lws-team
Member

You're using main lws?

The netlink support in lws monitors changes to network interfaces, eg, addition, removal, adding interface address etc.

Each network device does different things in different orders according to its driver operations and, eg, dhcp or NetworkManager.

lws is trying to understand if routes for existing connections have changed, so it can immediately close them.

If you were fine without it, you can turn it off with cmake -DLWS_WITH_NETLINK=0

@wulinnan
Author

It is now turned off, because netlink is too sensitive to 4G network signal fluctuations. But another problem has arisen: after turning 4G off, it now takes 4-5 minutes for the client shutdown event to be triggered. While netlink was still on, it took only 1 minute.
Can you please explain more about this? Is it possible that when netlink is turned off, it doesn't give as much timely feedback about such fluctuations in the network?

@lws-team
Member

is it possible that when netlink is turned off, it doesn't give as much timely feedback about such fluctuations in the network?

Yes... that's why I did all the work with netlink to find out if any connection has had something the connection establishment relied on removed.

It's not 'fluctuations' but concrete client network apis being called to manage the connection. The exact sequence of netlink commands seen by the kernel (and lws) differs according to what the network interface is and what software is managing it. If you want to make it work with your network device (or whatever manages it) you need to find out from the logs what sequence of netlink events makes trouble.

For example, you mention DELADDR in your first comment. This is the removal of an address from an interface; if the connection establishment relied on that address, lws understands the connection has had it.

What else went on around that... does it have a quirk where it removes the address and adds it straight back? Or is lws justified in understanding the connection is dead?
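
To capture the sequence of netlink events independently of lws, a small standalone monitor can help. The following is a minimal sketch, assuming a Linux target with the standard rtnetlink headers; the function names rtm_name, open_rtnetlink_monitor and dump_events_once are made up for this illustration and are not part of lws. It subscribes to link, IPv4 address and IPv4 route notifications and prints one line per message.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Printable name for the rtnetlink message types of interest. */
const char *
rtm_name(unsigned short type)
{
	switch (type) {
	case RTM_NEWLINK:  return "NEWLINK";
	case RTM_DELLINK:  return "DELLINK";
	case RTM_NEWADDR:  return "NEWADDR";
	case RTM_DELADDR:  return "DELADDR";
	case RTM_NEWROUTE: return "NEWROUTE";
	case RTM_DELROUTE: return "DELROUTE";
	default:           return "OTHER";
	}
}

/*
 * Open a netlink socket subscribed to link, IPv4 address and IPv4 route
 * change notifications.  Returns the fd, or -1 on failure.
 */
int
open_rtnetlink_monitor(void)
{
	struct sockaddr_nl sa;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.nl_family = AF_NETLINK;
	sa.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV4_ROUTE;

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/*
 * Block for one datagram and print every message in it, one per line.
 * Returns the number of messages printed, or -1 on a receive error.
 */
int
dump_events_once(int fd)
{
	/* array of headers is used only to get suitably aligned storage */
	struct nlmsghdr buf[8192 / sizeof(struct nlmsghdr)];
	struct nlmsghdr *h;
	int count = 0, len;

	len = (int)recv(fd, buf, sizeof(buf), 0);
	if (len < 0)
		return -1;

	for (h = buf; NLMSG_OK(h, len); h = NLMSG_NEXT(h, len)) {
		if (h->nlmsg_type == RTM_NEWADDR ||
		    h->nlmsg_type == RTM_DELADDR) {
			/* address messages carry the interface index */
			struct ifaddrmsg *ifa = (struct ifaddrmsg *)NLMSG_DATA(h);

			printf("%s ifindex=%u\n", rtm_name(h->nlmsg_type),
			       ifa->ifa_index);
		} else
			printf("%s\n", rtm_name(h->nlmsg_type));
		count++;
	}

	return count;
}
```

A caller would open the monitor once and then call dump_events_once() in a loop while reproducing the problem. Comparing its output with the timestamps in the lws log shows what surrounds each DELADDR.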

@wulinnan
Author

I'm not sure exactly why, as I'm not in charge of the development of the 4G network. However, I tried the ping tool. When it is set to send a ping packet every 100ms, the ping fails at the moment the terminal prints DELADDR. But when it is set to send the next ping packet only after receiving the reply to the previous one, the ping is still normal when the terminal prints DELADDR.

@lws-team
Member

As I said...

If you want to make it work with your network device (or whatever manages it) you need to find out from the logs what sequence of netlink events makes trouble.

Yes there's a DELADDR and lws understands the connection is broken by that. Why (explained from the sequence of netlink events around that) is lws wrong? The DELADDR applies to an unrelated interface source address? The DELADDR is immediately countermanded by adding the address back? Please show the netlink sequence that lws acts wrongly against.
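
The two questions above can be asked of a captured event log mechanically. The following is a rough sketch only: the types, the function classify_deladdr and the grace_ms tolerance are all hypothetical names invented for this illustration, and it does not describe what lws does internally (lws closes affected connections immediately). It assumes the events are sorted by timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one address event taken from a captured log. */
enum addr_ev_type { EV_NEWADDR, EV_DELADDR };

struct addr_event {
	enum addr_ev_type type;
	uint64_t ms;       /* timestamp in milliseconds */
	const char *addr;  /* textual address, eg "10.0.0.2" */
};

enum deladdr_verdict {
	DELADDR_UNRELATED, /* removed address is not the connection's source */
	DELADDR_TRANSIENT, /* source removed, then re-added within the window */
	DELADDR_FATAL      /* source removed and not restored in time */
};

/*
 * Classify the DELADDR at index i of ev[] with respect to a connection
 * whose local source address is src.  grace_ms is an assumed tolerance
 * chosen by whoever analyzes the log; it is not an lws setting.
 */
enum deladdr_verdict
classify_deladdr(const struct addr_event *ev, size_t n, size_t i,
		 const char *src, uint64_t grace_ms)
{
	size_t k;

	assert(i < n && ev[i].type == EV_DELADDR);

	/* the deleted address is not the one the connection was made from */
	if (strcmp(ev[i].addr, src))
		return DELADDR_UNRELATED;

	/* look for the same address coming back inside the grace window */
	for (k = i + 1; k < n && ev[k].ms - ev[i].ms <= grace_ms; k++)
		if (ev[k].type == EV_NEWADDR && !strcmp(ev[k].addr, src))
			return DELADDR_TRANSIENT;

	return DELADDR_FATAL;
}
```

A DELADDR_UNRELATED or DELADDR_TRANSIENT result on a real capture would be the kind of evidence that shows lws acting wrongly; DELADDR_FATAL suggests lws was justified in closing the connection.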

I can't debug it since I don't have the device or the problem.

@wulinnan
Author

Thank you very much for your reply. The question of why 4G triggers DELADDR has been passed to the person responsible for the 4G module to investigate.

There is currently the following phenomenon. One way to cut off the network is this: the device is connected to a router, and the router is connected to the external network. If I suddenly disconnect the router from the external network, the terminal prints rops_handle_POLLIN_netlink: DELADDR, and only three minutes later are the LWS_CALLBACK_CLIENT_CLOSED and LWS_CALLBACK_WSI_DESTROY events triggered.

My question is why, in this case, there is a three-minute gap between disconnecting the network and the client shutdown and ws context destruction events. With the other method I used, disconnecting the device from the router, only one minute passed between disconnecting the network and the shutdown event. So I would like to know what causes this, whether it can be solved, and how I can properly explain this phenomenon to others.

@lws-team
Member

If you consider the typical situation that the outage happens one or more steps outside the device, there is no "event" at the device about that.

For example, if the ethernet cable is pulled out of the WAN side of the router the device is connected to, there is not even an ethernet PHY event or WLAN association loss at the device to represent that. So it is normal that the device must use timeouts / ping + pongs to discover it has lost downstream connectivity.

Not sure I understand your situation, but if your device somehow combines the router and its upstream interface, you may see a netlink event for the router's loss of its address (netlink reports what's happening over the whole system). But the device side has no knowledge that this event, on an interface other than the one its connections are routed through, impacts any of those connections.

@wulinnan
Author

So can I understand it this way: the disconnect takes three minutes because it takes about three minutes for the ping + pongs to realize the downstream connection is lost?

But on version 2.4.0, the same network disconnection scenario seems to take only a minute or so to detect the disconnection and trigger the shutdown event, so I'd like to venture a guess: has this timeout become longer since the library upgrade?

What do you think about this idea of mine?

@lws-team
Member

What do you think about this idea of mine?

A lot of things are getting mixed together here... lws sits on top of your platform network stack, and the tls stack, they also have their own transmits and timeouts and can force a connection closed if they don't get a response according to their own rules. Even intermediaries further along can enforce their own timeout, and the server side also gets to have an opinion and can issue its own ping probes.

Netlink just observes network configurations and states changing and reports them. Something else has to understand that a change of a certain kind on a specific interface impacts established connections and implies they should be closed. We no longer seem to be talking about netlink, but about changes in network state one hop away that can't be directly observed from the device. I have no idea if we should be discussing netlink or not.

If it's no longer about netlink, modern lws has some control over its ping / pong triggering. Take a look at ./minimal-examples-lowlevel/ws-client/minimal-ws-client-binance/main.c and how info.retry_and_idle_policy is set at context creation time.
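
The retry and idle policy type in lws, lws_retry_bo_t, has the members secs_since_valid_ping and secs_since_valid_hangup. As a rough model of how two such thresholds bound the time needed to notice a dead path, here is a standalone sketch. The struct idle_policy and the function idle_check are local stand-ins invented for this illustration, not lws types or lws code, and the values 3 and 10 mentioned below are only example numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in that borrows the names of the two idle-related members
 * of lws_retry_bo_t.  This is NOT the lws type.
 */
struct idle_policy {
	uint16_t secs_since_valid_ping;   /* idle secs before sending a PING */
	uint16_t secs_since_valid_hangup; /* idle secs before closing */
};

enum idle_action { IDLE_NONE, IDLE_SEND_PING, IDLE_HANGUP };

/*
 * Decide what to do for a connection that has seen no valid traffic for
 * idle_secs seconds.  Any received data, including a PONG, is assumed to
 * reset idle_secs to 0 in the caller.
 */
enum idle_action
idle_check(const struct idle_policy *p, uint32_t idle_secs)
{
	/* the PING must be given a chance before the hangup deadline */
	assert(p->secs_since_valid_ping < p->secs_since_valid_hangup);

	if (idle_secs >= p->secs_since_valid_hangup)
		return IDLE_HANGUP;
	if (idle_secs >= p->secs_since_valid_ping)
		return IDLE_SEND_PING;

	return IDLE_NONE;
}
```

With a policy of { 3, 10 } in this model, a connection whose upstream path silently disappears is closed about 10 seconds after the last valid traffic, regardless of whether the device sees any local network event.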
