Leaf nodes do not receive messages from remote nodes properly #3009
I suspect the authorization block for the servers receiving leafnode connections may be incorrect.
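For reference, an authorization block on the side accepting leafnode connections typically looks like the sketch below. The user, password, and port here are illustrative assumptions, not taken from the actual config; only the account name `SEEINER` appears elsewhere in this thread:

```
leafnodes {
  port: 7422
  authorization {
    # Hypothetical credentials required from incoming leafnode connections
    user: leaf
    password: s3cret
    # Bind incoming leafnode connections to this account
    account: SEEINER
  }
}
```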
I encountered this problem again. This time I tried deleting only the JetStream storage on the leaf node, and the cluster returned to normal. This is hard for me to understand: if it were an authorization configuration error, then I should never be able to receive messages from the remote node on the leaf node.
@LLLLimbo Would you mind sending some reports from the NATS cli tools (
The following information was obtained (from the leaf node) when the cluster was working properly.
This is all the connections currently on the leaf node:

```json
{
  "server_id": "NBSKY74RGOL5DQ32T77BU5J73GBL6MPGDDGUIJRYU54YQNABE2DMX6EE",
  "now": "2022-04-06T23:48:20.834615611Z",
  "num_connections": 4,
  "total": 4,
  "offset": 0,
  "limit": 1024,
  "connections": [
    {
      "cid": 7,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52932,
      "start": "2022-04-06T23:02:58.365921518Z",
      "last_activity": "2022-04-06T23:47:55.062696094Z",
      "rtt": "0s",
      "uptime": "45m22s",
      "idle": "25s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_uQGgAMVp"
    },
    {
      "cid": 21,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52938,
      "start": "2022-04-06T23:03:11.743418673Z",
      "last_activity": "2022-04-06T23:47:53.791111975Z",
      "rtt": "0s",
      "uptime": "45m9s",
      "idle": "27s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_od5c2EpB"
    },
    {
      "cid": 22,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52944,
      "start": "2022-04-06T23:03:13.318036621Z",
      "last_activity": "2022-04-06T23:48:10.594085382Z",
      "rtt": "0s",
      "uptime": "45m7s",
      "idle": "10s",
      "pending_bytes": 0,
      "in_msgs": 89,
      "out_msgs": 0,
      "in_bytes": 2181,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_1WQ7CrFJ"
    },
    {
      "cid": 23,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52948,
      "start": "2022-04-06T23:03:40.668665408Z",
      "last_activity": "2022-04-06T23:48:15.109230459Z",
      "rtt": "0s",
      "uptime": "44m40s",
      "idle": "5s",
      "pending_bytes": 0,
      "in_msgs": 88,
      "out_msgs": 0,
      "in_bytes": 2158,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_67cFBdz-"
    }
  ]
}
```
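As a quick sanity check on output like the above, the `/connz` payload can be inspected programmatically. A minimal Python sketch (working on a trimmed copy of the sample above, not a live server) that lists the MQTT client connections:

```python
import json

# Trimmed copy of the /connz sample above (only the fields we inspect).
CONNZ = """
{
  "num_connections": 4,
  "connections": [
    {"cid": 7,  "kind": "Client", "type": "mqtt", "mqtt_client": "DEVICE_uQGgAMVp"},
    {"cid": 21, "kind": "Client", "type": "mqtt", "mqtt_client": "DEVICE_od5c2EpB"},
    {"cid": 22, "kind": "Client", "type": "mqtt", "mqtt_client": "DEVICE_1WQ7CrFJ"},
    {"cid": 23, "kind": "Client", "type": "mqtt", "mqtt_client": "DEVICE_67cFBdz-"}
  ]
}
"""

def mqtt_clients(connz: dict) -> list:
    """Return the MQTT client IDs among the reported connections."""
    return [c["mqtt_client"] for c in connz["connections"]
            if c["kind"] == "Client" and c["type"] == "mqtt"]

data = json.loads(CONNZ)
print(len(mqtt_clients(data)))  # -> 4: all four connections are MQTT devices
```

In a real setup you would fetch the JSON from the server's monitoring port (here `http://192.168.3.131:8222/connz`) instead of embedding it.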
I am guessing that
There are indeed three MQTT streams
What are the server versions? Have you tried with the latest from main or a nightly build?
The version I am currently using is 2.7.3, but I had the same problem on 2.7.4.
You are normally using MQTT clients, but for the issue you are reporting, this is core NATS, subscribing on "test" (or any other subject). What made you try that? Did you notice communication issues that made you try the nats sub? In the screen capture, it seems like you are simply subscribing on "test", not on a queue, for which I know we fixed an issue related to queue groups and leafnodes (#2901). Anyway, thank you for all the details. I will try to reproduce, but it may take time if, as you reported, it does not happen right away.
In general, we only allow devices to connect to leaf nodes using MQTT. Ideally, our back-end application uses the NATS protocol to connect to a remote NATS node and then sends a message to a device, which receives it on the leaf node. That is when we ran into the problem described in this issue. We initially thought it was related to MQTT, but I still ran into the problem after trying to connect to the leaf node using a NATS client.
I will also continue to try to find the trigger conditions for the issue. |
@LLLLimbo I was not able to reproduce so far. What I would recommend is that you run with
Thank you for your assistance! I will try to capture the relevant information in the logs. However, the problem hasn't happened since I last manually cleaned the JetStream storage on the leaf nodes, so I need to keep waiting.
@kozlovic
@LLLLimbo I see the
@kozlovic Sorry, I forgot to state that what I did in the gif happened before I redeployed the entire cluster, so I'll try to collect a clean log again.
Could you give me a timeframe for when you created the SUB on "test" on the "edge" server, when you then tried to send (and from which server), and the message was not received?
This happened before I redeployed the whole cluster; I will try to reproduce and collect new logs for you.
@kozlovic
When I access the web monitor (http://192.168.3.131:8222/leafz) of 131 (i.e. the leaf node), it shows that the currently connected remote node is 136:

```json
{
  "server_id": "NB73QCEZBAJ4U2ASAMURWZICPZXXKTM456RJGCLJH46BL6HSRWAISQIU",
  "now": "2022-04-11T00:44:50.65748122Z",
  "leafnodes": 1,
  "leafs": [
    {
      "account": "SEEINER",
      "ip": "192.168.3.136",
      "port": 7422,
      "rtt": "1.65919ms",
      "in_msgs": 86,
      "out_msgs": 1244,
      "in_bytes": 14805,
      "out_bytes": 69153,
      "subscriptions": 58
    }
  ]
}
```
@LLLLimbo I am sorry, but downloading log-136.zip shows that this is the log of nats-edge, which is 131. The 134 and 135 logs correctly display logs for nodes nats-test-node-1 and nats-test-node-2 respectively, but the 136 log seems to be from the wrong server.
My fault, I have re-uploaded. |
When using subscriptions through import/exports, the server with a leafnode connection would properly send the interest over, but if the connection is recreated, this would not happen. In case of JetStream where that happens under the cover, message flow would stop after the leafnode restart because the subscriptions would be created on recovery of the JetStream assets but *before* the LeafNode connection could be established. Resolves #3024 Resolves #3027 Resolves #3009 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
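The failure mode described in this commit can be modeled abstractly: interest (subscriptions) registered before a leafnode connection exists is only visible remotely if the server replays existing interest when the connection is (re)established. The toy Python model below is purely illustrative of that ordering, not NATS code; the class, flag, and subject names are invented for this sketch:

```python
class LeafLink:
    """Toy model of a leafnode link that forwards subscription interest."""

    def __init__(self, replay_on_connect):
        self.replay_on_connect = replay_on_connect
        self.subs = set()          # local subscriptions (interest)
        self.remote_knows = set()  # interest the remote side has seen
        self.connected = False

    def subscribe(self, subject):
        self.subs.add(subject)
        if self.connected:
            self.remote_knows.add(subject)  # propagated immediately when up

    def connect(self):
        self.connected = True
        if self.replay_on_connect:
            # The fix: replay all pre-existing interest on (re)connect.
            self.remote_knows |= self.subs

    def disconnect(self):
        self.connected = False
        self.remote_knows = set()  # remote drops interest when the link goes down

# Buggy ordering: subscriptions are created during recovery, *before*
# the leafnode connection comes up, and are never sent afterwards.
buggy = LeafLink(replay_on_connect=False)
buggy.subscribe("orders.new")   # illustrative subject
buggy.connect()
print(buggy.remote_knows)       # -> set(): remote never learns the interest

fixed = LeafLink(replay_on_connect=True)
fixed.subscribe("orders.new")
fixed.connect()
print(fixed.remote_knows)       # -> {'orders.new'}
```

The same model covers the reconnect case: after `disconnect()` followed by `connect()`, only the replaying variant restores the remote's view of the interest.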
@LLLLimbo I believe the issue has to do with a leafnode restarting or recreating a connection, and it should be addressed by the PR that has just been merged. It should be available in the nightly today or in release v2.8.0, which should be coming out very shortly.
@kozlovic
@LLLLimbo I see that you did a config reload on 131 and auth users were reloaded. Could you describe what was changed? Is the repo https://github.com/LLLLimbo/nats-conf still a good place to see the previous config, and could you describe the change that was made? I could then try to reproduce and see if that is the reason for the issue.
@kozlovic I did not change the configuration
I saw this in 131's logs:
Will see if a reload prints that without any change, using your nats-conf repo. Thanks!
Yes, I have verified that this is printed even without configuration modification. I will investigate. Thank you for your patience. |
I'm sure no changes have been made to the configuration file. |
Thank you also for your continued interest in this issue! I will always cooperate if there is any need. |
Well, that's embarrassing: I can easily reproduce with the config reload. I guess a v2.8.1 will be needed soon. As a workaround, you can restart the server that had a "reload" signal issued; unlike the original issue, restarting the server will actually fix the interest propagation issue. Again, thank you for your patience and sorry for all the trouble.
@kozlovic
Weird: the test case I created for the original issue, when modified to do a config reload, does not show the problem. So I will need more time to investigate.
Final update from me today: I was actually able to reproduce. I had to modify the test to stop/restart the consumer too, not just the publisher.
When a configuration reload is done, the account's leafnode connections were not transferred to the new instance of the account, causing the interest to not be propagated until a leafnode reconnect or a server restart. Resolves #3009 Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
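This second fix addresses a different path to the same symptom: on config reload the server swaps in a fresh account instance, and if the leafnode connection list is not carried over to it, newly registered interest has no connections to be forwarded on. A toy Python sketch of that state transfer (illustrative only; the class and names here are invented, not the server's actual structures):

```python
class Account:
    """Toy model of an account that forwards interest over leafnode links."""

    def __init__(self, name):
        self.name = name
        self.leaf_conns = []  # leafnode connections bound to this account

    def propagate_interest(self, subject):
        # Forward new interest over every bound leafnode connection.
        return [(conn, subject) for conn in self.leaf_conns]

def config_reload(old, transfer_leafs):
    new = Account(old.name)  # reload creates a fresh account instance
    if transfer_leafs:
        # The fix: carry the existing leafnode connections over.
        new.leaf_conns = old.leaf_conns
    return new

old = Account("SEEINER")
old.leaf_conns = ["leaf-131"]

buggy = config_reload(old, transfer_leafs=False)
print(buggy.propagate_interest("test"))   # -> []: interest silently dropped

fixed = config_reload(old, transfer_leafs=True)
print(fixed.propagate_interest("test"))   # -> [('leaf-131', 'test')]
```

This also matches the observed workaround: a leafnode reconnect or server restart rebuilds the connection list on the current account instance, so interest flows again.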
Description
I started a NATS cluster with three remote nodes and one leaf node. They are configured as follows:
https://github.com/LLLLimbo/nats-conf/tree/main
At first they work fine: the leaf nodes receive messages correctly when I publish them to the remote nodes, and vice versa. But after running for a while, the leaf nodes stop receiving messages from the remote nodes correctly.
The problem may be related to JetStream: when I delete JetStream's storage files, the cluster returns to normal. 2022-04-12 correction: a simple restart restores communication between the remote node and the leaf node.
Environment