Leaf nodes do not receive messages from remote nodes properly #3009

Closed
LLLLimbo opened this issue Apr 6, 2022 · 36 comments · Fixed by #3031 or #3058

@LLLLimbo

LLLLimbo commented Apr 6, 2022

Description

I started a NATS cluster with three remote nodes and one leaf node. They are configured as follows:
https://github.com/LLLLimbo/nats-conf/tree/main
At first everything works fine: the leaf node receives messages correctly when I publish to the remote nodes, and vice versa. After running for a while, however, the leaf node stops receiving messages from the remote nodes.
(GIF screen capture of the failure attached.)

The problem may be related to JetStream: when I delete the JetStream storage files, the cluster goes back to normal.
2022-04-12 correction: a simple restart restores communication between the remote node and the leaf node.
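
For readers without access to the linked repo, here is a minimal sketch of this kind of hub/leaf topology. It is an illustrative assumption, not the actual configuration from nats-conf: server names, addresses, ports, and store paths are placeholders.

# hub side (one of the remote nodes, e.g. 192.168.3.136): accepts leafnode connections
server_name: nats-test-node-3
port: 4222
leafnodes {
  port: 7422
}
jetstream {
  store_dir: "/data/nats/jetstream"
}

# leaf side (e.g. 192.168.3.131, "nats-edge"): dials out to the hub and exposes MQTT
server_name: nats-edge
port: 4222
leafnodes {
  remotes [
    { url: "nats://192.168.3.136:7422" }
  ]
}
mqtt {
  port: 1883
}
jetstream {
  store_dir: "/data/nats/jetstream"
}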

Environment

  • nats-server version: 2.7.3
  • System: CentOS 7.6
@derekcollison
Member

I suspect the authorization block for the servers receiving leafnode connections may be incorrect.

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

I suspect the authorization block for the servers receiving leafnode connections may be incorrect.

I encountered this problem again. This time I tried deleting only the JetStream storage on the leaf node, and the cluster returned to normal. This is hard for me to understand: if it were an authorization configuration error, then the leaf node should never be able to receive messages from the remote nodes at all.

@kozlovic
Member

kozlovic commented Apr 6, 2022

@LLLLimbo Would you mind sending some reports from the NATS CLI tool (nats) so we can see whether you have streams defined, what their configuration looks like, etc.? I believe you are using MQTT too; do you have MQTT applications subscribing to "test"? Usually, how long does it take before the failure starts to occur? Were any servers restarted before it started to fail?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

@LLLLimbo Would you mind sending some reports from the NATS CLI tool (nats) so we can see whether you have streams defined, what their configuration looks like, etc.? I believe you are using MQTT too; do you have MQTT applications subscribing to "test"? Usually, how long does it take before the failure starts to occur? Were any servers restarted before it started to fail?

  1. I didn't define any JetStream streams.
  2. I don't have any MQTT clients subscribed to test.
  3. I can't give a specific time, but it is usually after a few hours.
  4. There was no reboot.

The following information was obtained (from the leaf node) while the cluster was working properly.

PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 account info
Connection Information:

               Client ID: 25
               Client IP: 192.168.9.251
                     RTT: 2.2613ms
       Headers Supported: true
         Maximum Payload: 1.0 MiB
       Connected Cluster: nats-edge
           Connected URL: nats://seeiner-edge:seeiner123@192.168.3.131:4222
       Connected Address: 192.168.3.131:4222
     Connected Server ID: NBSKY74RGOL5DQ32T77BU5J73GBL6MPGDDGUIJRYU54YQNABE2DMX6EE
   Connected Server Name: nats-edge

JetStream Account Information:

           Memory: 0 B of Unlimited
          Storage: 816 B of Unlimited
          Streams: 3 of Unlimited
        Consumers: 1 of Unlimited
PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 stream ls
No Streams defined, pass -a to include system streams

These are all the connections currently on the leaf node:

{
  "server_id": "NBSKY74RGOL5DQ32T77BU5J73GBL6MPGDDGUIJRYU54YQNABE2DMX6EE",
  "now": "2022-04-06T23:48:20.834615611Z",
  "num_connections": 4,
  "total": 4,
  "offset": 0,
  "limit": 1024,
  "connections": [
    {
      "cid": 7,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52932,
      "start": "2022-04-06T23:02:58.365921518Z",
      "last_activity": "2022-04-06T23:47:55.062696094Z",
      "rtt": "0s",
      "uptime": "45m22s",
      "idle": "25s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_uQGgAMVp"
    },
    {
      "cid": 21,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52938,
      "start": "2022-04-06T23:03:11.743418673Z",
      "last_activity": "2022-04-06T23:47:53.791111975Z",
      "rtt": "0s",
      "uptime": "45m9s",
      "idle": "27s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_od5c2EpB"
    },
    {
      "cid": 22,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52944,
      "start": "2022-04-06T23:03:13.318036621Z",
      "last_activity": "2022-04-06T23:48:10.594085382Z",
      "rtt": "0s",
      "uptime": "45m7s",
      "idle": "10s",
      "pending_bytes": 0,
      "in_msgs": 89,
      "out_msgs": 0,
      "in_bytes": 2181,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_1WQ7CrFJ"
    },
    {
      "cid": 23,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52948,
      "start": "2022-04-06T23:03:40.668665408Z",
      "last_activity": "2022-04-06T23:48:15.109230459Z",
      "rtt": "0s",
      "uptime": "44m40s",
      "idle": "5s",
      "pending_bytes": 0,
      "in_msgs": 88,
      "out_msgs": 0,
      "in_bytes": 2158,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_67cFBdz-"
    }
  ]
}

@kozlovic
Member

kozlovic commented Apr 6, 2022

I am guessing that nats stream ls -a will list the 3 MQTT streams, and nothing more, right?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

There are indeed three MQTT streams:

PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 stream ls -a
╭───────────────────────────────────────────────────────────────────────────────────╮
│                                      Streams                                      │
├─────────────┬─────────────┬─────────────────────┬──────────┬───────┬──────────────┤
│ Name        │ Description │ Created             │ Messages │ Size  │ Last Message │
├─────────────┼─────────────┼─────────────────────┼──────────┼───────┼──────────────┤
│ $MQTT_msgs  │             │ 2022-04-06 23:04:58 │ 0        │ 0 B   │ never        │
│ $MQTT_rmsgs │             │ 2022-04-06 23:04:58 │ 0        │ 0 B   │ never        │
│ $MQTT_sess  │             │ 2022-04-06 23:04:58 │ 4        │ 816 B │ 45m47s       │
╰─────────────┴─────────────┴─────────────────────┴──────────┴───────┴──────────────╯

@kozlovic
Member

kozlovic commented Apr 6, 2022

What are the server versions? Have you tried with the latest from main or nightly build?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

What are the server versions? Have you tried with the latest from main or nightly build?

The version I am currently using is 2.7.3, but I had the same problem on 2.7.4.

@kozlovic
Member

kozlovic commented Apr 7, 2022

You are normally using MQTT clients, but the issue you are reporting is core NATS: subscribing on "test" (or any other subject). What made you try that? Did you notice communication issues that made you try the nats sub? In the screen capture, it seems like you are simply subscribing on "test", not on a queue; I know we fixed an issue related to queue groups and leafnodes (#2901).

Anyway, thank you for all the details. Will try to reproduce but it may take time if, as you reported, it does not happen right away.
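
For context on the distinction drawn above, the two subscription styles look like this with the nats CLI (the subject and queue names are just examples):

# plain subscription: every subscriber receives its own copy of each message
nats sub test

# queue subscription: subscribers in the same queue group share the messages;
# this is the case the earlier leafnode fix (#2901) was about
nats sub --queue workers test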

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

In general, we only allow devices to connect to the leaf node using MQTT. Ideally, our back-end application uses the NATS protocol to connect to a remote NATS node and then sends a message to a device, which receives it on the leaf node. That is where we ran into the problem described in this issue. We initially thought it was related to MQTT, but I still ran into the problem after connecting to the leaf node with a plain NATS client.

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

I will also continue to try to find the trigger conditions for the issue.

@kozlovic
Member

kozlovic commented Apr 7, 2022

@LLLLimbo I was not able to reproduce so far. What I would recommend is that you run with -DV (debug and trace) and make sure that your logs are not limited (as they currently are, to 100MB), so that we capture the lack of interest propagation (if that is what is happening). Thank you for your cooperation. (I will still continue to see if I can reproduce, but it does not seem to be an obvious issue.)
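
As a rough sketch of what that recommendation looks like in practice (the file paths are illustrative; -DV and the options below are standard nats-server settings):

# start the server with debug and protocol tracing enabled
nats-server -c nats-edge.conf -DV

# or set the equivalent options in the config file; log_size_limit: 0
# removes the size cap so the relevant events are not rotated away
debug: true
trace: true
log_file: "/var/log/nats/nats-edge.log"
log_size_limit: 0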

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

@LLLLimbo I was not able to reproduce so far. What I would recommend is that you run with -DV (debug and trace) and make sure that your logs are not limited (as they currently are, to 100MB), so that we capture the lack of interest propagation (if that is what is happening). Thank you for your cooperation. (I will still continue to see if I can reproduce, but it does not seem to be an obvious issue.)

Thank you for your assistance! I will try to capture the relevant information in the logs. However, the problem hasn't happened since I last manually cleaned the JetStream storage on the leaf node, so I need to keep waiting.

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

@kozlovic
I have a log of the leaf node here; the first reboot occurred at 2022/04/07 07:02:54. The purpose of the reboot was to verify that deleting only the leaf node's JetStream storage would bring communication back to normal. Since this reboot, there have been no problems so far.
logs.zip

@kozlovic
Member

kozlovic commented Apr 7, 2022

@LLLLimbo I see the SUB test 1, the interest being sent, and then the message being received. So, as you say, everything is working fine here. If you can reproduce, I would need the logs of ALL servers, to see the interest propagation (or lack thereof), plus the log from the server the publisher publishes to.
I understand that it may take some time to get it to fail and the logs may become big, so if you have to recycle them (or keep the size limit) I understand; hopefully they will still cover whatever event initiated the issue.

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

@kozlovic
This is the full log: https://github.com/LLLLimbo/nats-conf/releases/tag/1.0
I also performed a manual reboot of the remote node, but the purpose of that reboot was to try to reproduce the issue.
The problem occurred between the time NATS started and 2022/04/07 07:02:54, when I restarted the leaf node.

Sorry, I forgot to state that what I did in the GIF happened before I redeployed the entire cluster, so I'll try to collect a clean log again.

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

"Delaying PING due to client activity" appears frequently in the logs of the leaf node; I wonder if this could be the cause.

@kozlovic
Member

kozlovic commented Apr 8, 2022

Could you give me a timeframe for when you created the SUB on test on the "edge" server and then tried to send a message (and from which server) that was not received?

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

Could you give me a timeframe for when you created the SUB on test on the "edge" server and then tried to send a message (and from which server) that was not received?

This happened before I redeployed the whole cluster; I will try to reproduce and collect new logs for you.

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

@kozlovic
I encountered the problem again, but this time it took two days.
You can see in the 131 log that I subscribed to downwards.device.msg.202204110827 at 2022-04-11 08:27:26, and in the 134 log that I published a message to that subject at 2022-04-11 08:27:28.
The cluster was fine until yesterday morning (you can see in the 131 log that I subscribed to downwards.device.msg at 2022/04/10 09:53:12).
https://github.com/LLLLimbo/nats-conf/releases/tag/2022-04-11

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

When I access the web monitor (http://192.168.3.131:8222/leafz) of 131 (i.e. the leaf node), it shows that the currently connected remote node is 136.

{
  "server_id": "NB73QCEZBAJ4U2ASAMURWZICPZXXKTM456RJGCLJH46BL6HSRWAISQIU",
  "now": "2022-04-11T00:44:50.65748122Z",
  "leafnodes": 1,
  "leafs": [
    {
      "account": "SEEINER",
      "ip": "192.168.3.136",
      "port": 7422,
      "rtt": "1.65919ms",
      "in_msgs": 86,
      "out_msgs": 1244,
      "in_bytes": 14805,
      "out_bytes": 69153,
      "subscriptions": 58
    }
  ]
}
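
One way to check whether the leaf node's interest actually reached the hub is to compare both sides through the HTTP monitoring endpoints. A sketch, assuming the monitor is enabled on port 8222 on both servers and using a subject from this report:

# leafnode connections as seen from the leaf node, including their subscriptions
curl "http://192.168.3.131:8222/leafz?subs=1"

# subscriptions known to the hub; the subject subscribed to on the leaf node
# should appear here once interest has been propagated
curl -s "http://192.168.3.136:8222/subsz?subs=1" | grep downwards.device.msg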

@kozlovic
Member

@LLLLimbo I am sorry, but downloading log-136.zip shows that it is the log of nats-edge, which is 131. The 134 and 135 archives correctly contain logs for nodes nats-test-node-1 and nats-test-node-2 respectively, but the 136 log seems to be from the wrong server.

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

@LLLLimbo I am sorry, but downloading log-136.zip shows that it is the log of nats-edge, which is 131. The 134 and 135 archives correctly contain logs for nodes nats-test-node-1 and nats-test-node-2 respectively, but the 136 log seems to be from the wrong server.

My fault, I have re-uploaded.
Please note that I did not clear the previous logs of 136 when redeploying the cluster (2022/04/08 12:38:49).

@LLLLimbo
Author

LLLLimbo commented Apr 12, 2022

@kozlovic
And a correction to my previous statement:
A simple restart restores communication between the remote node and the leaf node.
But this leads to another problem, so I submitted another issue: #3027.

kozlovic added a commit that referenced this issue Apr 13, 2022
When using subscriptions through import/exports, the server with
a leafnode connection would properly send the interest over, but
if the connection is recreated, this would not happen.

In case of JetStream where that happens under the cover, message
flow would stop after the leafnode restart because the subscriptions
would be created on recovery of the JetStream assets but *before*
the LeafNode connection could be established.

Resolves #3024
Resolves #3027
Resolves #3009

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic
Member

@LLLLimbo I believe that the issue has to do with a leafnode restarting or recreating a connection, and it should be addressed by the PR that has just been merged. It should be available in the nightly today, or in release v2.8.0, which should be coming out very shortly.

@LLLLimbo
Author

@LLLLimbo I believe that the issue has to do with a leafnode restarting or recreating a connection, and it should be addressed by the PR that has just been merged. It should be available in the nightly today, or in release v2.8.0, which should be coming out very shortly.

@kozlovic
After updating to version 2.8.0 I still encountered this problem, and I have re-uploaded the logs:
https://github.com/LLLLimbo/nats-logs/releases/tag/2022-04-20
You can see in the 131 log that I subscribed to downwards.device.msg.202204200812 on the leaf node at 2022/04/20 08:12:21.
In the 136 log, you can see that I sent a message to downwards.device.msg.202204200812 at 2022/04/20 08:12:25.

@kozlovic
Member

@LLLLimbo I see that you did a config reload on 131 and the auth users were reloaded. Could you describe what was changed? Is the repo https://github.com/LLLLimbo/nats-conf still a good place to see the previous config, and could you describe the change that was made? I could then try to reproduce and see if that is the reason for the issue.

@kozlovic kozlovic reopened this Apr 20, 2022
@LLLLimbo
Author

@kozlovic I did not change the configuration.

@kozlovic
Member

I saw this in 131's logs:

[25892] 2022/04/19 20:28:07.588347 [DBG] Trapped "hangup" signal
[25892] 2022/04/19 20:28:07.592906 [INF] Reloaded: authorization users
[25892] 2022/04/19 20:28:07.593009 [INF] Reloaded: accounts
[25892] 2022/04/19 20:28:07.593192 [INF] Reloaded: cluster

Will see if a reload prints that without any change using your nats-conf repo. Thanks!
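
For reference, that Trapped "hangup" signal line is what a config reload looks like in the log: it is triggered by sending the running server a SIGHUP (or the equivalent --signal form). A sketch, with the PID as a placeholder:

# send a reload signal to a running nats-server
kill -HUP <nats-server-pid>

# or, using the server binary itself
nats-server --signal reload=<nats-server-pid>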

@kozlovic
Member

Yes, I have verified that this is printed even without configuration modification. I will investigate. Thank you for your patience.

@LLLLimbo
Author

I'm sure no changes have been made to the configuration file.
So you can continue to use the configuration in the nats-conf repo.

@LLLLimbo
Author

Yes, I have verified that this is printed even without configuration modification. I will investigate. Thank you for your patience.

Thank you also for your continued interest in this issue! I will always cooperate if there is any need.

@kozlovic
Member

Well, that's embarrassing: I can easily reproduce it with the config reload. I guess a v2.8.1 will be needed soon. As a workaround, you can restart the server that had a "reload" signal issued; unlike the original issue, restarting the server will actually fix the interest propagation issue.

Again, thank you for your patience and sorry for all the troubles.
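
A sketch of that workaround, assuming the affected server runs under systemd (adapt to whatever supervises your nats-server processes; the log path is a placeholder):

# confirm that a config reload happened on the suspect server
grep "Reloaded:" /var/log/nats/nats-edge.log

# restart that server; with the earlier interest-propagation fix in place,
# a restart re-establishes the leafnode interest
systemctl restart nats-server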

@LLLLimbo
Author

@kozlovic
Thank you for your quick response!

@kozlovic
Member

Weird: the test case I created for the original issue, when modified to do a config reload, does not show the problem. So I will need more time to investigate.

@kozlovic
Member

Final update from me today: I was actually able to reproduce it. I had to modify the test to stop/restart the consumer too, not just the publisher.

kozlovic added a commit that referenced this issue Apr 20, 2022
When a configuration reload is done, the account's leaf node connections
were not transferred to the new instance of the account, causing the
interest to not be propagated until a leafnode reconnect or a server
restart.

Resolves #3009

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
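
Putting the thread together, a rough command-line reproduction of the scenario this fix addresses might look like the following. It is a sketch only: hub.conf, leaf.conf, the ports, and the subject are illustrative, and the real regression test lives in the server repository.

# 1. start a hub that accepts leafnode connections and a leaf node that connects to it
nats-server -c hub.conf &     # assume client port 4223, leafnode port 7422
nats-server -c leaf.conf &    # assume client port 4222, remote -> nats://127.0.0.1:7422

# 2. subscribe through the leaf node, publish through the hub: this works
nats --server nats://127.0.0.1:4222 sub test &
nats --server nats://127.0.0.1:4223 pub test "hello"

# 3. send a config reload (SIGHUP) to the leaf node, then stop and restart the subscriber
kill -HUP <leaf-node-pid>
# stop the nats sub started above, then start it again
nats --server nats://127.0.0.1:4222 sub test &

# 4. before the fix, messages published via the hub no longer reach the new
#    subscription: the reload dropped the account's leafnode connections, so the
#    new interest was never propagated to the hub
nats --server nats://127.0.0.1:4223 pub test "hello again"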