Leaf nodes do not receive messages from remote nodes properly #3009

Closed
LLLLimbo opened this issue Apr 6, 2022 · 36 comments · Fixed by #3031 or #3058

@LLLLimbo

LLLLimbo commented Apr 6, 2022

Description

I started a NATS cluster with three remote nodes and one leaf node. They are configured as follows:
https://github.com/LLLLimbo/nats-conf/tree/main
At first everything works fine: the leaf node receives messages correctly when I publish to the remote nodes, and vice versa. After running for a while, however, the leaf node stops receiving messages from the remote nodes.
(GIF screen capture of the failure attached.)

The problem may be related to JetStream: when I delete the JetStream storage files, the cluster goes back to normal.
2022-04-12 correction: a simple restart restores communication between the remote node and the leaf node.
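
For readers without access to the linked repo, here is a minimal sketch of this kind of hub/leaf topology. It is an illustrative assumption, not the actual configuration from nats-conf: server names, addresses, ports, and store paths are placeholders.

# hub side (one of the remote nodes, e.g. 192.168.3.136): accepts leafnode connections
server_name: nats-test-node-3
port: 4222
leafnodes {
  port: 7422
}
jetstream {
  store_dir: "/data/nats/jetstream"
}

# leaf side (e.g. 192.168.3.131, "nats-edge"): dials out to the hub and exposes MQTT
server_name: nats-edge
port: 4222
leafnodes {
  remotes [
    { url: "nats://192.168.3.136:7422" }
  ]
}
mqtt {
  port: 1883
}
jetstream {
  store_dir: "/data/nats/jetstream"
}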

Environment

  • nats-server version: 2.7.3
  • System: CentOS 7.6
@derekcollison
Member

I suspect the authorization block for the servers receiving leafnode connections may be incorrect.

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

I suspect the authorization block for the servers receiving leafnode connections may be incorrect.

I encountered this problem again. This time I tried deleting only the JetStream storage on the leaf node, and the cluster returned to normal. This is hard for me to understand: if it were an authorization configuration error, then the leaf node should never be able to receive messages from the remote nodes at all.

@kozlovic
Member

kozlovic commented Apr 6, 2022

@LLLLimbo Would you mind sending some reports from the NATS CLI tool (nats) so we can see whether you have streams defined, what their configuration looks like, etc.? I believe you are using MQTT too; do you have MQTT applications subscribing to "test"? Usually, how long does it take before the failure starts to occur? Were any servers restarted before it started to fail?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

@LLLLimbo Would you mind sending some reports from the NATS CLI tool (nats) so we can see whether you have streams defined, what their configuration looks like, etc.? I believe you are using MQTT too; do you have MQTT applications subscribing to "test"? Usually, how long does it take before the failure starts to occur? Were any servers restarted before it started to fail?

  1. I didn't define any JetStream streams.
  2. I don't have any MQTT clients subscribed to test.
  3. I can't give a specific time, but it is usually after a few hours.
  4. There was no reboot.

The following information was obtained (from the leaf node) while the cluster was working properly.

PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 account info
Connection Information:

               Client ID: 25
               Client IP: 192.168.9.251
                     RTT: 2.2613ms
       Headers Supported: true
         Maximum Payload: 1.0 MiB
       Connected Cluster: nats-edge
           Connected URL: nats://seeiner-edge:seeiner123@192.168.3.131:4222
       Connected Address: 192.168.3.131:4222
     Connected Server ID: NBSKY74RGOL5DQ32T77BU5J73GBL6MPGDDGUIJRYU54YQNABE2DMX6EE
   Connected Server Name: nats-edge

JetStream Account Information:

           Memory: 0 B of Unlimited
          Storage: 816 B of Unlimited
          Streams: 3 of Unlimited
        Consumers: 1 of Unlimited
PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 stream ls
No Streams defined, pass -a to include system streams

These are all the connections currently on the leaf node:

{
  "server_id": "NBSKY74RGOL5DQ32T77BU5J73GBL6MPGDDGUIJRYU54YQNABE2DMX6EE",
  "now": "2022-04-06T23:48:20.834615611Z",
  "num_connections": 4,
  "total": 4,
  "offset": 0,
  "limit": 1024,
  "connections": [
    {
      "cid": 7,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52932,
      "start": "2022-04-06T23:02:58.365921518Z",
      "last_activity": "2022-04-06T23:47:55.062696094Z",
      "rtt": "0s",
      "uptime": "45m22s",
      "idle": "25s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_uQGgAMVp"
    },
    {
      "cid": 21,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52938,
      "start": "2022-04-06T23:03:11.743418673Z",
      "last_activity": "2022-04-06T23:47:53.791111975Z",
      "rtt": "0s",
      "uptime": "45m9s",
      "idle": "27s",
      "pending_bytes": 0,
      "in_msgs": 91,
      "out_msgs": 0,
      "in_bytes": 2227,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_od5c2EpB"
    },
    {
      "cid": 22,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52944,
      "start": "2022-04-06T23:03:13.318036621Z",
      "last_activity": "2022-04-06T23:48:10.594085382Z",
      "rtt": "0s",
      "uptime": "45m7s",
      "idle": "10s",
      "pending_bytes": 0,
      "in_msgs": 89,
      "out_msgs": 0,
      "in_bytes": 2181,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_1WQ7CrFJ"
    },
    {
      "cid": 23,
      "kind": "Client",
      "type": "mqtt",
      "ip": "192.168.3.101",
      "port": 52948,
      "start": "2022-04-06T23:03:40.668665408Z",
      "last_activity": "2022-04-06T23:48:15.109230459Z",
      "rtt": "0s",
      "uptime": "44m40s",
      "idle": "5s",
      "pending_bytes": 0,
      "in_msgs": 88,
      "out_msgs": 0,
      "in_bytes": 2158,
      "out_bytes": 0,
      "subscriptions": 2,
      "mqtt_client": "DEVICE_67cFBdz-"
    }
  ]
}

@kozlovic
Member

kozlovic commented Apr 6, 2022

I am guessing that nats stream ls -a will list the 3 MQTT streams, and nothing more, right?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

There are indeed three MQTT streams:

PS C:\DevelopmentKit\nats\nats-cli> ./nats --server nats://seeiner-edge:seeiner123@192.168.3.131:4222 stream ls -a
╭───────────────────────────────────────────────────────────────────────────────────╮
│                                      Streams                                      │
├─────────────┬─────────────┬─────────────────────┬──────────┬───────┬──────────────┤
│ Name        │ Description │ Created             │ Messages │ Size  │ Last Message │
├─────────────┼─────────────┼─────────────────────┼──────────┼───────┼──────────────┤
│ $MQTT_msgs  │             │ 2022-04-06 23:04:58 │ 0        │ 0 B   │ never        │
│ $MQTT_rmsgs │             │ 2022-04-06 23:04:58 │ 0        │ 0 B   │ never        │
│ $MQTT_sess  │             │ 2022-04-06 23:04:58 │ 4        │ 816 B │ 45m47s       │
╰─────────────┴─────────────┴─────────────────────┴──────────┴───────┴──────────────╯

@kozlovic
Member

kozlovic commented Apr 6, 2022

What are the server versions? Have you tried with the latest from main or nightly build?

@LLLLimbo
Author

LLLLimbo commented Apr 6, 2022

What are the server versions? Have you tried with the latest from main or nightly build?

The version I am currently using is 2.7.3, but I had the same problem on 2.7.4.

@kozlovic
Member

kozlovic commented Apr 7, 2022

You are normally using MQTT clients, but the issue you are reporting is core NATS: subscribing on "test" (or any other subject). What made you try that? Did you notice communication issues that made you try the nats sub? In the screen capture, it seems like you are simply subscribing on "test", not on a queue; I know we fixed an issue related to queue groups and leafnodes (#2901).

Anyway, thank you for all the details. Will try to reproduce but it may take time if, as you reported, it does not happen right away.
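
For context on the distinction drawn above, the two subscription styles look like this with the nats CLI (the subject and queue names are just examples):

# plain subscription: every subscriber receives its own copy of each message
nats sub test

# queue subscription: subscribers in the same queue group share the messages;
# this is the case the earlier leafnode fix (#2901) was about
nats sub --queue workers test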

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

In general, we only allow devices to connect to the leaf node using MQTT. Ideally, our back-end application uses the NATS protocol to connect to a remote NATS node and then sends a message to a device, which receives it on the leaf node. That is where we ran into the problem described in this issue. We initially thought it was related to MQTT, but I still ran into the problem after connecting to the leaf node with a plain NATS client.

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

I will also continue to try to find the trigger conditions for the issue.

@kozlovic
Member

kozlovic commented Apr 7, 2022

@LLLLimbo I was not able to reproduce so far. What I would recommend is that you run with -DV (debug and trace) and make sure that your logs are not limited (as they currently are, to 100MB), so that we capture the lack of interest propagation (if that is what is happening). Thank you for your cooperation. (I will still continue to see if I can reproduce, but it does not seem to be an obvious issue.)
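
As a rough sketch of what that recommendation looks like in practice (the file paths are illustrative; -DV and the options below are standard nats-server settings):

# start the server with debug and protocol tracing enabled
nats-server -c nats-edge.conf -DV

# or set the equivalent options in the config file; log_size_limit: 0
# removes the size cap so the relevant events are not rotated away
debug: true
trace: true
log_file: "/var/log/nats/nats-edge.log"
log_size_limit: 0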

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

@LLLLimbo I was not able to reproduce so far. What I would recommend is that you run with -DV (debug and trace) and make sure that your logs are not limited (as they currently are, to 100MB), so that we capture the lack of interest propagation (if that is what is happening). Thank you for your cooperation. (I will still continue to see if I can reproduce, but it does not seem to be an obvious issue.)

Thank you for your assistance! I will try to capture the relevant information in the logs. However, the problem hasn't happened since I last manually cleaned the JetStream storage on the leaf node, so I need to keep waiting.

@LLLLimbo
Author

LLLLimbo commented Apr 7, 2022

@kozlovic
I have a log of the leaf node here; the first reboot occurred at 2022/04/07 07:02:54. The purpose of the reboot was to verify that deleting only the leaf node's JetStream storage would bring communication back to normal. Since this reboot, there have been no problems so far.
logs.zip

@kozlovic
Member

kozlovic commented Apr 7, 2022

@LLLLimbo I see the SUB test 1, the interest being sent, and then the message being received. So, as you say, everything is working fine here. If you can reproduce, I would need the logs of ALL servers, to see the interest propagation (or lack thereof), plus the log from the server the publisher publishes to.
I understand that it may take some time to get it to fail and the logs may become big, so if you have to recycle them (or keep the size limit) I understand; hopefully they will still cover whatever event initiated the issue.

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

@kozlovic
This is the full log: https://github.com/LLLLimbo/nats-conf/releases/tag/1.0
I also performed a manual reboot of the remote node, but the purpose of that reboot was to try to reproduce the issue.
The problem occurred between the time NATS started and 2022/04/07 07:02:54, when I restarted the leaf node.

Sorry, I forgot to state that what I did in the GIF happened before I redeployed the entire cluster, so I'll try to collect a clean log again.

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

"Delaying PING due to client activity" appears frequently in the logs of the leaf node; I wonder if this could be the cause.

@kozlovic
Member

kozlovic commented Apr 8, 2022

Could you give me a timeframe for when you created the SUB on test on the "edge" server and then tried to send a message (and from which server) that was not received?

@LLLLimbo
Author

LLLLimbo commented Apr 8, 2022

Could you give me a timeframe for when you created the SUB on test on the "edge" server and then tried to send a message (and from which server) that was not received?

This happened before I redeployed the whole cluster; I will try to reproduce and collect new logs for you.

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

@kozlovic
I encountered the problem again, but this time it took two days.
You can see in the 131 log that I subscribed to downwards.device.msg.202204110827 at 2022-04-11 08:27:26, and in the 134 log that I published a message to that subject at 2022-04-11 08:27:28.
The cluster was fine until yesterday morning (you can see in the 131 log that I subscribed to downwards.device.msg at 2022/04/10 09:53:12).
https://github.com/LLLLimbo/nats-conf/releases/tag/2022-04-11

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

When I access the web monitor (http://192.168.3.131:8222/leafz) of 131 (i.e. the leaf node), it shows that the currently connected remote node is 136.

{
  "server_id": "NB73QCEZBAJ4U2ASAMURWZICPZXXKTM456RJGCLJH46BL6HSRWAISQIU",
  "now": "2022-04-11T00:44:50.65748122Z",
  "leafnodes": 1,
  "leafs": [
    {
      "account": "SEEINER",
      "ip": "192.168.3.136",
      "port": 7422,
      "rtt": "1.65919ms",
      "in_msgs": 86,
      "out_msgs": 1244,
      "in_bytes": 14805,
      "out_bytes": 69153,
      "subscriptions": 58
    }
  ]
}
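
One way to check whether the leaf node's interest actually reached the hub is to compare both sides through the HTTP monitoring endpoints. A sketch, assuming the monitor is enabled on port 8222 on both servers and using a subject from this report:

# leafnode connections as seen from the leaf node, including their subscriptions
curl "http://192.168.3.131:8222/leafz?subs=1"

# subscriptions known to the hub; the subject subscribed to on the leaf node
# should appear here once interest has been propagated
curl -s "http://192.168.3.136:8222/subsz?subs=1" | grep downwards.device.msg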

@kozlovic
Member

@LLLLimbo I am sorry, but downloading log-136.zip shows that it is the log of nats-edge, which is 131. The 134 and 135 archives correctly contain logs for nodes nats-test-node-1 and nats-test-node-2 respectively, but the 136 log seems to be from the wrong server.

@LLLLimbo
Author

LLLLimbo commented Apr 11, 2022

@LLLLimbo I am sorry, but downloading log-136.zip shows that it is the log of nats-edge, which is 131. The 134 and 135 archives correctly contain logs for nodes nats-test-node-1 and nats-test-node-2 respectively, but the 136 log seems to be from the wrong server.

My fault, I have re-uploaded.
Please note that I did not clear the previous logs of 136 when redeploying the cluster (2022/04/08 12:38:49).

@LLLLimbo
Author

LLLLimbo commented Apr 12, 2022

@kozlovic
And a correction to my previous statement:
A simple restart restores communication between the remote node and the leaf node.
But this leads to another problem, so I submitted another issue: #3027.

kozlovic added a commit that referenced this issue Apr 13, 2022
When using subscriptions through import/exports, the server with
a leafnode connection would properly send the interest over, but
if the connection is recreated, this would not happen.

In case of JetStream where that happens under the cover, message
flow would stop after the leafnode restart because the subscriptions
would be created on recovery of the JetStream assets but *before*
the LeafNode connection could be established.

Resolves #3024
Resolves #3027
Resolves #3009

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic
Member

@LLLLimbo I believe that the issue has to do with a leafnode restarting or recreating a connection, and it should be addressed by the PR that has just been merged. It should be available in the nightly today, or in release v2.8.0, which should be coming out very shortly.

@LLLLimbo
Author

@LLLLimbo I believe that the issue has to do with a leafnode restarting or recreating a connection, and it should be addressed by the PR that has just been merged. It should be available in the nightly today, or in release v2.8.0, which should be coming out very shortly.

@kozlovic
After updating to version 2.8.0 I still encountered this problem, and I have re-uploaded the logs:
https://github.com/LLLLimbo/nats-logs/releases/tag/2022-04-20
You can see in the 131 log that I subscribed to downwards.device.msg.202204200812 on the leaf node at 2022/04/20 08:12:21.
In the 136 log, you can see that I sent a message to downwards.device.msg.202204200812 at 2022/04/20 08:12:25.

@kozlovic
Member

@LLLLimbo I see that you did a config reload on 131 and the auth users were reloaded. Could you describe what was changed? Is the repo https://github.com/LLLLimbo/nats-conf still a good place to see the previous config, and could you describe the change that was made? I could then try to reproduce and see if that is the reason for the issue.

@kozlovic kozlovic reopened this Apr 20, 2022
@LLLLimbo
Author

@kozlovic I did not change the configuration.

@kozlovic
Member

I saw this in 131's logs:

[25892] 2022/04/19 20:28:07.588347 [DBG] Trapped "hangup" signal
[25892] 2022/04/19 20:28:07.592906 [INF] Reloaded: authorization users
[25892] 2022/04/19 20:28:07.593009 [INF] Reloaded: accounts
[25892] 2022/04/19 20:28:07.593192 [INF] Reloaded: cluster

Will see if a reload prints that without any change using your nats-conf repo. Thanks!
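
For reference, that Trapped "hangup" signal line is what a config reload looks like in the log: it is triggered by sending the running server a SIGHUP (or the equivalent --signal form). A sketch, with the PID as a placeholder:

# send a reload signal to a running nats-server
kill -HUP <nats-server-pid>

# or, using the server binary itself
nats-server --signal reload=<nats-server-pid>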

@kozlovic
Member

Yes, I have verified that this is printed even without configuration modification. I will investigate. Thank you for your patience.

@LLLLimbo
Author

I'm sure no changes have been made to the configuration file.
So you can continue to use the configuration in the nats-conf repo.

@LLLLimbo
Author

Yes, I have verified that this is printed even without configuration modification. I will investigate. Thank you for your patience.

Thank you also for your continued interest in this issue! I will always cooperate if there is any need.

@kozlovic
Member

Well, that's embarrassing: I can easily reproduce it with the config reload. I guess a v2.8.1 will be needed soon. As a workaround, you can restart the server that had a "reload" signal issued; unlike the original issue, restarting the server will actually fix the interest propagation issue.

Again, thank you for your patience and sorry for all the troubles.
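
A sketch of that workaround, assuming the affected server runs under systemd (adapt to whatever supervises your nats-server processes; the log path is a placeholder):

# confirm that a config reload happened on the suspect server
grep "Reloaded:" /var/log/nats/nats-edge.log

# restart that server; with the earlier interest-propagation fix in place,
# a restart re-establishes the leafnode interest
systemctl restart nats-server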

@LLLLimbo
Author

@kozlovic
Thank you for your quick response!

@kozlovic
Member

Weird: the test case I created for the original issue, when modified to do a config reload, does not show the problem. So I will need more time to investigate.

@kozlovic
Member

Final update from me today: I was actually able to reproduce it. I had to modify the test to stop/restart the consumer too, not just the publisher.

kozlovic added a commit that referenced this issue Apr 20, 2022
When a configuration reload is done, the account's leaf node connections
were not transferred to the new instance of the account, causing the
interest to not be propagated until a leafnode reconnect or a server
restart.

Resolves #3009

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
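
Putting the thread together, a rough command-line reproduction of the scenario this fix addresses might look like the following. It is a sketch only: hub.conf, leaf.conf, the ports, and the subject are illustrative, and the real regression test lives in the server repository.

# 1. start a hub that accepts leafnode connections and a leaf node that connects to it
nats-server -c hub.conf &     # assume client port 4223, leafnode port 7422
nats-server -c leaf.conf &    # assume client port 4222, remote -> nats://127.0.0.1:7422

# 2. subscribe through the leaf node, publish through the hub: this works
nats --server nats://127.0.0.1:4222 sub test &
nats --server nats://127.0.0.1:4223 pub test "hello"

# 3. send a config reload (SIGHUP) to the leaf node, then stop and restart the subscriber
kill -HUP <leaf-node-pid>
# stop the nats sub started above, then start it again
nats --server nats://127.0.0.1:4222 sub test &

# 4. before the fix, messages published via the hub no longer reach the new
#    subscription: the reload dropped the account's leafnode connections, so the
#    new interest was never propagated to the hub
nats --server nats://127.0.0.1:4223 pub test "hello again"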