Class C messages intermittently fail to schedule with invalid absolute time #3487

virtualguy · 2020-11-17T03:11:16Z

Summary

Class C messages intermittently fail to schedule with "invalid absolute time set in application downlink" when closely spaced.

Is there a minimum time between messages to the same device? Ideally we would schedule them 130 ms apart, but we have tried increasing the spacing to 1000 ms to mitigate the issue.

Steps to Reproduce

Schedule downlink at current time + 7 sec to gateway X
Schedule downlink at current time + 8 sec to gateway X
Wait for 10 seconds and observe the MQTT topic DownlinkFailed
If required run in a loop to increase the chance of reproducing the error

We see this approximately once every 15 downlinks

What do you see now?

17/11/2020 14:09:30.783 +1300	Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:37.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0003582"}}]}}]}
17/11/2020 14:09:30.883 +1300	Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:38.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}]}}]}
17/11/2020 14:09:37.068 +1300	Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}
17/11/2020 14:09:37.068 +1300	Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}

What do you want to see instead?

No downlink failures and messages emitted from the gateway

Environment

The Things Stack for LoRaWAN: ttn-lw-stack
Version: 3.9.4
Build date: 2020-09-23T09:56:19Z
Git commit: c4be55c
Go version: go1.15.2
OS/Arch: linux/amd64

How do you propose to implement this?

Determine the causes for the invalid absolute time and resolve if there is a bug

How do you propose to test this?

Happy to test a PR in our dev environment

Can you do this yourself and submit a Pull Request?

@rvolosatovs ?


johanstokking · 2020-12-23T10:17:13Z

@virtualguy is this issue persisting?

What is the data rate and in which region are you?

When you say you want to transmit with 130 ms apart, are you taking the time-on-air into account?

Can you subscribe to the gateway and end device events, via $ ttn-lw-cli events ..., and paste the exact error messages that Gateway Server reports on why the scheduling fails?

johanstokking · 2021-01-22T12:49:07Z

@virtualguy I added some extra debug lines. Pick any of:

Apply patch investigate-3487.txt
Cherry-pick f9bbda5
Check-out https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing (based on v3.10.7)

Please grep the output for #3487 and paste it here. It's also the only output that goes to stdout (as logs go to stderr).

If it says in ScheduleAt that there are too few RTTs, you might want to increase TTN_LW_EXP_RTT_TTL (duration, like 6h for 6 hours, default is 30m for 30 minutes) and/or decrease TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT to 3 or something (default is 5). These are temporary feature flags.

cc @ymgupta

virtualguy · 2021-02-03T08:14:52Z

FYI, we are having some trouble running the binaries we have built, so we are blocked on #3736 before we can gather logs from your patch @johanstokking

kurtmc · 2021-02-04T03:41:41Z

@virtualguy @johanstokking Here are some logs from running the patched version: https://gist.github.com/kurtmc/75f4ecf93c2f7a1ee8373d3a9c7f181a

johanstokking · 2021-02-10T11:00:03Z

Thanks a lot. Apologies for the delayed response.

This is going to be super helpful. For finding the root cause, I added a few more log entries.

Can you run this again, with a new build, using https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing?

If you want to cherry-pick on v3.10.x, cherry-pick 50f5605 and d1e5305

Also for the output, I need the full trace, from the beginning.

Note that we're now printing the gateway ID (or EUI). If that is sensitive information, please redact that to another meaningful value or send the log via email.

johanstokking · 2021-02-12T19:02:46Z

@kurtmc @virtualguy let me know how we can help with setting this up. If I need to send you a binary or Docker image, just let me know.

kurtmc · 2021-02-14T22:36:59Z

@johanstokking Updated logs: https://gist.github.com/kurtmc/041dd593d24fd9f01784e56ec1deb325

johanstokking · 2021-02-15T08:57:10Z

Thanks @kurtmc. I don't see the "no absolute time" errors appearing here, only in the beginning but that is normal. So here, everything worked as expected, right?

@adriansmares the race for which synchronization is fixed with #3794 is actually happening here. So that is real, see:

"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Record: 30.629582ms at 2021-02-14 21:58:36.014795727 +0000 UTC m=+86.734546258"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"

Note that these statements are out-of-order because of underlying flushing; it looks like short statements are flushed immediately and longer statements are delayed.

I think that the issue is that when (*rtts).Record() releases the write lock, both concurrent (*rtts).Stats() calls acquire a read lock, and so two concurrent (*Scheduler).ScheduleAt() calls become exactly in sync. As those are acquiring (another) read lock, they happen concurrently, leading to corruption in Scheduler state.

This is fixed with #3794.

johanstokking · 2021-02-15T09:29:09Z

@virtualguy @kurtmc can you verify that this is resolved on latest v3.11?

Note the minor bump, you need to run DB migrations, see https://github.com/TheThingsNetwork/lorawan-stack/blob/e56f7f70e60dba8c1ad584411fb63a8c35659e7c/CHANGELOG.md#3110---2021-02-10

We'll be rolling a 3.11.1 release today.

johanstokking · 2021-02-15T09:29:34Z

Closed by #3794

johanstokking · 2021-02-15T10:38:43Z

@virtualguy @kurtmc FYI we're backporting this to 3.10.10 so we can update our infrastructure sooner. Please keep an eye on #3800 and/or subscribe to release notifications here.

kurtmc · 2021-02-15T23:46:53Z

@johanstokking Just to let you know, we have upgraded our production environment to 3.10.10 and we are still seeing the absolute time error in the logs.

johanstokking · 2021-02-16T08:50:23Z

@kurtmc it is expected in the following cases:

The gateway does not provide GPS timestamps, so there is no absolute time on the gateway and it must be calculated by Gateway Server
The gateway has not yet transmitted at least 5 downlink messages
The gateway has not yet confirmed at least 5 downlink messages (via TX acknowledgment)

We use the latency between scheduling the downlink message and receiving the TX acknowledgment as the round-trip time. We need at least 5 of them to reliably take the median value. Then, when scheduling a class C downlink message with absolute time, Gateway Server uses the server time and the median round-trip time to calculate the absolute (server) time and the corresponding concentrator timestamp.

If you see absolute time errors still, outside the cases above, please provide DEBUG level server logs.

virtualguy · 2021-02-23T09:59:38Z

Definitely still seeing this issue in 3.10.10. It looks like it happens on back-to-back transmissions (1000 ms apart on the same device ID). I have sent DEBUG logs through TTI support.

rvolosatovs self-assigned this Nov 30, 2020

rvolosatovs added this to the December 2020 milestone Nov 30, 2020

rvolosatovs added the c/network server This is related to the Network Server label Nov 30, 2020

rvolosatovs added c/gateway server This is related to the Gateway Server and removed c/network server This is related to the Network Server labels Dec 22, 2020

rvolosatovs assigned johanstokking and unassigned rvolosatovs Dec 22, 2020

johanstokking modified the milestones: December 2020, January 2021 Jan 4, 2021

johanstokking added the needs/details This is missing some details label Jan 5, 2021

johanstokking closed this as completed Jan 5, 2021

johanstokking removed the needs/details This is missing some details label Jan 22, 2021

johanstokking reopened this Jan 22, 2021

htdvisser modified the milestones: January 2021, February 2021 Feb 1, 2021

johanstokking added a commit that referenced this issue Feb 10, 2021

gs: Add debug prints for #3487

50f5605

johanstokking added a commit that referenced this issue Feb 10, 2021

gs: Add more debug prints for #3487

d1e5305

johanstokking mentioned this issue Feb 12, 2021

Gateway Server should wait before scheduling consecutive downlinks on a singular gateway #3334

Closed

johanstokking added bug Something isn't working needs/details This is missing some details labels Feb 12, 2021

johanstokking removed the needs/details This is missing some details label Feb 15, 2021

johanstokking closed this as completed Feb 15, 2021

ymgupta mentioned this issue Nov 3, 2021

Improve the Device Troubleshooting guide with no_absolute_gateway_time & scheduling conflict errors. TheThingsIndustries/lorawan-stack-docs#647

Closed
