NATS Deleting Recovered Stream as Orphaned #5382

Open
thorntonmc opened this issue May 2, 2024 · 5 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@thorntonmc

Observed behavior

NATS recovered messages for a stream on startup, but then deleted the stream as orphaned afterwards.

[1] 2024/05/02 14:25:12.795323 [INF]   Max Storage:     1000.00 GB
[1] 2024/05/02 14:25:12.795329 [INF]   Store Directory: "/nats/jetstream"
[1] 2024/05/02 14:25:12.795333 [INF] -------------------------------------------
[1] 2024/05/02 14:25:12.796142 [INF]   Starting restore for stream '$G > BR'
[1] 2024/05/02 14:25:12.822254 [INF]   Restored 771 messages for stream '$G > BR'
[1] 2024/05/02 14:25:12.822472 [INF]   Starting restore for stream '$G > LR'
[1] 2024/05/02 14:25:12.822902 [INF]   Restored 0 messages for stream '$G > LR'
[1] 2024/05/02 14:25:12.823062 [INF]   Starting restore for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823418 [INF]   Restored 0 messages for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823531 [INF]   Starting restore for stream '$G > OR'
[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.868300 [INF]   Recovering 3 consumers for stream - '$G > OEN'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'
[1] 2024/05/02 14:25:30.201230 [INF] Starting JetStream cluster
[1] 2024/05/02 14:25:30.201246 [INF] Creating JetStream metadata controller
[1] 2024/05/02 14:25:30.201507 [INF] JetStream cluster bootstrapping
[1] 2024/05/02 14:25:30.201980 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2024/05/02 14:25:30.202065 [WRN] Detected orphaned stream '$G > BR', will cleanup
[1] 2024/05/02 14:25:30.202342 [INF] Server is ready
[1] 2024/05/02 14:25:30.202457 [INF] Cluster name is gq-nats
[1] 2024/05/02 14:25:30.202537 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2024/05/02 14:25:30.208864 [ERR] Error trying to connect to route (attempt 1): lookup for host "gq-nats-0.gq-nats.generic-queue.svc.cluster.local": lookup gq-nats-0.gq-nats.generic-queue.svc.cluster.local on 10.96.0.10:53: no such host
[1] 2024/05/02 14:25:30.239648 [WRN] Detected orphaned stream '$G > LR', will cleanup
[1] 2024/05/02 14:25:30.240497 [WRN] Detected orphaned stream '$G > OEN', will cleanup
[1] 2024/05/02 14:25:30.243571 [WRN] Detected orphaned stream '$G > OR', will cleanup

Expected behavior

NATS should not delete the "OR" stream, as it and its consumers were recovered:

[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'

Server and client version

2.9.15

Host environment

Running as a Kubernetes statefulset.

Steps to reproduce

The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1

A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1 (a placement sketch follows the timeline below):

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled
[1] 2024/05/02 13:56:33.538942 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

This then drove us to restart the service.
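For reference on the trigger, here is a minimal sketch of creating a stream pinned to a single server by placement tag with the nats.go client. The stream name, subject, tag value, and URL are illustrative placeholders, not the exact configuration used here.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL for the cluster's client port.
	nc, err := nats.Connect("nats://gq-nats.generic-queue.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Placement.Tags pins the single replica to the server(s) carrying this tag.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "NEW_STREAM",
		Subjects:  []string{"new.>"},
		Replicas:  1,
		Storage:   nats.FileStorage,
		Placement: &nats.Placement{Tags: []string{"node:gq-nats-1"}},
	})
	if err != nil {
		log.Fatal(err)
	}
}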

@thorntonmc thorntonmc added the defect Suspected defect such as a bug or regression label May 2, 2024
@derekcollison
Member

Orphaned means the server could not find any meta assignment from the meta layer after syncing up.

@thorntonmc thorntonmc changed the title NATS Deleting Recovered Stream as "Orphaned" NATS Deleting Recovered Stream as Orphaned May 3, 2024
@thorntonmc
Author

thorntonmc commented May 3, 2024

Orphaned means the server could not find any meta assignment from the meta layer after syncing up.

Trying to understand this a bit more - here's the order of what happened:

  1. NATS restarts in a bad state
  2. The node in question comes up, sees messages and consumers from a stream called "OR", recovers them
  3. Doesn't see that stream in the meta layer, deletes the stream
  4. Later the same stream appears with 0 messages, on that same node.

If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?
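One way to see what the cluster currently reports for a stream, i.e. its assigned peers and message counts, is a StreamInfo lookup. A small sketch assuming a nats.go JetStream context obtained as in the earlier placement sketch; the helper name is illustrative.

// Assumes imports "log" and "github.com/nats-io/nats.go", and a JetStream
// context obtained as in the earlier sketch.
func reportStream(js nats.JetStreamContext, name string) error {
	// StreamInfo reflects the meta layer's current assignment plus the stream state.
	si, err := js.StreamInfo(name)
	if err != nil {
		return err
	}
	if si.Cluster != nil {
		log.Printf("leader=%q peers=%d", si.Cluster.Leader, len(si.Cluster.Replicas))
	}
	log.Printf("msgs=%d first_seq=%d last_seq=%d", si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)
	return nil
}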

@derekcollison
Member

Could you add some more information about "restarts in a bad state"?

@thorntonmc
Author

thorntonmc commented May 3, 2024

Could you add some more information about "restarts in a bad state"?

Here's the timeline. At 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs with what appears to be the node re-detecting every consumer for the stream; this happens several hundred times:

[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]

After the "new consumer" logs stop - wee see these errors:

[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]

Followed by repeated logging of the following:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

and

[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL

At this point the node is unavailable, as are all the streams located on it, which prompts the restart of the cluster using kubectl rollout restart.

@derekcollison
Member

It says it ran out of resources and shut down JetStream. We should address that first.
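A quick way to check how close the account is to its JetStream limits is the account info API. The sketch below assumes a nats.go JetStream context obtained as in the earlier sketches, and the helper name is illustrative; note that the server-wide Max Storage shown in the startup banner is configured separately in the server's jetstream block.

// Assumes imports "log" and "github.com/nats-io/nats.go", and a JetStream
// context obtained as in the earlier sketch.
func reportUsage(js nats.JetStreamContext) error {
	// Current usage versus account limits (-1 means unlimited at the account level).
	ai, err := js.AccountInfo()
	if err != nil {
		return err
	}
	log.Printf("memory=%d (max %d) storage=%d (max %d) streams=%d consumers=%d",
		ai.Memory, ai.Limits.MaxMemory, ai.Store, ai.Limits.MaxStore, ai.Streams, ai.Consumers)
	return nil
}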
