NATS Deleting Recovered Stream as Orphaned #5382

Open
thorntonmc opened this issue May 2, 2024 · 5 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@thorntonmc

Observed behavior

NATS recovered messages for a stream on startup, but then deleted the stream as orphaned afterwards.

[1] 2024/05/02 14:25:12.795323 [INF]   Max Storage:     1000.00 GB
[1] 2024/05/02 14:25:12.795329 [INF]   Store Directory: "/nats/jetstream"
[1] 2024/05/02 14:25:12.795333 [INF] -------------------------------------------
[1] 2024/05/02 14:25:12.796142 [INF]   Starting restore for stream '$G > BR'
[1] 2024/05/02 14:25:12.822254 [INF]   Restored 771 messages for stream '$G > BR'
[1] 2024/05/02 14:25:12.822472 [INF]   Starting restore for stream '$G > LR'
[1] 2024/05/02 14:25:12.822902 [INF]   Restored 0 messages for stream '$G > LR'
[1] 2024/05/02 14:25:12.823062 [INF]   Starting restore for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823418 [INF]   Restored 0 messages for stream '$G > OEN'
[1] 2024/05/02 14:25:12.823531 [INF]   Starting restore for stream '$G > OR'
[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.868300 [INF]   Recovering 3 consumers for stream - '$G > OEN'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'
[1] 2024/05/02 14:25:30.201230 [INF] Starting JetStream cluster
[1] 2024/05/02 14:25:30.201246 [INF] Creating JetStream metadata controller
[1] 2024/05/02 14:25:30.201507 [INF] JetStream cluster bootstrapping
[1] 2024/05/02 14:25:30.201980 [INF] Listening for client connections on 0.0.0.0:4222
[1] 2024/05/02 14:25:30.202065 [WRN] Detected orphaned stream '$G > BR', will cleanup
[1] 2024/05/02 14:25:30.202342 [INF] Server is ready
[1] 2024/05/02 14:25:30.202457 [INF] Cluster name is gq-nats
[1] 2024/05/02 14:25:30.202537 [INF] Listening for route connections on 0.0.0.0:6222
[1] 2024/05/02 14:25:30.208864 [ERR] Error trying to connect to route (attempt 1): lookup for host "gq-nats-0.gq-nats.generic-queue.svc.cluster.local": lookup gq-nats-0.gq-nats.generic-queue.svc.cluster.local on 10.96.0.10:53: no such host
[1] 2024/05/02 14:25:30.239648 [WRN] Detected orphaned stream '$G > LR', will cleanup
[1] 2024/05/02 14:25:30.240497 [WRN] Detected orphaned stream '$G > OEN', will cleanup
[1] 2024/05/02 14:25:30.243571 [WRN] Detected orphaned stream '$G > OR', will cleanup

Expected behavior

NATS should not delete the "OR" stream, as it and its consumers were recovered:

[1] 2024/05/02 14:25:29.868233 [INF]   Restored 447,917,984 messages for stream '$G > OR'
[1] 2024/05/02 14:25:29.870547 [INF]   Recovering 852 consumers for stream - '$G > OR'

Server and client version

2.9.15

Host environment

Running as a Kubernetes statefulset.

Steps to reproduce

The "OR" stream was unavailable at the time of restart. The OR stream runs on a single node - referred to here as gq-nats-1

A series of issues with NATS began after we created a new stream that was tag-located to gq-nats-1 (a placement sketch follows the timeline below):

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled
[1] 2024/05/02 13:56:33.538942 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'

This then drove us to restart the service.
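For reference on the trigger, here is a minimal sketch of creating a stream pinned to a single server by placement tag with the nats.go client. The stream name, subject, tag value, and URL are illustrative placeholders, not the exact configuration used here.

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL for the cluster's client port.
	nc, err := nats.Connect("nats://gq-nats.generic-queue.svc.cluster.local:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Placement.Tags pins the single replica to the server(s) carrying this tag.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "NEW_STREAM",
		Subjects:  []string{"new.>"},
		Replicas:  1,
		Storage:   nats.FileStorage,
		Placement: &nats.Placement{Tags: []string{"node:gq-nats-1"}},
	})
	if err != nil {
		log.Fatal(err)
	}
}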

@thorntonmc thorntonmc added the defect Suspected defect such as a bug or regression label May 2, 2024
@derekcollison
Member

Orphaned means the server could not find any meta assignment from the meta layer after syncing up.

@thorntonmc thorntonmc changed the title NATS Deleting Recovered Stream as "Orphaned" NATS Deleting Recovered Stream as Orphaned May 3, 2024
@thorntonmc
Author

thorntonmc commented May 3, 2024

Orphaned means the server could not find any meta assignment from the meta layer after syncing up.

Trying to understand this a bit more - here's the order of what happened:

  1. NATS restarts in a bad state
  2. The node in question comes up, sees messages and consumers from a stream called "OR", recovers them
  3. Doesn't see that stream in the meta layer, deletes the stream
  4. Later the same stream appears with 0 messages, on that same node.

If the stream doesn't exist in the meta layer after syncing up - why does the stream appear on the same node moments later?
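One way to see what the cluster currently reports for a stream, i.e. its assigned peers and message counts, is a StreamInfo lookup. A small sketch assuming a nats.go JetStream context obtained as in the earlier placement sketch; the helper name is illustrative.

// Assumes imports "log" and "github.com/nats-io/nats.go", and a JetStream
// context obtained as in the earlier sketch.
func reportStream(js nats.JetStreamContext, name string) error {
	// StreamInfo reflects the meta layer's current assignment plus the stream state.
	si, err := js.StreamInfo(name)
	if err != nil {
		return err
	}
	if si.Cluster != nil {
		log.Printf("leader=%q peers=%d", si.Cluster.Leader, len(si.Cluster.Replicas))
	}
	log.Printf("msgs=%d first_seq=%d last_seq=%d", si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)
	return nil
}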

@derekcollison
Member

Could you add some more information about "restarts in a bad state"?

@thorntonmc
Author

thorntonmc commented May 3, 2024

Could you add some more information about "restarts in a bad state"?

Here's the timeline. At 13:56 UTC a new stream is created using the NATS client, bound to that node (gq-nats-1). We then notice these logs with what appears to be the node re-detecting every consumer for the stream; this happens several hundred times:

[1] 2024/05/02 13:56:20.039173 [INF] JetStream cluster new consumer leader for '$G > OR > [redacted]

After the "new consumer" logs stop - wee see these errors:

[1] 2024/05/02 13:56:33.208975 [INF] JetStream Shutdown
[1] 2024/05/02 13:56:33.069191 [INF] Initiating JetStream Shutdown...
[1] 2024/05/02 13:56:33.039035 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: no message found
[1] 2024/05/02 13:56:33.038791 [ERR] JetStream out of resources, will be DISABLED
[1] 2024/05/02 13:56:33.038668 [WRN] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error loading 2 index: malformed or corrupt message
[1] 2024/05/02 13:56:33.038632 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Critical write error: malformed or corrupt message
[1] 2024/05/02 13:56:31.931646 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.903391 [INF] Transfer of stream leader for '$G > BR' to 'gq-nats-2'
[1] 2024/05/02 13:56:27.402715 [WRN] Internal subscription on "$JS.API.STREAM.INFO.BR" took too long: 3.508628383s
[1] 2024/05/02 13:56:27.402701 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:27.402561 [INF] JetStream cluster new stream leader for '$G > BR'
[1] 2024/05/02 13:56:27.402500 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:24.593972 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.593812 [INF] Transfer of consumer leader for '$G > BR > br_8ed157d3-8d9b-4fb0-b9bf-80008c3f176e-backup-manuell^b15101d1-01b8-480b-bd58-3ad0dd1d527b-backup-manuell-replication^prod8' to 'gq-nats-2'
[1] 2024/05/02 13:56:24.363958 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.863979 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:23.364262 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.864354 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:22.363929 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.863344 [INF] Scaling down '$G > LR' to [gq-nats-1]
[1] 2024/05/02 13:56:21.363622 [INF] Scaling down '$G > LR' to [gq-nats-1]

Followed by repeated logging of the following:

[1] 2024/05/02 13:56:55.927633 [WRN] JetStream cluster stream '$G > BR' has NO quorum, stalled

and

[1] 2024/05/02 13:56:33.533104 [ERR] RAFT [RhnJXf0c - S-R1F-dLrXFs2V] Got an error apply commit for 2: raft: could not load entry from WAL

At this point the node is unavailable, as are all the streams located on it, which prompts the restart of the cluster using kubectl rollout restart.

@derekcollison
Member

It says it ran out of resources and shut down JetStream. We should address that first.
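A quick way to check how close the account is to its JetStream limits is the account info API. The sketch below assumes a nats.go JetStream context obtained as in the earlier sketches, and the helper name is illustrative; note that the server-wide Max Storage shown in the startup banner is configured separately in the server's jetstream block.

// Assumes imports "log" and "github.com/nats-io/nats.go", and a JetStream
// context obtained as in the earlier sketch.
func reportUsage(js nats.JetStreamContext) error {
	// Current usage versus account limits (-1 means unlimited at the account level).
	ai, err := js.AccountInfo()
	if err != nil {
		return err
	}
	log.Printf("memory=%d (max %d) storage=%d (max %d) streams=%d consumers=%d",
		ai.Memory, ai.Limits.MaxMemory, ai.Store, ai.Limits.MaxStore, ai.Streams, ai.Consumers)
	return nil
}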
