
Joining new nodes to recovered vault after quorum lost causes "storage.raft.snapshot: failed to move snapshot into place (Invalid handle)" in Win 10 #12116

Closed
ferAbleTech opened this issue Jul 19, 2021 · 9 comments
Assignees
Labels
bug (Used to indicate a potential bug), os/windows, storage/raft

Comments

@ferAbleTech

ferAbleTech commented Jul 19, 2021

Describe the bug
I followed this tutorial to create a Vault cluster with Raft as storage. After simulating an outage in which all 3 nodes were lost and recovering a single node using peers.json, I tried joining new nodes to the recovered node. However, after running the command (vault operator raft join http:...), its console periodically throws these errors:

[ERROR] storage.raft: failed to send snapshot to: peer="{Non_voter vault_3 127.0.0.1:8401}" error="sync vault\raft\snapshots: Handle non valido." (the last part is in Italian, the same language as the OS; it means Invalid handle)
[ERROR] storage.raft: failed to get log: index=1 error="log not found"
[ERROR] storage.raft: failed to install snapshot: id=bolt-snapshot error="sync vault\raft\snapshots: Handle non valido."

The joining node console throws these errors:

[INFO] storage.raft.snapshot: creating new snapshot: path=vault\raft\snapshots\......
[ERROR] storage.raft.snapshot: failed to move snapshot into place: error="sync vault\raft\snapshots: Handle non valido"
[ERROR] storage.raft.snapshot: failed to finalize snapshot: error="sync vault\raft\snapshots: Handle non valido"
[INFO] storage.raft.snapshot: reaping snapshot: path=vault\raft\snapshots\.......

This behaviour only happens on Windows; it doesn't happen in the online environment offered in the tutorial.

To Reproduce
Steps to reproduce the behavior:
Follow the steps in the tutorial up to "Retry Join" to create a cluster of 3 nodes (plus a server for auto-unseal using the Transit secrets engine).
Stop all nodes in the cluster.
Recover vault_2 using the peers.json method.
Try joining a new node to vault_2.
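For reference, the peers.json used in step 3 looks something like this (the values here are assumed from the vault_2 config shown further down; the file goes in the node's raft data directory before restarting it):

```json
[
  {
    "id": "vault_2",
    "address": "127.0.0.1:8301",
    "non_voter": false
  }
]
```

On startup, the node reads this file to rebuild its raft peer set as a single-node cluster, then deletes it.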

As this is a bit tedious to reproduce on Windows, since the automated script offered in the tutorial only works on Linux, I made a semi-automated equivalent using .bat files:
https://drive.google.com/file/d/1GHbNmBG0niRkIYB6Qc4KVHdGjPrqPPIi/view?usp=sharing
Follow the README to simulate the error.

Expected behavior
The new node successfully joins the cluster.

Environment:

  • Vault Server Version: 1.7.3
  • Vault CLI Version: Vault v1.7.3 (5d517c8)
  • Server Operating System/Architecture: Windows 10 x64 20H2

Vault server configuration file(s):

Autounseal vault:

storage "raft" {
  path    = "./vault"
  node_id = "vault_1"
}

listener "tcp" {
  address         = "127.0.0.1:8200"
  cluster_address = "127.0.0.1:8201"
  tls_disable     = true
}

disable_mlock = true
cluster_addr  = "http://127.0.0.1:8201"
api_addr      = "http://127.0.0.1:8200"

Cluster nodes (with different api/cluster ports):

storage "raft" {
  path    = "./vault"
  node_id = "vault_2"
}

listener "tcp" {
  address         = "127.0.0.1:8300"
  cluster_address = "127.0.0.1:8301"
  tls_disable     = true
}

seal "transit" {
  address         = "http://127.0.0.1:8200"
  # token is read from VAULT_TOKEN env
  # token         = ""
  disable_renewal = "false"
  key_name        = "autounseal"
  mount_path      = "transit/"
  tls_skip_verify = "true"
}

disable_mlock = true
cluster_addr  = "http://127.0.0.1:8301"
api_addr      = "http://127.0.0.1:8300"
ui            = true
@ncabatoff
Collaborator

Hi @ferAbleTech,

Just a hunch, but could you try to reproduce this with a fully-qualified path in your storage "raft" config stanza? I.e. instead of path = "./vault", use path = "/path/to/vault", or whatever the appropriate absolute path syntax is for your Windows setup.
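Concretely, a fully-qualified path on Windows would look something like this (the directory shown is illustrative, not the reporter's actual path):

```hcl
storage "raft" {
  path    = "C:/Work/KeyVault/server_2/vault"
  node_id = "vault_2"
}
```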

@ferAbleTech
Author

ferAbleTech commented Jul 19, 2021

Hi, I tried to do so, both with a forward slash / and an escaped backslash \, but the same error still happens:

[ERROR] storage.raft: failed to send snapshot to: peer="{Nonvoter vault_3 127.0.0.1:8401}" error="sync C:\Work\KeyVault\cluster_with_autounseal - Copia\server_3\vault\raft\snapshots: Handle non valido."

@ncabatoff ncabatoff added the bug (Used to indicate a potential bug), core/raft, and os/windows labels Jul 19, 2021
@ncabatoff
Collaborator

Well, that's unfortunate. I should've expected as much: it seems like the rename itself worked, and it's the subsequent fsync that's returning the error. Thanks for the bug report, we'll look into it.

@hsimon-hashicorp hsimon-hashicorp self-assigned this Jul 19, 2021
@hsimon-hashicorp
Contributor

Hi @ferAbleTech! Are you using cygwin or a similar emulated shell? There have been issues reported with the way these shells interact with Windows. I've seen discussions where using winpty to prefix the command solves the issue, like so: #4946
Also, you may be able to use PowerShell in admin mode, depending on the command and how you're executing it. Let me know if you have more problems. Thanks!

@ferAbleTech
Author

Hi @hsimon-hashicorp, I don't think it's related as I'm using the native shell.
The strange thing about the command (vault operator raft join...) is that when forming the cluster for the first time it works normally, but after forming a cluster by joining the recovered node it throws those errors.

Also, even using PowerShell in admin mode to execute the command gives the same results.

@ncabatoff
Collaborator

The strange thing about the command (vault operator raft join...) is that when forming the cluster for the first time it works normally, but after forming a cluster by joining the recovered node it throws those errors.

Yeah, it's only when restoring a snapshot that we do renames, to try to move the new file atomically into place. And it turns out that's... not possible on Windows?

https://github.com/google/renameio says:

It is not possible to reliably write files atomically on Windows

and then links to one of the Go authors saying golang/go#22397 (comment):

it appears not to be possible to write a Windows atomic-rename library that avoids spurious errors in all cases: even using the recommended ReplaceFile system call, readers can still observe any of at least three errors under high write loads.

So that's a bummer. I'm going to have to think about what we might do instead to install snapshots. Note that snapshots sometimes get installed on nodes even without a recover operation or an explicit restore. So until we fix this issue, it's probably not a good idea to run integrated storage on Windows.

@ncabatoff
Collaborator

Good news: it sounds like the fact that atomic renames are not 100% safe in all conditions isn't the issue, and the associated caveats aren't relevant to our use case, nor are they the source of the bug you ran into. The only connection is that, in striving to make our renames atomic, we're doing something that doesn't work properly on Windows. The Consul team ran into this years ago and fixed it (hashicorp/raft#241, hashicorp/raft#243); we're going to adopt their solution. I'll start work on a fix tomorrow.

@ferAbleTech
Author

ferAbleTech commented Jul 21, 2021

Great, thank you very much for the update and the fix!

@pmmukh
Contributor

pmmukh commented Aug 20, 2021

This issue has been fixed in #12377

@pmmukh pmmukh closed this as completed Aug 20, 2021