Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quick validator key swap #11264

Open
wacban opened this issue May 8, 2024 · 5 comments
Open

quick validator key swap #11264

wacban opened this issue May 8, 2024 · 5 comments
Assignees

Comments

@wacban
Copy link
Contributor

wacban commented May 8, 2024

Make it possible to quickly move validator keys from one node to another. The goal is to allow validators to perform common maintenance operations with no downtime.

In a typical scenario a validator needs to restart their node. The reason may be a new neard release, need to update configs or any issue that may be causing the node to misbehave. In such circumstance the node operator runs two nodes - the old one with the validator keys and the new one to get it warm and ready. Once the new node is ready the operator stops the old node, moves the validator keys to the new node and restarts the new node. Unfortunately restarting the node may take some time and this will get worse once the memtrie is release. This issue is to allow to move the validator keys from one node to another quickly.

It's important to make sure that no two nodes have the validator keys at the same time as those would both produce blocks and chunks which can be considered as malicious behaviour.

So far the best ideas to implement this are:

  • Allow node to hot load the validator keys upon receiving a signal. We already have a similar solution for some config fields. The operator can start the new node without validator keys, copy them from old node, stop the old node, signal the new node to load the keys.
  • Allow operator to configure some predefined height at which the validator keys will become valid. The operator can start the new node with the validator keys and start height in the future and stop the old node right before that height.
@staffik staffik self-assigned this May 9, 2024
@staffik
Copy link
Contributor

staffik commented May 13, 2024

The plan is to start with:

  • Integration test (pytest / nayduck) that fails.
  • Draft implementation so that the test passes.
  • First round of review (on the approach).

If looks fine then:

  • Adversenet testing.
  • Forknet testing.
  • Final review.

@staffik
Copy link
Contributor

staffik commented May 15, 2024

2024-05-15 (Wednesday) Update

  • Added integration test for the feature, it fails ofc.
  • Loading validator key on SIGHUP, but it is not picked by client yet.

Changes:
staffik/nearcore@master...staffik:bcd14a28bb8cca2a7e005ad8347c458f6bb6cca1

Working on extending existing UpdateableConfigs so that we can update the client with the new validator key.

@wacban
Copy link
Contributor Author

wacban commented May 20, 2024

@saketh-are Is there anything that we need to do on the networking side? For example announce that that a different node has the validator keys?

@saketh-are
Copy link
Collaborator

I don't believe any changes will be needed on the networking side.

Every node in the network has either 1 or 2 public keys associated with it:

  • Every node has a peer id, a public key identifying the node itself
  • Some nodes also have an account key, a public key identifying the validator account being handled by the node

Every node maintains in memory a mapping of the AccountKey -> PeerId relationships it is aware of. There is some logic already implemented which allows the mapping to be updated if a validator key starts to be hosted by a different peer id.

The things to make sure of are:

  • Running two nodes at the same time with the same peer id is not supported behavior at the network layer.
  • As you said, it's also important to make sure that no two nodes operate the same validator account at the same time.
  • If you plan to hotswap onto a node it should be given sufficient time to form network connections; starting up the backup node on-demand will be worse than simply restarting the original node. Ideally the backup node would be running for 10+ minutes already and the operator may wish to confirm that it has a reasonable number of peers before hotswapping onto it.

@staffik
Copy link
Contributor

staffik commented May 21, 2024

2024-05-21 (Tuesday) Update

  • Extended UpdateableConfigs with validator key, so that it can be updated dynamically.
  • Refactored Signer and ValidatorSigner in order to be able to wrap validator key with MutableConfigValue.

Changes:
staffik/nearcore@staffik:bcd14a28bb8cca2a7e005ad8347c458f6bb6cca1...staffik:9ea088938a6d6d4761100bd606de18288a23d107

However, the chain is stuck after the hotswap. Investigating why.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants