
Stabilize chunk producer assignments across every epoch #11213

Closed
tayfunelmas opened this issue May 2, 2024 · 8 comments · Fixed by #11322
Labels
A-stateless-validation Area: stateless validation

Comments

@tayfunelmas
Contributor

Opening this to track the work based on this thread.

Note that we are disabling the shard-assignment shuffling in #11190. However, even after this change, there will still be chunk-producer assignment changes at epoch boundaries. The relevant analysis is being done in #11204. The goal of this task is to further limit shard-assignment changes by providing a deterministic algorithm that keeps the assignment stable across epochs.

@tayfunelmas added the A-stateless-validation Area: stateless validation label May 2, 2024
@walnut-the-cat
Contributor

Depends on #11204

@walnut-the-cat
Contributor

Related to #11202 and #11190

@Longarithm
Member

Longarithm commented May 14, 2024

New chunk producer assignments algorithm

With the requirement of sticky chunk-producer assignment, the existing code basically doesn't work, so we needed a new approach. We discussed whether to keep minimum_validators_per_shard (below) and decided to keep it.

Requirements, in order of priority:

  1. satisfy the minimum_validators_per_shard condition: each shard must have at least minimum_validators_per_shard different chunk producers assigned;
  2. then, minimise the number of validator repeats, defined as sum over validators of max(number of shards the validator is assigned to - 1, 0). For example, assigning 3 validators to 2 shards as [[0, 1], [1, 2]] gives 1 repeat; [[0, 0], [0, 1]] gives 2 repeats (see the sketch after this list). We didn't really discuss this, but it makes sense to me because each shard repeat dramatically increases the RAM requirements for a chunk producer, and we want to get rid of repeats as soon as possible;
  3. then, minimise the number of state syncs after the epoch switch.
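
To make the repeat metric from requirement 2 concrete, here is a minimal sketch (not nearcore code; the function name is illustrative) that computes it for the examples above:

```rust
use std::collections::HashMap;

// "Repeats" as defined in requirement 2:
// sum over validators of max(number of shards the validator is assigned to - 1, 0).
fn count_repeats(assignment: &[Vec<u64>]) -> usize {
    let mut shards_per_validator: HashMap<u64, usize> = HashMap::new();
    for shard in assignment {
        for &validator in shard {
            *shards_per_validator.entry(validator).or_insert(0) += 1;
        }
    }
    shards_per_validator
        .values()
        .map(|&count| count.saturating_sub(1))
        .sum()
}

fn main() {
    // The two examples from the requirement: 3 validators over 2 shards.
    assert_eq!(count_repeats(&[vec![0, 1], vec![1, 2]]), 1);
    assert_eq!(count_repeats(&[vec![0, 0], vec![0, 1]]), 2);
}
```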

Algorithm

Step 1. Take the previous chunk producer assignment. If the shard layout changed from the previous epoch, we assume the previous assignment was empty. This could be smarter and take parent shard ids into account, but we'll deal with that when resharding has to be supported.

  2. Remove validators which are not selected as chunk producers for the new epoch.

  3. If num_validators < minimum_validators_per_shard, return an error. Note that on mainnet and testnet minimum_validators_per_shard = 1 and we always have at least one validator, so this never happens; in tests, the test author must guarantee that the number of validators satisfies this requirement. Put validators into a map, sorted by the number of shards they produce chunks for. Put shards into a map (binary heap), sorted by the number of chunk producers they have.

Satisfy requirement 1:

  4. Take the shard with the minimal number of chunk producers. If this number is >= minimum_validators_per_shard, stop. Otherwise:
    4.1. Take the validator with the minimal number of assigned shards. If it is already assigned to this shard, take the next validator. Otherwise assign it and stop.
    4.2. Because num_validators >= minimum_validators_per_shard, 4.1 succeeds at some point. Repeat step 4.

Note that this may not end up with the fairest assignment. If we have 4 shards, 4 validators, 2 minimum seats and the current assignment is [[0, 1], [1, 2], [0, 2], [3, x]], then x is going to be 0, 1 or 2 and the assignment is unfair. But I want to ignore that, because it is a pathological case anyway, and with more validators joining, the next step resolves that unfairness.

Satisfy requirement 2:

  5. Take the validator with the maximal number of assigned shards. If this number is 1, stop - we are done. Otherwise, take the validator with the minimal number of assigned shards. If this number is >= 1, stop - we can't do anything. Otherwise, take a random shard from the maximal validator and reassign it to the minimal validator. Repeat step 5.

Each successful iteration decreases the number of repeats by 1.

Satisfy requirement 3:

  6. Put shards into a map (binary heap), sorted by the total stake of the chunk producers assigned to them.
  7. Take the validator with the minimal number of assigned shards. If this number is >= 1, stop - everyone is assigned. Otherwise, take the shard with the minimal stake and assign the validator to it. Repeat step 7. (A consolidated sketch of steps 4-7 follows below.)
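
To make steps 4-7 concrete, here is a minimal, self-contained sketch (not the nearcore code): it uses plain linear scans instead of the maps/heaps described above, and `Validator` and `assign_v1` are illustrative names only.

```rust
#[derive(Clone, Copy)]
struct Validator {
    stake: u128,
}

/// `assignment[s]` lists the indices (into `validators`) of chunk producers for shard `s`.
fn assign_v1(
    mut assignment: Vec<Vec<usize>>,
    validators: &[Validator],
    minimum_validators_per_shard: usize,
) -> Result<Vec<Vec<usize>>, &'static str> {
    if validators.is_empty() || validators.len() < minimum_validators_per_shard {
        return Err("num_validators < minimum_validators_per_shard");
    }
    // Number of shards a validator is currently assigned to.
    let shards_of =
        |a: &Vec<Vec<usize>>, v: usize| a.iter().filter(|cps| cps.contains(&v)).count();

    // Step 4: bring every shard up to `minimum_validators_per_shard` chunk producers.
    loop {
        let (shard, len) = assignment
            .iter()
            .enumerate()
            .map(|(s, cps)| (s, cps.len()))
            .min_by_key(|&(_, len)| len)
            .ok_or("no shards")?;
        if len >= minimum_validators_per_shard {
            break;
        }
        // Validator with the fewest assigned shards that is not already on this shard.
        let v = (0..validators.len())
            .filter(|&v| !assignment[shard].contains(&v))
            .min_by_key(|&v| shards_of(&assignment, v))
            .ok_or("cannot satisfy minimum_validators_per_shard")?;
        assignment[shard].push(v);
    }

    // Step 5: trade repeats away. The text picks a random shard of the maximal
    // validator; the first matching shard is used here for determinism.
    loop {
        let max_v = (0..validators.len())
            .max_by_key(|&v| shards_of(&assignment, v))
            .unwrap();
        let min_v = (0..validators.len())
            .min_by_key(|&v| shards_of(&assignment, v))
            .unwrap();
        if shards_of(&assignment, max_v) <= 1 || shards_of(&assignment, min_v) >= 1 {
            break;
        }
        let shard = assignment.iter().position(|cps| cps.contains(&max_v)).unwrap();
        assignment[shard].retain(|&v| v != max_v);
        assignment[shard].push(min_v);
    }

    // Steps 6-7: every still-unassigned validator goes to the shard with the least stake.
    for v in 0..validators.len() {
        if shards_of(&assignment, v) >= 1 {
            continue;
        }
        let shard = assignment
            .iter()
            .enumerate()
            .min_by_key(|(_, cps)| cps.iter().map(|&i| validators[i].stake).sum::<u128>())
            .map(|(s, _)| s)
            .ok_or("no shards")?;
        assignment[shard].push(v);
    }
    Ok(assignment)
}
```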

In the end we achieve the minimal number of state syncs, because:

  • validators which don't track any shard have to be assigned to some shard and sync to it, so this is unavoidable;
  • validators which already track shards cannot be unassigned from one shard and assigned to a new one, because we only unassign shards at step 5, and only reassign them to validators which didn't track any shard before.

Notes:

  • We don't necessarily have to take stakes into account, but doing so should make the protocol more stable, as a validator with more stake is more incentivised to stay online. In the regular scenario, some validators simply leave the old assignment and new validators fill their places based on stake.
  • In the bad scenario, when we mostly assign shards to satisfy requirements 1-2, the stakes can end up very different, and we consciously don't take this into account: this case is pathological, and on mainnet minimum_validators_per_shard * num_shards is way lower than the number of chunk producers, so requirement 3 should be enough to balance stakes.
  • If we have 5 validators with stakes [4, 1, 1, 1, 1] and 2 shards, then [[0], [1, 2, 3, 4]] is likely to be generated, which is the case today as well.

@Longarithm
Member

Longarithm commented May 14, 2024

Chunk producer assignments algorithm v2

New considerations

The feedback was that v1 didn't consider the following case:
We have 5 shards, 10 validators and the assignment is [[0, 1, 2, 3, 4, 5], [6], [7], [8], [9]]. Then validator 9 goes offline. It would be more reasonable to reassign some of validators 0-5 to shard 4, but v1 doesn't handle this at step 5, so we end up with something like [[0, 1, 2, 3, 4, 5], [6], [7], [8], [8]], which adds an unnecessary repeat.
This scenario is not even that unlikely: if some shard suffers under high load and most of the chunk producers on it get kicked out, it may make sense to reassign chunk producers from other shards to it.

At the same time, we still keep minimum_validators_per_shard. To simplify the logic, if num_validators < minimum_validators_per_shard * num_shards, we'll use the part of the old algorithm which assigns validators with repeats. Otherwise, we'll use the completely new idea described below.
One drawback is that switching from one mode to the other may require lots of state syncs. But we claim that the first situation is very unlikely on testnet/mainnet, and if it actually happens, the network is already in a very bad state (since minimum_validators_per_shard = 1), and then state sync is not the biggest concern.
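
A tiny sketch of this mode switch (function and parameter names are illustrative only):

```rust
// Fall back to the old repeat-based assignment only when there are too few
// chunk producers to give every shard its minimum without repeats.
fn use_repeats_fallback(
    num_validators: usize,
    minimum_validators_per_shard: usize,
    num_shards: usize,
) -> bool {
    num_validators < minimum_validators_per_shard * num_shards
}
```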

New algorithm idea

First, the easy steps. Take the previous chunk producer assignment. If the shard layouts in the previous and current epoch are different, we assume that the assignment was empty. This could be smarter and take parent shard ids into account, but we'll deal with that when resharding has to be supported.

Then we remove validators which are not selected as chunk producers for the new epoch (because they went offline or their stake decreased). After that, the optimisation consists of performing the following two kinds of operations in the right way:

  • assign new validator to shard X
  • reassign old validator from shard X to shard Y

Note that within the scope of this task, we optimise over three parameters which matter to us:

  1. S = number of state syncs
  2. D = max_{shard X}(number of validators tracking X) - min_{shard X}(number of validators tracking X)
  3. T = max_{shard X}(stake tracking X) - min_{shard X}(stake tracking X)

For simplicity we don’t consider T.
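
For reference, a minimal sketch (illustrative names only) of how D can be computed for a given assignment; T is analogous, summing stakes instead of counting validators:

```rust
/// D = (max over shards of the number of validators tracking the shard)
///   - (min over shards of the number of validators tracking the shard),
/// where `assignment[s]` holds the chunk producers of shard `s`.
fn validator_count_imbalance(assignment: &[Vec<usize>]) -> usize {
    let max = assignment.iter().map(|cps| cps.len()).max().unwrap_or(0);
    let min = assignment.iter().map(|cps| cps.len()).min().unwrap_or(0);
    max - min
}
```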

First, we have some new validators to assign. They have to state sync anyway. Sort them by (reversed stake, id) for clarity. Use them to optimise D by iteratively assigning each of them to the shard tracked by the minimal number of validators (the "minimal shard").

Then, the only way to minimise D further (bring it to 0 or 1) is to reassign some random validator from the maximal shard to the minimal shard. Each such operation increases S by 1. So the resulting D is determined by how many state syncs per epoch we allow. Define that limit as S_max.

Then condition C for reassigning is: we keep reassigning while (the minimal shard has < minimum_validators_per_shard validators) OR ((D is not minimal) AND (S < S_max)).

Algorithm

  1. Clean up the previous chunk producer assignment, as described above;
  2. Iteratively assign new validators, sorted by reversed stake, to the minimal shard;
  3. Iteratively reassign old validators while condition C holds (see the sketch below).
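
Here is a minimal, self-contained sketch of steps 2-3 (not the nearcore implementation; `assign_v2`, `Validator` and the parameter names are illustrative). It restates condition C directly, with an extra guard on the largest shard so the sketch terminates even without the num_validators >= minimum_validators_per_shard * num_shards precondition this mode assumes:

```rust
use std::cmp::Reverse;

#[derive(Clone, Copy)]
struct Validator {
    id: usize,
    stake: u128,
}

/// `assignment[s]` lists validator ids producing chunks for shard `s`
/// (the previous assignment after the step-1 cleanup).
fn assign_v2(
    mut assignment: Vec<Vec<usize>>,
    mut new_validators: Vec<Validator>, // chunk producers not present in `assignment`
    minimum_validators_per_shard: usize,
    s_max: usize,
) -> Vec<Vec<usize>> {
    let min_shard = |a: &Vec<Vec<usize>>| {
        a.iter().enumerate().min_by_key(|(_, cps)| cps.len()).map(|(s, _)| s).unwrap()
    };
    let max_shard = |a: &Vec<Vec<usize>>| {
        a.iter().enumerate().max_by_key(|(_, cps)| cps.len()).map(|(s, _)| s).unwrap()
    };

    // Step 2: new validators have to state sync anyway, so use them to shrink D.
    // Sort by (reversed stake, id) and assign each to the currently minimal shard.
    new_validators.sort_by_key(|v| (Reverse(v.stake), v.id));
    for v in &new_validators {
        let s = min_shard(&assignment);
        assignment[s].push(v.id);
    }

    // Step 3: reassign old validators while condition C holds.
    let mut state_syncs = 0;
    loop {
        let (lo, hi) = (min_shard(&assignment), max_shard(&assignment));
        let (lo_len, hi_len) = (assignment[lo].len(), assignment[hi].len());
        // Condition C: either lift the minimal shard up to the required minimum
        // (the `hi_len > minimum` part is the termination safeguard mentioned above),
        // or keep shrinking D while the state-sync budget S_max allows it.
        let lift_minimum =
            lo_len < minimum_validators_per_shard && hi_len > minimum_validators_per_shard;
        let shrink_d = hi_len - lo_len > 1 && state_syncs < s_max;
        if !(lift_minimum || shrink_d) {
            break;
        }
        // The text says "some random validator"; the last one is moved here for brevity.
        let moved = assignment[hi].pop().unwrap();
        assignment[lo].push(moved);
        state_syncs += 1;
    }
    assignment
}
```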

Notes

  • Each S_max will give a different distribution of D values across epochs. I will look at historical data to see how it works; I expect S_max <= 10. I will also look at T to check whether large discrepancies can happen.
  • Reassigning could optimise T as well. For example, when selecting the validator to reassign, we could pick the one with minimal or maximal stake, but it is not clear which is better: the first doesn't impact T much, and the second may lead to the validators with the biggest stake state syncing more frequently, which is not great. We could also add a "swap validators" operation, which could help with rebalancing.

@tayfunelmas
Contributor Author

tayfunelmas commented May 15, 2024

Thanks Alex, this looks much simpler. If I understand correctly, the new algorithm will be the default and the previous one will only apply if the number of producers goes below a certain threshold? Could there be a way to modify the new algorithm to support the case of the number of producers going under the limit, instead of running two separate algorithms?

Some questions:
D = (max num validators tracking shard - min num validators tracking shard)
do you mean the shard tracked by the max number of chunk producers and the shard tracked by the min number of chunk producers? Basically, how big the gap in producer load is between shards?

T = (max stake tracking shard - min stake tracking shard)
Similar question.

Then condition C for reassigning is: we keep reassigning while (the minimal shard has < minimum_validators_per_shard validators) OR ((D is not minimal) AND (S < S_max)).
So we want to set S_max to some fixed value, right? Can we see whether we can get an upper bound for S_max that satisfies the other two conditions (which are OR'ed)?

I think it would be useful in general to find a way to take data from N epochs and run the algorithm on it; I'm not sure what the right way to capture such information is (e.g. a sequence of ValidatorStakes for N epochs and the assignment for the initial state).

@Longarithm
Member

Longarithm commented May 15, 2024

Could there be a way to modify the new algorithm to support the case of num producers going under a limit instead of running two separate algorithms?

I don't see one. The issue is that it would add another parameter to optimise - the number of validator repeats in the assignment - and that would make the logic harder.
For example, we would need to guarantee that:

  • when the number of validators grows above the limit, we'd like to have 0 repeats
  • when this number falls below the limit, we need to assign some old validators to additional shards while keeping their old shards; this is not a reassignment

So we would need some smooth transition of the assignment that supports both adding repeats and getting rid of them. I doubt the complexity of the resulting algorithm would be lower, so I suggest treating this corner case separately.

do you mean the shard tracked by max number of chunk-producers and the shard tracked by min number of chunk-producers?

Yes (and yes). I changed the wording a bit.

So we want to set S_max to some fixed value right?

Yes; we could also choose it dynamically if needed, but I doubt that is necessary.

Can we see if we can get an upper bound for S_max that satisfies the other two conditions (which are OR'ed)?

I don't get the question. S_max can be 100 on mainnet, which would mean that we can reassign everyone to achieve minimal D.

@tayfunelmas
Contributor Author

Thanks for the answers and the offline discussion. I now have a good understanding of the differences and similarities between the two algorithms, and we can proceed with the implementation.

@Longarithm
Member

Longarithm commented May 20, 2024

Evaluation of new algorithm

We analyse how well the algorithm behaves on the last ~750 epochs, at epoch heights from 1200 to 2507, which is around 2 years of mainnet data. In short, the new algorithm looks good for our needs.

Results

We look at three factors, following the previous logic: number of state syncs, number of validators per shard, and stake imbalance.

State syncs

A quick look at the old algo shows that the number of state syncs is wild, consistently requiring ~50 chunk producers to download a new shard in the next epoch.
For the new algo, the average number of state syncs is 0.9 per epoch, so on average we should expect only one validator to start tracking a new shard.

Validators number

In the old algo, the maximal difference in the number of chunk producers tracking a shard is around 1-4, which makes sense, because that algorithm cares about stakes, not validator counts.
The new algo explicitly cares about validator counts, and the maximal difference is always 1, because the number of chunk producers (100) is not divisible by 6. Either way, both algorithms keep every shard well above the ~3 chunk producers below which data availability could become critical.

Stake difference

The old algo optimises stake balance, so it is no surprise that it performs great: the relative (max - min) / max stake difference is below 1%.
The new algorithm doesn't explicitly balance stakes. Even so, the maximal relative stake difference would be 66%, and recently it even reached 31% (not sure why). The minimal stake ever assigned to a shard would be 36M NEAR, which seems more than enough for data availability concerns.
66% is definitely worse than 1%, but we explicitly stated that a precise balance of assigned stakes doesn't matter to us in the near future. Simply put, we don't care if we have two shards tracked by 36M and 360M of stake in the same epoch. So the new algo looks fine.

Methodology

A hacky tool which reruns near_epoch_manager::proposals_to_epoch_info, always feeding it the chunk producer assignment we previously generated. We pretend that we always had 6 shards and 100 seats for chunk producers.

Command: ./target/dev-release/neard view-state epoch-analysis --height 1200

old_algo.csv
new_algo.csv

github-merge-queue bot pushed a commit that referenced this issue May 22, 2024
Using stable (sticky) chunk producer assignments since stateless
validation protocol version, applying algorithm from
#11213 (comment).

New algorithm properties:
* tries to balance the numbers of chunk producers across shards, **ignoring the
stakes**. Comparison of stake diffs for the old and new algo:
#11213 (comment)
* minimises the number of state syncs to be made, setting the limit on
reassignment-caused state syncs to 5 in the epoch config. Figuring out the
exact number is still a TODO, but for mainnet it shouldn't matter because we
have way more validator proposals than planned chunk producers (100) at every
epoch.

The old assignment algorithm is moved behind `old_validator_selection`; the
part that assigns validators with repeats is still used, though.

The main function is `assign_chunk_producers_to_shards`, which is
comprehensively tested. +737 lines is scary, but the algorithm itself is
+300 lines with comments and the tests are another +250.
`get_chunk_producers_assignment` is separated into its own function because it
became too big.
marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this issue May 23, 2024