Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blockstore: atomize slot clearing, relax parent slot meta check #35124

Merged
merged 8 commits into from
Mar 3, 2024

Conversation

AshwinSekar
Copy link
Contributor

@AshwinSekar AshwinSekar commented Feb 7, 2024

clear_unconfirmed_slot can leave blockstore in an irrecoverable state
if it panics in the middle. write batch this function, so that any
errors can be recovered after restart.

Copy link

codecov bot commented Feb 7, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 81.8%. Comparing base (6ee3bb9) to head (bc0f888).
Report is 8 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #35124    +/-   ##
========================================
  Coverage    81.7%    81.8%            
========================================
  Files         834      834            
  Lines      224299   224828   +529     
========================================
+ Hits       183361   183923   +562     
+ Misses      40938    40905    -33     

@AshwinSekar AshwinSekar marked this pull request as ready for review February 7, 2024 17:29
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
@AshwinSekar AshwinSekar marked this pull request as draft February 7, 2024 19:30
@AshwinSekar AshwinSekar changed the title blockstore: update next_slots during failure even if missing parent blockstore: atomize slot clearing, relax parent slot meta check Feb 7, 2024
@AshwinSekar AshwinSekar marked this pull request as ready for review February 7, 2024 21:41
@AshwinSekar
Copy link
Contributor Author

reworked as per #35124 (comment)
the writes now happen atomically

@AshwinSekar AshwinSekar added v1.17 PRs that should be backported to v1.17 v1.18 PRs that should be backported to v1.18 labels Feb 8, 2024
Copy link
Contributor

mergify bot commented Feb 8, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

Copy link
Contributor

mergify bot commented Feb 8, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

ledger/src/blockstore/blockstore_purge.rs Show resolved Hide resolved
Comment on lines 285 to 290
if let Some(mut slot_meta) = slot_meta {
let child_slot = slot_meta.slot;
if child_slot != from_slot || child_slot != to_slot {
error!("Slot meta parent cleanup was requested for {}, but a range was specified {} {}", child_slot, from_slot, to_slot);
return Err(BlockstoreError::InvalidRangeForSlotMetaCleanup);
}
Copy link
Contributor

@steviez steviez Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to do this sanity checking if we fetch the SlotMeta in this function ourselves instead of relying on the caller

Technically, this would add some overhead since we'd be fetching the SlotMeta twice, once here and once in Blockstore::clear_unconfirmed_slot(). I think that's ok, but if we're conscious about that, I think we could remove the SlotMeta check in Blockstore::clear_unconfirmed_slot() anyways. The rocksdb range deletes are cheap, and this function should only be getting called if ReplayStage knows it has something that it needs to delete.

That being said, fetching a single SlotMeta is cheap too and if we're calling this function, stuff is kind of messed up already (ie the node is likely going to have to repair the slot)

ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
@AshwinSekar AshwinSekar force-pushed the purge-handling branch 2 times, most recently from 601f68c to 45e292d Compare February 21, 2024 21:36
@AshwinSekar
Copy link
Contributor Author

AshwinSekar commented Feb 21, 2024

In the interest of keeping the backport surface small, i've added a new PurgeType variant, and fetched the slot_meta within the purge function.

In a future master PR I can unify the remaining PurgeType::Exact callers as suggested here #35124 (comment) and remove this variant. tracking in #35281

@AshwinSekar AshwinSekar force-pushed the purge-handling branch 2 times, most recently from 1b73953 to 6b496a5 Compare February 25, 2024 23:17
@@ -273,6 +264,72 @@ impl Blockstore {
Ok(columns_purged)
}

fn purge_range(&self, write_batch: &mut WriteBatch, from_slot: Slot, to_slot: Slot) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to include purge_special_columns_exact() in here?

fn purge_range() {}
fn purge_range_special_columns(){}
fn do_purge_range(&self, write_batch: &mut WriteBatch, from_slot: Slot, to_slot: Slot, should_purge_special_columns: bool)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, i just passed through PurgeType instead of separate functions.

ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits, liking the way this is coming together. Will take a final pass shortly

ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more nits on comments + simplification. Think we're good for a ship-it after these changes; I can give that quickly after you make the changes but think it'd be nice to get a final pass & approval from Carl too

ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Show resolved Hide resolved
ledger/src/blockstore/blockstore_purge.rs Outdated Show resolved Hide resolved
clear_unconfirmed_slot can leave blockstore in an irrecoverable state
if it panics in the middle. write batch this function, so that any
errors can be recovered after restart.

additionally relax the constraint that the parent slot meta must exist,
as it could have been cleaned up if outdated.
Co-authored-by: steviez <stevecz@umich.edu>
Copy link
Contributor

@steviez steviez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for pushing through after we collectively changed our mind several times. Just to be sure, let's see if we can get @carllin to give one more pass too

write_batch: &mut WriteBatch,
from_slot: Slot,
to_slot: Slot,
purge_type: PurgeType,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think it's cleaner encapsulation to pass a should_purge_special_columns boolean here, and then have a function PurgeType::should_should_purge_special_columns() -> bool method on PurgeType. This way purge_slot_cleanup_chaining() doesn't have to know about PurgeType at all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way purge_slot_cleanup_chaining() doesn't have to know about PurgeType at all

I see the merit of adding the new function purge_slot_cleanup_chaining() as this has the unique functionality from run_purge_with_stats. However, I'm not sure I see the benefit of the boolean over the enum; can you elaborate?

The enum is part of the public API, so I think it is reasonable to expect someone to know about it. And the Exact value of the enum means "go purge the special columns for these slots right now", which is what the boolean would convey.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It seems like the PurgeType was designed specifically for the run_purge_with_stats workflow, so it seemed better to keep it isolated there rather than mixing it with this new clean function
  2. purge_slot_cleanup_chaining as a utility function seems like it should just decide which columns to clean. Seems like cramming the PurgeType in there was too high level.

@AshwinSekar AshwinSekar merged commit cc4072b into solana-labs:master Mar 3, 2024
35 checks passed
mergify bot pushed a commit that referenced this pull request Mar 3, 2024
* blockstore: atomize slot clearing, relax parent slot meta check

clear_unconfirmed_slot can leave blockstore in an irrecoverable state
if it panics in the middle. write batch this function, so that any
errors can be recovered after restart.

additionally relax the constraint that the parent slot meta must exist,
as it could have been cleaned up if outdated.

* pr feedback: use PurgeType, don't pass slot_meta

* pr feedback: add unit test

* pr feedback: refactor into separate function

* pr feedback: add special columns to helper, err msg, comments

* pr feedback: reword comments and write batch error message

* pr feedback: bubble write_batch error to caller

* pr feedback: reword comments

Co-authored-by: steviez <stevecz@umich.edu>

---------

Co-authored-by: steviez <stevecz@umich.edu>
(cherry picked from commit cc4072b)

# Conflicts:
#	ledger/src/blockstore.rs
#	ledger/src/blockstore/blockstore_purge.rs
mergify bot pushed a commit that referenced this pull request Mar 3, 2024
* blockstore: atomize slot clearing, relax parent slot meta check

clear_unconfirmed_slot can leave blockstore in an irrecoverable state
if it panics in the middle. write batch this function, so that any
errors can be recovered after restart.

additionally relax the constraint that the parent slot meta must exist,
as it could have been cleaned up if outdated.

* pr feedback: use PurgeType, don't pass slot_meta

* pr feedback: add unit test

* pr feedback: refactor into separate function

* pr feedback: add special columns to helper, err msg, comments

* pr feedback: reword comments and write batch error message

* pr feedback: bubble write_batch error to caller

* pr feedback: reword comments

Co-authored-by: steviez <stevecz@umich.edu>

---------

Co-authored-by: steviez <stevecz@umich.edu>
(cherry picked from commit cc4072b)
AshwinSekar added a commit to AshwinSekar/solana that referenced this pull request Mar 4, 2024
…na-labs#35124)

* blockstore: atomize slot clearing, relax parent slot meta check

clear_unconfirmed_slot can leave blockstore in an irrecoverable state
if it panics in the middle. write batch this function, so that any
errors can be recovered after restart.

additionally relax the constraint that the parent slot meta must exist,
as it could have been cleaned up if outdated.

* pr feedback: use PurgeType, don't pass slot_meta

* pr feedback: add unit test

* pr feedback: refactor into separate function

* pr feedback: add special columns to helper, err msg, comments

* pr feedback: reword comments and write batch error message

* pr feedback: bubble write_batch error to caller

* pr feedback: reword comments

Co-authored-by: steviez <stevecz@umich.edu>

---------

Co-authored-by: steviez <stevecz@umich.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17 v1.18 PRs that should be backported to v1.18
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants