
Introduce SchedulingStateMachine for unified scheduler #129

Merged
ryoqun merged 54 commits into anza-xyz:master from scheduling-state-machine-for-agave on Apr 4, 2024

Conversation

ryoqun
Member

@ryoqun ryoqun commented Mar 7, 2024

Migrated solana-labs#35286.

(let's finish up the review at the original pr; then just r+ on this pr, sans an unfortunate rebase due to ci; this branch will be kept in sync with the original branch)

perf numbers

before:

[2024-04-05T03:35:52.628226692Z INFO  solana_ledger::blockstore_processor] ledger processed in 37 seconds, 775 ms, 302 µs and 825 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400

after:

[2024-04-05T03:24:34.316539714Z INFO  solana_ledger::blockstore_processor] ledger processed in 21 seconds, 643 ms, 40 µs and 195 ns. root slot is 249712363, 36 banks: 249712363, 249712364, 249712365, 249712366, 249712367, 249712368, 249712369, 249712370, 249712371, 249712372, 249712373, 249712374, 249712375, 249712378, 249712379, 249712380, 249712381, 249712382, 249712383, 249712384, 249712385, 249712386, 249712387, 249712388, 249712389, 249712390, 249712391, 249712392, 249712393, 249712394, 249712395, 249712396, 249712397, 249712398, 249712399, 249712400

@ryoqun ryoqun changed the title from Scheduling state machine for agave to Introduce SchedulingStateMachine for unified scheduler on Mar 7, 2024
@codecov-commenter

codecov-commenter commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 96.96510%, with 20 lines in your changes missing coverage. Please review.

Project coverage is 81.9%. Comparing base (cc3afa5) to head (8c0c6bd).
Report is 56 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master     #129    +/-   ##
========================================
  Coverage    81.9%    81.9%            
========================================
  Files         837      837            
  Lines      226823   227455   +632     
========================================
+ Hits       185828   186415   +587     
- Misses      40995    41040    +45     

@alessandrod

I've just left my review on solana-labs#35286 (review) which is what I was tagged on. Hopefully this PR matches the same code 😅

@ryoqun
Member Author

ryoqun commented Mar 18, 2024

... Hopefully this PR matches the same code 😅

I'll keep them in-sync. :) fyi, you can confirm my diligence by this:

$ git range-diff ryoqun/scheduling-state-machine...ryoqun/scheduling-state-machine-for-agave

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch 2 times, most recently from 86b27ce to 9e07dcd Compare March 18, 2024 13:45
@alessandrod alessandrod self-requested a review March 22, 2024 07:23
///
/// Note that lifetime of the acquired reference is still restricted to 'self, not
/// 'token, in order to avoid use-after-free undefined behaviors.
pub(super) fn with_borrow_mut<R>(

@alessandrod alessandrod Mar 22, 2024

I think I might be missing something, but what prevents this

            struct FakeQueue {
                v: Vec<u8>,
            }

            let queue = Arc::new(TokenCell::new(FakeQueue { v: Vec::new() }));

            let q1 = Arc::clone(&queue);
            std::thread::spawn(move || {
                let mut token = unsafe { Token::<FakeQueue>::assume_exclusive_mutating_thread() };
                q1.with_borrow_mut(&mut token, |queue| {
                    // this is UB
                    let v = &mut queue.v;
                });
            });
            std::thread::spawn(move || {
                let mut token = unsafe { Token::<FakeQueue>::assume_exclusive_mutating_thread() };
                queue.with_borrow_mut(&mut token, |queue| {
                    // this is UB
                    let v = &mut queue.v;
                });
            });

There's one token per thread, so I'm respecting the Token::assume_blah()
contract. I have a token, so I'm respecting the TokenCell::with_borrow_mut contract.

Member Author

@ryoqun ryoqun Mar 22, 2024

hmm, the above example is violating this #129 (comment):

    /// By extension, it's allowed to create _multiple_ tokens in a _single_ process as long as no
    /// instance of [`TokenCell`] is shared by multiple instances of [`Token`].

i.e. the Arc-ed TokenCell is shared by the tokens in the two threads.

as a quick recap, the objective of this fancy TokenCell is to offload creation of these objects to multiple threads. those threads don't need mutable references, so no tokens are needed for them. on the other hand, the scheduler thread needs mutability, and it's the sole thread that ever mutates all of those created objects. in this way, i want to avoid Mutex-ing things.
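To make that split concrete, here is a minimal sketch using the TokenCell/Token API from this PR (FakeQueue and the thread layout are just placeholders, not the real UsageQueue plumbing):

    use std::sync::Arc;

    struct FakeQueue {
        v: Vec<u8>,
    }

    // Any thread may create TokenCell-wrapped objects; creation needs no Token.
    let producer = std::thread::spawn(|| Arc::new(TokenCell::new(FakeQueue { v: Vec::new() })));
    let queue = producer.join().unwrap();

    // Only the single scheduler thread ever holds a Token and mutates the cells.
    let scheduler = std::thread::spawn(move || {
        // Safety: this is the only thread that will ever mutate these cells.
        let mut token = unsafe { Token::<FakeQueue>::assume_exclusive_mutating_thread() };
        queue.with_borrow_mut(&mut token, |queue| {
            queue.v.push(1);
        });
    });
    scheduler.join().unwrap();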

@alessandrod

I've re-read that comment a few times and I still don't understand it to mean what you suggest it should mean. I can't create multiple Token's in the same thread because the constructor asserts that. And the Arc is shared among threads the same way UsageQueue is shared among threads. I understand what you're saying, but from the API and docs you're providing me, I don't think I'm doing anything wrong by sharing the TokenCell among threads, I thought that was the whole point.

as a quick recap, the objective of this fancy TokenCell is to offload creation of these objects to multiple threads. those threads don't need mutable references, so no tokens are needed for them. on the other hand, the scheduler thread needs mutability, and it's the sole thread that ever mutates all of those created objects. in this way, i want to avoid Mutex-ing things.

Ok, but now you've created TokenCell, which claims to be Send and Sync, but it's in fact not Sync. The documentation claims that the associated Token API is what ensures that there's ever only one live mutable access to the inner value, when it doesn't. In fact as far as I can tell, the token just acts as an inconvenience/deterrent within one thread, so callers are somehow alerted that something dangerous is happening. But then the same could be achieved without a token at all, just by calling the accessors borrow_but_this_is_really_dangerous() borrow_mut_this_is_even_more_dangerous().

I don't know enough about how these UsageQueues are being instantiated, used etc (feel free to link to code if you want to), but my feeling is that we should be using a Mutex, wrapped in a wrapper that always uses try_lock and asserts that the Mutex is in fact never contended. That way you create a Sync API that is 100% safe, at the cost of a compare and swap, and - again without currently knowing much about usage patterns - I'd bet 1 BONK that it wouldn't have a significant performance impact.

Alternatively if you don't want to pay the price of CAS at all, I think you should look into moving these UsageQueues by value so that they don't need to be Sync at all. IMO creating an API which looks and sounds safe to be used by multiple threads, but in fact very easily introduces UB when used by multiple threads unless "held correctly", is not something we should do, especially in new code.
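For reference, a minimal sketch of that try_lock-and-assert wrapper (NeverContendedMutex and its method names are made up for illustration):

    use std::sync::Mutex;

    // A Mutex that is asserted to never actually be contended: try_lock() costs
    // one compare-and-swap and panics loudly instead of ever blocking or racing.
    pub struct NeverContendedMutex<T>(Mutex<T>);

    impl<T> NeverContendedMutex<T> {
        pub fn new(value: T) -> Self {
            Self(Mutex::new(value))
        }

        pub fn with_borrow_mut<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
            let mut guard = self.0.try_lock().expect("must never be contended");
            f(&mut *guard)
        }
    }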


@apfitzge apfitzge Mar 22, 2024

Given hyper-sensitivity of scheduler it might be that the Mutex::try_lock CAS does have an impact - but would be good to measure that impact if we can avoid the unsafe here.

I think the ultimate usage of the unsafe-ness works as currently written, because we only allow token access through &mut self fns on the SchedulingStateMachine. But introducing something that other devs are confused about makes this more challenging to maintain.

Member Author

@alessandrod and @apfitzge: really thanks for spending time on reviewing this.

I've re-read that comment a few times and I still don't understand it to mean what you suggest it should mean. I can't create multiple Token's in the same thread because the constructor asserts that. And the Arc is shared among threads the same way UsageQueue is shared among threads. I understand what you're saying, but from the API and docs you're providing me, I don't think I'm doing anything wrong by sharing the TokenCell among threads, I thought that was the whole point.

okay, in this case, i think the doc is worded badly. I'll try to rearrange it.

as a quick recap, the objective of this fancy TokenCell is to offload creation of these objects to multiple threads. those threads don't need mutable references, so no tokens are needed for them. on the other hand, the scheduler thread needs mutability, and it's the sole thread that ever mutates all of those created objects. in this way, i want to avoid Mutex-ing things.

Ok, but now you've created TokenCell, which claims to be Send and Sync, but it's in fact not Sync. The documentation claims that the associated Token API is what ensures that there's ever only one live mutable access to the inner value, when it doesn't.

as long as the (poorly-written) user contract is properly adhered to, there should only be one.

In fact as far as I can tell, the token just acts as an inconvenience/deterrent within one thread, so callers are somehow alerted that something dangerous is happening. But then the same could be achieved without a token at all, just by calling the accessors borrow_but_this_is_really_dangerous() borrow_mut_this_is_even_more_dangerous().

yep. it's just a deterrent. however, consider this:

this would pass compilation, yet it is ub because it violates rust's aliasing rules:

// assume arced_token_cell111 and arced_token_cell222 could point to same instance.
let a = arced_token_cell111.borrow_mut_this_is_even_more_dangerous();

for ... {
    let b = arced_token_cell222.borrow_mut_this_is_even_more_dangerous();
}

but, the inconvenient TokenCell can prevent this from happening accidentally. in fact, i made this very mistake myself while coding. that's why i introduced it: the deterrent comes at no runtime cost.
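For instance, the equivalent nesting written against TokenCell is rejected at compile time, because the token is exclusively borrowed for the duration of the outer closure (cell111 and cell222 are placeholder names):

    // assume cell111 and cell222 could point to the same instance.
    cell111.with_borrow_mut(&mut token, |a| {
        for _ in 0..10 {
            // does not compile: `token` is already mutably borrowed by the
            // outer with_borrow_mut call, so it can't be passed here again.
            cell222.with_borrow_mut(&mut token, |b| {
                // ...
            });
        }
    });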

I don't know enough about how these UsageQueues are being instantiated, used etc (feel free to link to code if you want to), but my feeling is that we should be using a Mutex, wrapped in a wrapper that always uses try_lock and asserts that the Mutex is in fact never contended. That way you create a Sync API that is 100% safe, at the cost of a compare and swap, and - again without currently knowing much about usage patterns - I'd bet 1 BONK that it wouldn't have a significant performance impact.

sadly, even a single cas will have a significant performance impact: solana-labs#35286 (comment)

this piece of code must wade through 1 million transactions or more per second at line rate, at the very least, to find the most-paying transactions once it sits at the front of the banking stage (in the future). And more throughput is better; there's never enough. ;)

Alternatively if you don't want to pay the price of CAS at all, I think you should look into moving these UsageQueues by value so that they don't need to be Sync at all. IMO creating an API which looks and sounds safe to be used by multiple threads, but in fact very easily introduces UB when used by multiple threads unless "held correctly", is not something we should do, especially in new code.

UsageQueue::blocked_usages must be some kind of collection (currently a vanilla VecDeque), so moving by value isn't a viable option, either...

Given hyper-sensitivity of scheduler it might be that the Mutex::try_lock CAS does have an impact - but would be good to measure that impact if we can avoid the unsafe here.

the non-trivial overhead is measured here as a cas in Arc, (basically same): solana-labs#35286 (comment)

I think the ultimate usage of the unsafe-ness works as currently written, because we only allow token access through &mut self fns on the SchedulingStateMachine. But introducing something that other devs are confused about makes this more challenging to maintain.

i admit this is confusing as hell. but i want to ship it for the merits. I'll strive to document TokenCell clearly, and it's not a general-purpose library; its use is very local to the unified scheduler code only.

VecDeque is actually pretty fast to move? In fact, doesn't this whole problem go away if instead of Arcs to share between other threads and the state machine you move Boxes back and forth?

Comment on lines +191 to +192
/// By extension, it's allowed to create _multiple_ tokens in a _single_ process as long as no
/// instance of [`TokenCell`] is shared by multiple instances of [`Token`].
Member Author

here

@alessandrod

What is the advantage of having UsageQueueLoader outside SchedulingStateMachine? In the code that exists right now UsageQueueLoader is a dashmap, which is a type with horrible performance characteristics. It seems to me that if you move it inside SchedulingStateMachine then it can be a regular hash map, tokencell isn't needed, and everything gets faster?

@ryoqun
Member Author

ryoqun commented Mar 23, 2024

UsageQueueLoader is a dashmap, which is a type with horrible performance characteristics.

you mean complaints like this? solana-labs#34955

i haven't evaluated DashMap in depth. just picked it as it's popular in the codebase. so, i'm not set on dashmap. I just need a fast & scalable multithreaded hash table.

It seems to me that if you move it inside SchedulingStateMachine then it can be a regular hash map, tokencell isn't needed

this is correct so far... yet...

..., and everything gets faster?

this isn't.

I don't think a single-threaded vanilla HashMap can compete with a 20-threaded DashMap-ed loader. The multi-threaded loader only incurs the overhead of 1 crossbeam::Channel::recv(), which is exactly 2 non-contended cas per task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

On the other hand, a tx with 100 accounts needs 100 HashMap::<Pubkey, _> lookups at the very least (excluding the insertion cost) for the single-threaded approach. Even Hash::hash()-ing a single Pubkey can be slower than 2 cas.

As for why this much low latency is needed, i documented it here: https://github.com/solana-labs/solana/pull/34676/files#r1536585166
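As a rough sketch of the arrangement described above (the UsageQueue/Task definitions, thread count, and crate choices are illustrative stand-ins, not the actual scheduler code): many loader threads do the per-account map lookups, and the single scheduler thread only pays one recv() per task.

    use std::sync::Arc;
    use crossbeam_channel::unbounded;
    use dashmap::DashMap;
    use solana_sdk::pubkey::Pubkey;

    // Simplified stand-ins; the real UsageQueue/Task types live in the scheduler code.
    #[derive(Default)]
    struct UsageQueue;

    struct Task {
        usage_queues: Vec<Arc<UsageQueue>>,
    }

    let loader: Arc<DashMap<Pubkey, Arc<UsageQueue>>> = Arc::new(DashMap::new());
    let (task_sender, task_receiver) = unbounded::<Task>();

    // Many loader threads do the per-account lookups (the expensive hash-map work)...
    for _ in 0..20 {
        let loader = Arc::clone(&loader);
        let task_sender = task_sender.clone();
        std::thread::spawn(move || {
            let addresses: Vec<Pubkey> = (0..100).map(|_| Pubkey::new_unique()).collect();
            let usage_queues = addresses
                .iter()
                .map(|address| Arc::clone(&loader.entry(*address).or_default()))
                .collect();
            task_sender.send(Task { usage_queues }).unwrap();
        });
    }
    drop(task_sender);

    // ...while the single scheduler thread pays exactly one recv() per task,
    // regardless of how many accounts each task touches.
    while let Ok(_task) = task_receiver.recv() {
        // lock accounts / schedule the task here
    }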

@alessandrod

alessandrod commented Mar 23, 2024

I don't think a single-threaded vanilla HashMap can compete with a 20-threaded DashMap-ed loader. The multi-threaded loader only incurs the overhead of 1 crossbeam::Channel::recv(), which is exactly 2 non-contended cas per task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

This analysis is missing the dashmap overhead, which is pretty terrible. Dashmap partitions the key space into shards, then shards are protected with rwlocks. If you look at dashmap usage in the validator now, it's a futex sleeping fest.
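Conceptually, that layout is roughly the following (a sketch of the idea, not DashMap's actual internals):

    use std::collections::HashMap;
    use std::sync::RwLock;

    // Conceptual model only: the key's hash selects a shard, and every read or
    // write must go through that shard's RwLock, which is where contention shows up.
    struct ShardedMap<K, V> {
        shards: Vec<RwLock<HashMap<K, V>>>,
    }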

Have you explored moving things (eg Box) instead of using Arc and inner mutability? Why didn't that work?

EDIT: Box doesn't work, because you do want to share outside the state machine.

I'll try to play around with the code and get a better feel for the API.

@ryoqun
Member Author

ryoqun commented Mar 23, 2024

I don't think a single-threaded vanilla HashMap can compete with a 20-threaded DashMap-ed loader. The multi-threaded loader only incurs the overhead of 1 crossbeam::Channel::recv(), which is exactly 2 non-contended cas per task for the scheduler thread under busy-looping or saturation for this MPSC channel arrangement, regardless of the number of accounts.

This analysis is missing the dashmap overhead, which is pretty terrible. Dashmap partitions the key space into shards, then shards are protected with rwlocks. If you look at dashmap usage in the validator now, it's a futex sleeping fest.

as you said, dashmap doesn't look promising. I wonder why this crate is used in the validator in the first place. ;) I went crate shopping. how about leapfrog? the bench numbers seem promising: https://github.com/robclu/conc-map-bench?tab=readme-ov-file#read-heavy-std-hasher

I'll try to play around with the code and get a better feel for the API.

oh, really thanks for the effort.

@ryoqun
Member Author

ryoqun commented Mar 24, 2024

as you said, dashmap doesn't look promising. I wonder why this crate is used in the validator in the first place. ;) I went crate shopping. how about leapfrog? the bench numbers seem promising: https://github.com/robclu/conc-map-bench?tab=readme-ov-file#read-heavy-std-hasher

fyi, i've run the popular bench (conc-map-bench) with Pubkey as keys. My casual finding is that leapfrog is indeed faster, but both it and dashmap are similarly scalable and greatly faster than hashmap. notably, i didn't see any noticeable perf degradation in dashmap:

[throughput chart]

[latency chart]

also, note that ReadHeavy is the appropriate bench scenario for UsageQueueLoader, as looking up an already-existing entry will be by far the most common operation; our account access pattern is heavily skewed toward certain hot accounts.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 8c0c6bd to 6157e84 Compare April 2, 2024 12:42
@ryoqun
Member Author

ryoqun commented Apr 2, 2024

btw, I made solana-unified-scheduler-logic tests run under miri in ci (6157e84, BLAKE3-team/BLAKE3#387, #488, #534). As far as i've fiddled with the code a bit locally, miri can detect all the aliasing violations around UnsafeCell that aren't caught at compile time. hope this fact can alleviate the tension around unsafe... In other words, ci can now detect UBs, should any exist due to bugs in TokenCell or misuse at the callsite.

@ryoqun
Member Author

ryoqun commented Apr 3, 2024

#129 (comment)

I'll try to play around with the code and get a better feel for the API.

@alessandrod How's it going with this? I wonder if i can merge the pr at least for now to unblock my work... i know there's some disagreement about the general design and the justification of unsafe. I appreciate your finding of the potential use-after-free UB. that said, i think this is ub-free now, right (it's now even covered by miri)? Fortunately, there's no api compatibility concern, so we can freely change the impl later as we like. or, do i absolutely have to back off from the unsafe parts? (cc: @apfitzge)

(sorry, i have to ping you here from now on, because i got inquiries from multiple people about the progress of this pr...)

@apfitzge

apfitzge commented Apr 3, 2024

Given miri UB detection, I think I'm fine to merge this if @alessandrod is as well. I was happy with the other parts of the implementation last I reviewed, and mainly had concerns about the unsafe aspects getting changed and becoming invalid. Now that miri will check that as part of ci, we can have some more confidence we won't break it.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 7b2dd3a to 29066e8 Compare April 4, 2024 06:24
@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from 29066e8 to b851631 Compare April 4, 2024 06:36
@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from b851631 to f681fde Compare April 4, 2024 06:48
@alessandrod

@alessandrod How's it going with this? I wonder if i can merge the pr at least for now to unblock my work

Sorry been looking into quic congestion issues this week. Yeah sure I have no problem to merge as is if it's blocking you.

@ryoqun ryoqun force-pushed the scheduling-state-machine-for-agave branch from bf19014 to ffa1857 Compare April 4, 2024 13:11
@ryoqun
Member Author

ryoqun commented Apr 4, 2024

@alessandrod How's it going with this? I wonder if i can merge the pr at least for now to unblock my work

... Yeah sure I have no problem to merge as is if it's blocking you.

thanks for the reply and jumping on code-review of unfamiliar code. your insightful review comments made this pr better in various ways.

I was happy with the other parts of the implementation last I reviewed, and mainly had concerns about the unsafe aspects getting changed and becoming invalid. Now that miri will check that as part of ci, we can have some more confidence we won't break it.

@apfitzge thanks for positive feedback. it seems the time has finally come. :) I'm in the ship-it mood very much.. lol

could you review the new commits since you reviewed, starting from 9eed18b if you haven't? btw, I just added negative test cases for miri: f681fde, ffa1857


@apfitzge apfitzge left a comment

Looked at commits since my last review. Looks good, like the changes you and @alessandrod made!

@ryoqun ryoqun merged commit 0b9c637 into anza-xyz:master Apr 4, 2024
50 checks passed
@ryoqun ryoqun changed the title Introduce SchedulingStateMachine for unified scheduler Introduce SchedulingStateMachine for unified scheduler Apr 5, 2024