
[red-knot] Add "cheap" program.snapshot #11172

Merged
merged 4 commits into main from red-knot-snapshotting on Apr 30, 2024

Conversation

@MichaReiser (Member) commented Apr 27, 2024

Summary

This PR implements a snapshotting mechanism for Database and makes query cancellation a Database concept. The motivation for this change is the LSP, where the main loop uses spawn to schedule some work on its thread pool. spawn requires that all arguments have a 'static lifetime, which means that passing a &Program as we've done in the CLI main loop won't be possible.

To solve this, this PR introduces a program.snapshot method that returns an owned but read-only database instance. The LSP can safely pass the snapshot to its worker thread because it satisfies the 'static lifetime requirement.
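
As a rough, self-contained sketch of why an owned snapshot helps (the types and the code below are illustrative, not the actual Program/Snapshot API from this PR):

```rust
use std::sync::Arc;
use std::thread;

// Illustrative stand-ins: a snapshot is an owned handle over shared,
// read-only state, so it satisfies the `'static` bound of `spawn`.
struct Program {
    jars: Arc<Vec<String>>, // stands in for the shared query caches
}

struct Snapshot {
    jars: Arc<Vec<String>>,
}

impl Program {
    fn snapshot(&self) -> Snapshot {
        // Cheap: only the `Arc` pointer is cloned, not the cached data.
        Snapshot {
            jars: Arc::clone(&self.jars),
        }
    }
}

fn main() {
    let program = Program {
        jars: Arc::new(vec!["main.py".to_string()]),
    };

    // `thread::spawn` (like the LSP's thread pool `spawn`) requires `'static`:
    // moving the owned snapshot in works, borrowing `&program` would not.
    let snapshot = program.snapshot();
    let handle = thread::spawn(move || snapshot.jars.len());
    assert_eq!(handle.join().unwrap(), 1);
}
```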

The main challenge of the snapshot function is that we don't want to create a deep-clone of the database including the jars state because:

  • a deep-clone of the caches is very expensive
  • it would mean that caching is restricted to a single thread. However, we want to share a single cache between all threads.

Getting cheap "clones" is achieved by wrapping the jars state in an Arc. However, this introduces a new problem: mutating is now expensive, because the data inside an Arc is read-only by default. Solving this requires introducing cancellation.
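
The trade-off in miniature, using plain std types rather than the PR's jars:

```rust
use std::sync::Arc;

fn main() {
    let mut jars = Arc::new(vec![1, 2, 3]);

    // "Cloning" is cheap: it only bumps the reference count.
    let snapshot = Arc::clone(&jars);

    // Mutation requires exclusive access: while the snapshot is alive,
    // `get_mut` refuses to hand out a `&mut`.
    assert!(Arc::get_mut(&mut jars).is_none());

    // Once all other references are gone, in-place mutation works again.
    drop(snapshot);
    Arc::get_mut(&mut jars).unwrap().push(4);
    assert_eq!(*jars, vec![1, 2, 3, 4]);
}
```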

The implementation makes use of the fact that mutating an Arc in place is possible if the Arc has exactly one reference. The Database now stores its own cancellation token and calling a mutation method on the storage (or database) automatically requests cancellation of all queries and waits until all other references to jars are dropped. It is then possible to safely take the mutable reference from the Arc.
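
A minimal sketch of that flow (the Storage/Jars layout and the busy-wait below are simplifications, not the PR's exact code, which waits on a WaitGroup instead of spinning):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct Jars {
    // ... query caches live here ...
    files: Vec<String>,
}

struct Storage {
    jars: Arc<Jars>,
    cancelled: Arc<AtomicBool>, // shared with running queries
}

impl Storage {
    fn jars_mut(&mut self) -> &mut Jars {
        // 1. Ask all in-flight queries to stop at their next cancellation check.
        self.cancelled.store(true, Ordering::SeqCst);

        // 2. Wait until every snapshot has dropped its `Arc<Jars>` clone.
        while Arc::strong_count(&self.jars) > 1 {
            std::thread::yield_now();
        }

        // 3. We are now the only owner, so in-place mutation is safe.
        self.cancelled.store(false, Ordering::SeqCst);
        Arc::get_mut(&mut self.jars).expect("no other references remain")
    }
}
```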

Waiting is implemented using a WaitGroup. The WaitGroup is stored in the Jars storage, and its counter is incremented whenever a Snapshot is created and automatically decremented when a Snapshot drops.
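
A simplified stand-in for that WaitGroup (in the PR it lives in the Jars storage; here the snapshot holds it directly to keep the sketch short):

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Default)]
struct WaitGroup {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl WaitGroup {
    fn increment(&self) {
        *self.inner.0.lock().unwrap() += 1;
    }

    fn decrement(&self) {
        *self.inner.0.lock().unwrap() -= 1;
        self.inner.1.notify_all();
    }

    // Called by the main database before mutating: blocks until every
    // outstanding snapshot has decremented the counter.
    fn wait_until_zero(&self) {
        let mut count = self.inner.0.lock().unwrap();
        while *count > 0 {
            count = self.inner.1.wait(count).unwrap();
        }
    }
}

struct Snapshot {
    wait_group: WaitGroup,
}

impl Drop for Snapshot {
    fn drop(&mut self) {
        // Dropping a snapshot automatically releases its "slot".
        self.wait_group.decrement();
    }
}
```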

The last missing piece is to ensure that queries stop "soonish" when cancellation is requested. This is achieved by changing the db.jar() method to return a QueryResult. It returns Err(QueryError::Cancelled) in case cancellation has been requested. This requires that we change the return type of each query to QueryResult because they won't have a result when they're cancelled.
Checking cancellation in the jars method has the advantage that it is automatically exercised by every query that uses caching, because the method is needed to retrieve the query storage. Queries can also check for cancellation manually by calling db.cancelled()? if needed, but that should only be necessary for long-running operations that never issue a new query.
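
Sketched out (QueryError/QueryResult are the names from this PR; the Jar/Db layout and the source_text query below are invented for illustration):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug)]
enum QueryError {
    Cancelled,
}

type QueryResult<T> = Result<T, QueryError>;

struct Jar; // stands in for a query's cache storage

struct Db {
    cancelled: AtomicBool,
    jar: Jar,
}

impl Db {
    // Every cached query goes through `jar()` to reach its storage, so the
    // cancellation check runs automatically for each (sub-)query.
    fn jar(&self) -> QueryResult<&Jar> {
        if self.cancelled.load(Ordering::SeqCst) {
            return Err(QueryError::Cancelled);
        }
        Ok(&self.jar)
    }
}

// Queries now return `QueryResult` and propagate cancellation with `?`.
fn source_text(db: &Db) -> QueryResult<String> {
    let _storage = db.jar()?;
    Ok("print('hello')".to_string())
}
```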

Catch unwind

An alternative to QueryResult is to panic with a specific error and catch that error at a catch_unwind boundary. This is what salsa does. I decided to use a Result instead to make cancellation more explicit. I think it is important to be aware that a request can be cancelled, because it means we need to ensure that we never write "partial" results into the cache, and, if we do, that we have a way to undo the change when the query gets cancelled so that the cache is left in a clean state.
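
For contrast, the panic-based alternative looks roughly like this (a sketch of the general pattern, not salsa's actual code):

```rust
use std::panic::{self, AssertUnwindSafe};

struct Cancelled; // marker payload used to signal cancellation via unwinding

fn some_query() -> String {
    // Inside a query, cancellation is signalled by panicking with the marker.
    panic::panic_any(Cancelled);
}

fn run_with_cancellation() -> Option<String> {
    // The outer boundary catches the marker and turns it into "no result";
    // any other panic keeps propagating.
    match panic::catch_unwind(AssertUnwindSafe(some_query)) {
        Ok(result) => Some(result),
        Err(payload) if payload.downcast_ref::<Cancelled>().is_some() => None,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```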

Test plan

I ran the linter and used touch to trigger a re-run. This PR adds a new RED_KNOT_SLOW_LINT option that introduces an artificial slowdown in lint_syntax for testing cancellation.
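
If you want to reproduce this locally, the slowdown hook amounts to something like the following sketch (assuming RED_KNOT_SLOW_LINT is read from the environment; see the diff for the exact mechanism and delay):

```rust
use std::time::Duration;

fn maybe_slow_down_lint() {
    // Assumed mechanism: an environment variable injects an artificial delay
    // into `lint_syntax` so a `touch` during linting reliably hits a
    // still-running query and exercises cancellation.
    if std::env::var("RED_KNOT_SLOW_LINT").is_ok() {
        std::thread::sleep(Duration::from_millis(500));
    }
}
```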

Attribution

The ideas here are heavily inspired by Salsa and sometimes applied 1:1.

@MichaReiser force-pushed the red-knot-snapshotting branch 5 times, most recently from 5c3f9d0 to c3f3724 on April 28, 2024 10:40
github-actions bot commented Apr 28, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser changed the title from [red-knot] Add snapshotting mechanism to [red-knot] Add "cheap" program.snapshot on Apr 29, 2024
}

impl CancellationTokenSource {
    pub fn new() -> Self {
        Self {
-           signal: Arc::new((Mutex::new(false), Condvar::default())),
+           signal: Arc::new(AtomicBool::new(false)),
@MichaReiser (Member, Author):

I simplified the implementation because we never used the wait method that needs the condvar.
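
For reference, the simplified source boils down to roughly this (a sketch; method names may not match the actual code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, Default)]
struct CancellationTokenSource {
    signal: Arc<AtomicBool>,
}

impl CancellationTokenSource {
    fn cancel(&self) {
        // A plain atomic flag is enough once nobody blocks on a condvar.
        self.signal.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.signal.load(Ordering::SeqCst)
    }
}
```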

.send(OrchestratorMessage::CheckProgramCancelled)
.unwrap(),
}
MainLoopMessage::CheckProgram { revision } => {
@MichaReiser (Member, Author):

@snowsignal I think the loop here is now very similar to what we have in the LSP. That's why I think it should now be easy to integrate the database into the LSP.

@snowsignal (Member):

This does look similar! I think the main difference is that here, a response is sent back to the main loop, whereas server tasks don't (currently) communicate with the main loop. That shouldn't make things harder for the server (in fact, I think it will be even easier than what we have here), I just wanted to point that out.

@MichaReiser (Member, Author):

That's true, thanks for pointing this out. I agree that this shouldn't matter, because that communication is only about how results are reported back to the user interface: in the LSP case that's done by sending a response, and in the CLI case by sending a message back to the main loop.

revision,
} => {
// Only take the diagnostics if they are for the latest revision.
if self.revision == revision {
@MichaReiser (Member, Author):

@carljm Thanks for the suggestion to use a revision. I think it simplifies a lot.


    for file in files {
-       self.queue_file(file, context.clone())?;
+       self.queue_file(file, context.clone());
@MichaReiser (Member, Author):

This can deadlock when files.len() > max_concurrency. I have a follow-up PR to fix this.

@@ -135,23 +127,59 @@ impl SemanticDb for Program {

impl Db for Program {}

impl Database for Program {
@MichaReiser (Member, Author):

I think we can derive the implementations in the future by having a #[derive(Db(field_name=jars, jars=(SourceJar, SemanticJar)))]

@MichaReiser added the internal (An internal refactor or improvement) label on Apr 29, 2024
@MichaReiser marked this pull request as ready for review on April 29, 2024 11:08
///
/// The method cancels any pending queries of other workers and waits for them to complete so that
/// this instance is the only instance holding a reference to the jars.
pub(crate) fn jars_mut(&mut self) -> &mut Db::Jars {
@MichaReiser (Member, Author):

This takes a &mut self, so this method can only be called from the main database but never from a Snapshot.
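
Put differently (a self-contained illustration of the same guarantee, not the actual types):

```rust
use std::ops::Deref;

struct Db {
    jars: Vec<String>,
}

impl Db {
    fn jars(&self) -> &Vec<String> {
        &self.jars
    }

    fn jars_mut(&mut self) -> &mut Vec<String> {
        &mut self.jars
    }
}

// A snapshot only implements `Deref`, never `DerefMut`.
struct Snapshot {
    db: Db,
}

impl Deref for Snapshot {
    type Target = Db;

    fn deref(&self) -> &Db {
        &self.db
    }
}

fn main() {
    let snapshot = Snapshot {
        db: Db { jars: vec![] },
    };
    let _shared = snapshot.jars(); // fine: goes through `Deref`
    // snapshot.jars_mut();        // does not compile: needs `&mut Db`
}
```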

@snowsignal (Member) left a comment:

Thank you for implementing this! The detailed comments you left made this pull request really easy to read, and I can see a way to integrate this with the server as-is 😄

I've left a few small comments but otherwise this looks great!

crates/red_knot/src/db.rs (outdated comment, resolved)
Comment on lines +84 to +96
impl<DB> std::ops::Deref for Snapshot<DB>
where
    DB: ParallelDatabase,
{
    type Target = DB;

    fn deref(&self) -> &DB {
        &self.db
    }
}
@snowsignal (Member):

I really like how you ensured snapshot immutability here 😄

@MichaReiser (Member, Author):

Thank you, but I don't really deserve the credit because I only stole the idea from salsa :D But I agree, it's a very clever (and simple) way.

Comment on lines +271 to +274
if cancelled {
    Err(QueryError::Cancelled)
} else {
    Ok(result)
@snowsignal (Member):

Shouldn't we return Err(QueryError::Cancelled) immediately if we get CheckFileMessage::Cancelled, instead of waiting for the loop to finish?

@MichaReiser (Member, Author):

The problem with returning immediately is that we then drop the only reference to the receiver. That means that any sender.send(message).unwrap() calls in the threads checking the files will fail. I'm not sure there's a more idiomatic way of doing this other than waiting for all file-check operations to complete, to be sure there will be no more incoming messages, and only then exiting.

@snowsignal (Member):

Can't we just handle that error in the threads checking the files, though? If sender.send(message) fails, we can still exit the thread gracefully instead of panicking. See the sketch below.
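
Something like this minimal sketch (using std mpsc instead of the actual channel type):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (sender, receiver) = mpsc::channel::<String>();

    let worker = thread::spawn(move || {
        // If the receiver is already gone (e.g. the orchestrator returned
        // early), exit the worker gracefully instead of panicking on unwrap.
        if sender.send("diagnostics".to_string()).is_err() {
            return;
        }
    });

    drop(receiver); // simulate the main loop exiting early on cancellation
    worker.join().unwrap();
}
```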

@MichaReiser (Member, Author):

I guess we could, but it would make it harder to find the bug if we accidentally drop the receiver for another reason. Anyway, the next PR reworks this quite significantly, and I think there we can return immediately.

@carljm (Contributor) left a comment:

Looks good!

To be clear, the "immutable snapshot" means we won't be applying file changes or invalidating anything concurrently (which is great, because it will simplify invalidation a lot). But caching within e.g. TypeStore uses interior mutability and can still be updated in workers, because it implements its own internal locking and doesn't rely on having an exclusive reference to the db.

@MichaReiser (Member, Author):

> To be clear, the "immutable snapshot" means we won't be applying file changes or invalidating anything concurrently (which is great, because it will simplify invalidation a lot). But caching within e.g. TypeStore uses interior mutability and can still be updated in workers, because it implements its own internal locking and doesn't rely on having an exclusive reference to the db.

Yes, that's correct. The immutable snapshot still allows caches to record new values, but no "inputs" should change (which would change the result of the analysis). That's another way to think about it.
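
To make that concrete, a generic sketch of the kind of internally locked cache carljm describes (not TypeStore's actual layout):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// A cache like this can be updated through `&self`, so workers holding a
// read-only snapshot can still record newly computed values; only the
// analysis *inputs* have to stay frozen.
#[derive(Default)]
struct TypeCache {
    types: RwLock<HashMap<u32, String>>,
}

impl TypeCache {
    fn get_or_insert(&self, id: u32, compute: impl FnOnce() -> String) -> String {
        if let Some(ty) = self.types.read().unwrap().get(&id) {
            return ty.clone();
        }
        let ty = compute();
        self.types.write().unwrap().insert(id, ty.clone());
        ty
    }
}
```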

@MichaReiser enabled auto-merge (squash) on April 30, 2024 07:06
@MichaReiser merged commit bc03d37 into main on Apr 30, 2024
19 checks passed
@MichaReiser deleted the red-knot-snapshotting branch on April 30, 2024 07:13