
[red-knot] Add "cheap" program.snapshot #11172

Merged
merged 4 commits into main from red-knot-snapshotting on Apr 30, 2024

Conversation

@MichaReiser (Member) commented Apr 27, 2024

Summary

This PR implements a snapshotting mechanism for Database and makes query cancellation a Database concept. The motivation for this change is the LSP, where the main loop uses spawn to schedule some work on its thread pool. spawn requires that all arguments have a 'static lifetime, which means that passing a &Program as we've done in the CLI main loop won't be possible.

To solve this, this PR introduces a program.snapshot method that returns an owned but read-only database instance. The LSP can safely pass the snapshot to its worker thread because it satisfies the 'static lifetime requirement.
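
As a rough, self-contained sketch of why an owned snapshot helps (the types and the code below are illustrative, not the actual Program/Snapshot API from this PR):

```rust
use std::sync::Arc;
use std::thread;

// Illustrative stand-ins: a snapshot is an owned handle over shared,
// read-only state, so it satisfies the `'static` bound of `spawn`.
struct Program {
    jars: Arc<Vec<String>>, // stands in for the shared query caches
}

struct Snapshot {
    jars: Arc<Vec<String>>,
}

impl Program {
    fn snapshot(&self) -> Snapshot {
        // Cheap: only the `Arc` pointer is cloned, not the cached data.
        Snapshot {
            jars: Arc::clone(&self.jars),
        }
    }
}

fn main() {
    let program = Program {
        jars: Arc::new(vec!["main.py".to_string()]),
    };

    // `thread::spawn` (like the LSP's thread pool `spawn`) requires `'static`:
    // moving the owned snapshot in works, borrowing `&program` would not.
    let snapshot = program.snapshot();
    let handle = thread::spawn(move || snapshot.jars.len());
    assert_eq!(handle.join().unwrap(), 1);
}
```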

The main challenge of the snapshot function is that we don't want to create a deep-clone of the database including the jars state because:

  • a deep-clone of the caches is very expensive
  • it would mean that caching is restricted to a single thread. However, we want to share a single cache between all threads.

Getting cheap "clones" is achieved by wrapping the jars state in an Arc. However, this introduces a new problem: mutating is now expensive, because the data inside an Arc is read-only by default. Solving this requires introducing cancellation.
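
The trade-off in miniature, using plain std types rather than the PR's jars:

```rust
use std::sync::Arc;

fn main() {
    let mut jars = Arc::new(vec![1, 2, 3]);

    // "Cloning" is cheap: it only bumps the reference count.
    let snapshot = Arc::clone(&jars);

    // Mutation requires exclusive access: while the snapshot is alive,
    // `get_mut` refuses to hand out a `&mut`.
    assert!(Arc::get_mut(&mut jars).is_none());

    // Once all other references are gone, in-place mutation works again.
    drop(snapshot);
    Arc::get_mut(&mut jars).unwrap().push(4);
    assert_eq!(*jars, vec![1, 2, 3, 4]);
}
```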

The implementation makes use of the fact that mutating an Arc in place is possible if the Arc has exactly one reference. The Database now stores its own cancellation token and calling a mutation method on the storage (or database) automatically requests cancellation of all queries and waits until all other references to jars are dropped. It is then possible to safely take the mutable reference from the Arc.
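
A minimal sketch of that flow (the Storage/Jars layout and the busy-wait below are simplifications, not the PR's exact code, which waits on a WaitGroup instead of spinning):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct Jars {
    // ... query caches live here ...
    files: Vec<String>,
}

struct Storage {
    jars: Arc<Jars>,
    cancelled: Arc<AtomicBool>, // shared with running queries
}

impl Storage {
    fn jars_mut(&mut self) -> &mut Jars {
        // 1. Ask all in-flight queries to stop at their next cancellation check.
        self.cancelled.store(true, Ordering::SeqCst);

        // 2. Wait until every snapshot has dropped its `Arc<Jars>` clone.
        while Arc::strong_count(&self.jars) > 1 {
            std::thread::yield_now();
        }

        // 3. We are now the only owner, so in-place mutation is safe.
        self.cancelled.store(false, Ordering::SeqCst);
        Arc::get_mut(&mut self.jars).expect("no other references remain")
    }
}
```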

Waiting is implemented using a WaitGroup. The WaitGroup is stored in the Jars storage, and its counter is incremented whenever a Snapshot is created and automatically decremented when a Snapshot drops.
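
A simplified stand-in for that WaitGroup (in the PR it lives in the Jars storage; here the snapshot holds it directly to keep the sketch short):

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Default)]
struct WaitGroup {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl WaitGroup {
    fn increment(&self) {
        *self.inner.0.lock().unwrap() += 1;
    }

    fn decrement(&self) {
        *self.inner.0.lock().unwrap() -= 1;
        self.inner.1.notify_all();
    }

    // Called by the main database before mutating: blocks until every
    // outstanding snapshot has decremented the counter.
    fn wait_until_zero(&self) {
        let mut count = self.inner.0.lock().unwrap();
        while *count > 0 {
            count = self.inner.1.wait(count).unwrap();
        }
    }
}

struct Snapshot {
    wait_group: WaitGroup,
}

impl Drop for Snapshot {
    fn drop(&mut self) {
        // Dropping a snapshot automatically releases its "slot".
        self.wait_group.decrement();
    }
}
```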

The last missing piece is to ensure that queries stop "soonish" when cancellation is requested. This is achieved by changing the db.jar() method to return a QueryResult. It returns Err(QueryError::Cancelled) in case cancellation has been requested. This requires that we change the return type of each query to QueryResult because they won't have a result when they're cancelled.
Checking cancellation in the jars method has the advantage that it is automatically exercised by every query that uses caching, because the method is needed to retrieve the query storage. Queries can also check for cancellation manually by calling db.cancelled()? if needed, but that should only be necessary for long-running operations that never issue a new query.
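
Sketched out (QueryError/QueryResult are the names from this PR; the Jar/Db layout and the source_text query below are invented for illustration):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug)]
enum QueryError {
    Cancelled,
}

type QueryResult<T> = Result<T, QueryError>;

struct Jar; // stands in for a query's cache storage

struct Db {
    cancelled: AtomicBool,
    jar: Jar,
}

impl Db {
    // Every cached query goes through `jar()` to reach its storage, so the
    // cancellation check runs automatically for each (sub-)query.
    fn jar(&self) -> QueryResult<&Jar> {
        if self.cancelled.load(Ordering::SeqCst) {
            return Err(QueryError::Cancelled);
        }
        Ok(&self.jar)
    }
}

// Queries now return `QueryResult` and propagate cancellation with `?`.
fn source_text(db: &Db) -> QueryResult<String> {
    let _storage = db.jar()?;
    Ok("print('hello')".to_string())
}
```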

Catch unwind

An alternative to QueryResult is to panic with a specific error and catch that error at a catch_unwind boundary. This is what salsa does. I decided to use a Result instead to make cancellation more explicit. I think it is important to be aware that a request can be cancelled, because it means we need to ensure that we never write "partial" results into the cache, and, if we do, that we have a way to undo the change when the query gets cancelled so that the cache is left in a clean state.
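
For contrast, the panic-based alternative looks roughly like this (a sketch of the general pattern, not salsa's actual code):

```rust
use std::panic::{self, AssertUnwindSafe};

struct Cancelled; // marker payload used to signal cancellation via unwinding

fn some_query() -> String {
    // Inside a query, cancellation is signalled by panicking with the marker.
    panic::panic_any(Cancelled);
}

fn run_with_cancellation() -> Option<String> {
    // The outer boundary catches the marker and turns it into "no result";
    // any other panic keeps propagating.
    match panic::catch_unwind(AssertUnwindSafe(some_query)) {
        Ok(result) => Some(result),
        Err(payload) if payload.downcast_ref::<Cancelled>().is_some() => None,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```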

Test plan

I ran the linter and used touch to trigger a re-run. This PR adds a new RED_KNOT_SLOW_LINT option that introduces an artificial slowdown in lint_syntax for testing cancellation.
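
If you want to reproduce this locally, the slowdown hook amounts to something like the following sketch (assuming RED_KNOT_SLOW_LINT is read from the environment; see the diff for the exact mechanism and delay):

```rust
use std::time::Duration;

fn maybe_slow_down_lint() {
    // Assumed mechanism: an environment variable injects an artificial delay
    // into `lint_syntax` so a `touch` during linting reliably hits a
    // still-running query and exercises cancellation.
    if std::env::var("RED_KNOT_SLOW_LINT").is_ok() {
        std::thread::sleep(Duration::from_millis(500));
    }
}
```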

Attribution

The ideas here are heavily inspired by Salsa and sometimes applied 1:1.

@MichaReiser force-pushed the red-knot-snapshotting branch 5 times, most recently from 5c3f9d0 to c3f3724 on April 28, 2024 10:40
github-actions bot commented Apr 28, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser changed the title from [red-knot] Add snapshotting mechanism to [red-knot] Add "cheap" program.snapshot on Apr 29, 2024
}

impl CancellationTokenSource {
    pub fn new() -> Self {
        Self {
-           signal: Arc::new((Mutex::new(false), Condvar::default())),
+           signal: Arc::new(AtomicBool::new(false)),
@MichaReiser (Member, Author):

I simplified the implementation because we never used the wait method that needs the condvar.
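
For reference, the simplified source boils down to roughly this (a sketch; method names may not match the actual code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, Default)]
struct CancellationTokenSource {
    signal: Arc<AtomicBool>,
}

impl CancellationTokenSource {
    fn cancel(&self) {
        // A plain atomic flag is enough once nobody blocks on a condvar.
        self.signal.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.signal.load(Ordering::SeqCst)
    }
}
```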

.send(OrchestratorMessage::CheckProgramCancelled)
.unwrap(),
}
MainLoopMessage::CheckProgram { revision } => {
@MichaReiser (Member, Author):

@snowsignal I think the loop here is now very similar to what we have in the LSP. That's why I think it should now be easy to integrate the database into the LSP.

@snowsignal (Member):

This does look similar! I think the main difference is that here, a response is sent back to the main loop, whereas server tasks don't (currently) communicate with the main loop. That shouldn't make things harder for the server (in fact, I think it will be even easier than what we have here), I just wanted to point that out.

@MichaReiser (Member, Author):

That's true, thanks for pointing this out. I agree that this shouldn't matter, because that communication is only about how results are reported back to the user interface: in the LSP case that's done by sending a response, and in the CLI case by sending a message back to the main loop.

revision,
} => {
// Only take the diagnostics if they are for the latest revision.
if self.revision == revision {
@MichaReiser (Member, Author):

@carljm Thanks for the suggestion to use a revision. I think it simplifies a lot.


    for file in files {
-       self.queue_file(file, context.clone())?;
+       self.queue_file(file, context.clone());
@MichaReiser (Member, Author):

This can deadlock when files.len() > max_concurrency. I have a follow-up PR to fix this.

@@ -135,23 +127,59 @@ impl SemanticDb for Program {

impl Db for Program {}

impl Database for Program {
@MichaReiser (Member, Author):

I think we can derive the implementations in the future by having a #[derive(Db(field_name=jars, jars=(SourceJar, SemanticJar)))]

@MichaReiser added the internal (An internal refactor or improvement) label on Apr 29, 2024
@MichaReiser marked this pull request as ready for review on April 29, 2024 11:08
///
/// The method cancels any pending queries of other workers and waits for them to complete so that
/// this instance is the only instance holding a reference to the jars.
pub(crate) fn jars_mut(&mut self) -> &mut Db::Jars {
@MichaReiser (Member, Author):

This takes a &mut self, so this method can only be called from the main database but never from a Snapshot.
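
Put differently (a self-contained illustration of the same guarantee, not the actual types):

```rust
use std::ops::Deref;

struct Db {
    jars: Vec<String>,
}

impl Db {
    fn jars(&self) -> &Vec<String> {
        &self.jars
    }

    fn jars_mut(&mut self) -> &mut Vec<String> {
        &mut self.jars
    }
}

// A snapshot only implements `Deref`, never `DerefMut`.
struct Snapshot {
    db: Db,
}

impl Deref for Snapshot {
    type Target = Db;

    fn deref(&self) -> &Db {
        &self.db
    }
}

fn main() {
    let snapshot = Snapshot {
        db: Db { jars: vec![] },
    };
    let _shared = snapshot.jars(); // fine: goes through `Deref`
    // snapshot.jars_mut();        // does not compile: needs `&mut Db`
}
```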

@snowsignal (Member) left a comment:

Thank you for implementing this! The detailed comments you left made this pull request really easy to read, and I can see a way to integrate this with the server as-is 😄

I've left a few small comments but otherwise this looks great!

crates/red_knot/src/db.rs (outdated comment, resolved)
Comment on lines +84 to +96
impl<DB> std::ops::Deref for Snapshot<DB>
where
    DB: ParallelDatabase,
{
    type Target = DB;

    fn deref(&self) -> &DB {
        &self.db
    }
}
@snowsignal (Member):

I really like how you ensured snapshot immutability here 😄

@MichaReiser (Member, Author):

Thank you, but I don't really deserve the credit because I only stole the idea from salsa :D But I agree, it's a very clever (and simple) way.

Comment on lines +271 to +274
if cancelled {
    Err(QueryError::Cancelled)
} else {
    Ok(result)
@snowsignal (Member):

Shouldn't we return Err(QueryError::Cancelled) immediately if we get CheckFileMessage::Cancelled, instead of waiting for the loop to finish?

@MichaReiser (Member, Author):

The problem with returning immediately is that we then drop the only reference to the receiver. That means that any sender.send(message).unwrap() calls in the threads checking the files will fail. I'm not sure there's a more idiomatic way of doing this other than waiting for all file-check operations to complete, to be sure there will be no more incoming messages, and only then exiting.

@snowsignal (Member):

Can't we just handle that error in the threads checking the files, though? If sender.send(message) fails, we can still exit the thread gracefully instead of panicking. See the sketch below.
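
Something like this minimal sketch (using std mpsc instead of the actual channel type):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (sender, receiver) = mpsc::channel::<String>();

    let worker = thread::spawn(move || {
        // If the receiver is already gone (e.g. the orchestrator returned
        // early), exit the worker gracefully instead of panicking on unwrap.
        if sender.send("diagnostics".to_string()).is_err() {
            return;
        }
    });

    drop(receiver); // simulate the main loop exiting early on cancellation
    worker.join().unwrap();
}
```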

@MichaReiser (Member, Author):

I guess we could, but it would make it harder to find the bug if we accidentally drop the receiver for another reason. Anyway, the next PR reworks this quite significantly, and I think there we can return immediately.

@carljm (Contributor) left a comment:

Looks good!

To be clear, the "immutable snapshot" means we won't be applying file changes or invalidating anything concurrently (which is great, because it will simplify invalidation a lot). But caching within e.g. TypeStore uses interior mutability and can still be updated in workers, because it implements its own internal locking and doesn't rely on having an exclusive reference to the db.

@MichaReiser (Member, Author):

> To be clear, the "immutable snapshot" means we won't be applying file changes or invalidating anything concurrently (which is great, because it will simplify invalidation a lot). But caching within e.g. TypeStore uses interior mutability and can still be updated in workers, because it implements its own internal locking and doesn't rely on having an exclusive reference to the db.

Yes, that's correct. The immutable snapshot still allows caches to record new values, but no "inputs" should change (which would change the result of the analysis). That's another way to think about it.
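
To make that concrete, a generic sketch of the kind of internally locked cache carljm describes (not TypeStore's actual layout):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// A cache like this can be updated through `&self`, so workers holding a
// read-only snapshot can still record newly computed values; only the
// analysis *inputs* have to stay frozen.
#[derive(Default)]
struct TypeCache {
    types: RwLock<HashMap<u32, String>>,
}

impl TypeCache {
    fn get_or_insert(&self, id: u32, compute: impl FnOnce() -> String) -> String {
        if let Some(ty) = self.types.read().unwrap().get(&id) {
            return ty.clone();
        }
        let ty = compute();
        self.types.write().unwrap().insert(id, ty.clone());
        ty
    }
}
```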

@MichaReiser enabled auto-merge (squash) on April 30, 2024 07:06
@MichaReiser merged commit bc03d37 into main on Apr 30, 2024
19 checks passed
@MichaReiser deleted the red-knot-snapshotting branch on April 30, 2024 07:13