[red-knot] Add "cheap" program.snapshot
#11172
Conversation
```diff
     program.snapshot
 }

 impl CancellationTokenSource {
     pub fn new() -> Self {
         Self {
-            signal: Arc::new((Mutex::new(false), Condvar::default())),
+            signal: Arc::new(AtomicBool::new(false)),
```
I simplified the implementation because we never used the `wait` method that needs the condvar.
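For reference, here is a minimal sketch of what an `AtomicBool`-based token source can look like. The `cancel`/`token`/`is_cancelled` names are assumptions for illustration, not necessarily the PR's exact API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug, Clone, Default)]
pub struct CancellationTokenSource {
    signal: Arc<AtomicBool>,
}

impl CancellationTokenSource {
    pub fn new() -> Self {
        Self {
            signal: Arc::new(AtomicBool::new(false)),
        }
    }

    /// Requests cancellation; all tokens handed out by this source observe it.
    pub fn cancel(&self) {
        self.signal.store(true, Ordering::SeqCst);
    }

    pub fn token(&self) -> CancellationToken {
        CancellationToken {
            signal: self.signal.clone(),
        }
    }
}

#[derive(Debug, Clone)]
pub struct CancellationToken {
    signal: Arc<AtomicBool>,
}

impl CancellationToken {
    pub fn is_cancelled(&self) -> bool {
        self.signal.load(Ordering::SeqCst)
    }
}
```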
```rust
    .send(OrchestratorMessage::CheckProgramCancelled)
    .unwrap(),
}
MainLoopMessage::CheckProgram { revision } => {
```
@snowsignal I think the loop here is now very similar to what we have in the LSP. That's why I think it should now be easy to integrate the database into the LSP.
This does look similar! I think the main difference is that here, a response is sent back to the main loop, whereas server tasks don't (currently) communicate with the main loop. That shouldn't make things harder for the server (in fact, I think it will be even easier than what we have here), I just wanted to point that out.
That's true. Thanks for pointing this out. I agree, I don't think this should matter because that communication is only about how results are communicated back to the user interface. In the LSP case, that's done by sending a response and in the CLI case it's done by sending a message back to the main loop.
```rust
    revision,
} => {
    // Only take the diagnostics if they are for the latest revision.
    if self.revision == revision {
```
@carljm Thanks for the suggestion to use a revision. I think it simplifies a lot.
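For illustration, the revision pattern boils down to a counter that the main loop bumps on every file change; results that come back tagged with an older revision are dropped. All names in this sketch are invented for the example:

```rust
struct MainLoop {
    revision: u64,
}

impl MainLoop {
    /// Every file change bumps the revision, so any in-flight check
    /// now carries a stale revision number.
    fn on_file_changed(&mut self) {
        self.revision += 1;
    }

    /// Only take the diagnostics if they are for the latest revision;
    /// results from a cancelled or stale run are silently dropped.
    fn on_check_completed(&mut self, revision: u64, diagnostics: Vec<String>) {
        if self.revision == revision {
            println!("{diagnostics:?}");
        }
    }
}
```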
```diff
 for file in files {
-    self.queue_file(file, context.clone())?;
+    self.queue_file(file, context.clone());
```
This can deadlock when `files.len() > max_concurrency`. I have a follow-up PR to fix this.
```diff
@@ -135,23 +127,59 @@ impl SemanticDb for Program {

 impl Db for Program {}

 impl Database for Program {
```
I think we can derive these implementations in the future by having a `#[derive(Db(field_name=jars, jars=(SourceJar, SemanticJar)))]` attribute.
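For illustration, here is a hypothetical hand-written version of what such a derive could generate for a database with two jars. `HasJar` and these signatures are extrapolated from the jar pattern in this PR (cancellation checking elided), so treat them as assumptions:

```rust
// Hypothetical expansion target for the proposed derive; not real code
// from this PR.
trait HasJar<J> {
    fn jar(&self) -> &J;
}

struct SourceJar;
struct SemanticJar;

struct Program {
    jars: (SourceJar, SemanticJar),
}

impl HasJar<SourceJar> for Program {
    fn jar(&self) -> &SourceJar {
        &self.jars.0
    }
}

impl HasJar<SemanticJar> for Program {
    fn jar(&self) -> &SemanticJar {
        &self.jars.1
    }
}
```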
```rust
///
/// The method cancels any pending queries of other workers and waits for them to complete so that
/// this instance is the only instance holding a reference to the jars.
pub(crate) fn jars_mut(&mut self) -> &mut Db::Jars {
```
This takes a `&mut self`, so this method can only be called from the main database but never from a `Snapshot`.
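A reduced demonstration of why that holds, using bare stand-in types: because `Snapshot` implements `Deref` but never `DerefMut` (see the `Deref` impl later in this thread), `&mut self` methods like `jars_mut` are simply unreachable through a snapshot:

```rust
use std::ops::Deref;

#[allow(dead_code)]
struct Program {
    data: u32,
}

#[allow(dead_code)]
impl Program {
    fn jars(&self) -> u32 {
        self.data
    }

    fn jars_mut(&mut self) -> &mut u32 {
        &mut self.data
    }
}

struct Snapshot<DB> {
    db: DB,
}

impl<DB> Deref for Snapshot<DB> {
    type Target = DB;

    fn deref(&self) -> &DB {
        &self.db
    }
}

fn main() {
    let snapshot = Snapshot { db: Program { data: 1 } };
    let _ = snapshot.jars(); // OK: `&self` methods are reachable through `Deref`
    // snapshot.jars_mut();  // error: no `DerefMut`, so mutation is impossible
}
```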
Thank you for implementing this! The detailed comments you left made this pull request really easy to read, and I can see a way to integrate this with the server as-is 😄
I've left a few small comments but otherwise this looks great!
```rust
impl<DB> std::ops::Deref for Snapshot<DB>
where
    DB: ParallelDatabase,
{
    type Target = DB;

    fn deref(&self) -> &DB {
        &self.db
    }
}
```
I really like how you ensured snapshot immutability here 😄
Thank you, but I don't really deserve the credit because I only stole the idea from salsa :D But I agree, it's a very clever (and simple) way.
```rust
if cancelled {
    Err(QueryError::Cancelled)
} else {
    Ok(result)
}
```
Shouldn't we return `Err(QueryError::Cancelled)` immediately if we get `CheckFileMessage::Cancelled`, instead of waiting for the loop to finish?
The problem with returning immediately is that we then drop the only reference to `receiver`. That means that any `sender.send(message).unwrap()` calls will fail in the threads checking the files. I'm not sure there's a more idiomatic way of doing this other than waiting for all file check operations to complete, so we can be sure there will be no more incoming messages, and only then exiting.
Can't we just handle that error in the threads checking the files, though? Like, if `sender.send(message)` fails, we can still exit the thread gracefully instead of panicking.
I guess we could, but it would make it harder to find the bug if we accidentally drop the receiver for some other reason. Anyway, the next PR reworked this quite significantly, and I think there we can return immediately.
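For reference, the graceful-exit alternative discussed above could look like the sketch below. The `Completed` variant and the `worker` function are invented for the example; the point is that a worker can treat a closed channel as "nobody is interested in this result anymore":

```rust
use std::sync::mpsc;

enum CheckFileMessage {
    Completed(String), // `Completed` is invented for this example
}

fn worker(sender: mpsc::Sender<CheckFileMessage>, diagnostics: String) {
    if sender.send(CheckFileMessage::Completed(diagnostics)).is_err() {
        // The receiver was dropped (e.g. the check was cancelled):
        // exit quietly instead of panicking via `.unwrap()`.
        return;
    }
    // ...continue with the next file
}
```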
Looks good!
To be clear, the "immutable snapshot" means we won't be applying file changes or invalidating anything concurrently (which is great, because it will simplify invalidation a lot). But caching within e.g. `TypeStore` uses interior mutability and can still be updated in workers, because it implements its own internal locking and doesn't rely on having an exclusive reference to the db.
Yes, that's correct. The immutable snapshot still allows caches to store new values, but no "inputs" should change (which would change the result of the analysis).
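As an illustration of that distinction, a cache with interior mutability can keep memoizing through `&self`, even on an immutable snapshot, because it does its own locking. `TypeStore` follows this general shape; the type and fields below are made up:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Made-up stand-in for an interior-mutability cache like `TypeStore`.
#[derive(Default)]
struct TypeCache {
    by_node: Mutex<HashMap<u32, String>>,
}

impl TypeCache {
    /// Takes `&self`, not `&mut self`: workers holding an immutable
    /// snapshot can still insert new cache entries, because the store
    /// does its own locking internally.
    fn get_or_insert_with(&self, key: u32, compute: impl FnOnce() -> String) -> String {
        let mut map = self.by_node.lock().unwrap();
        map.entry(key).or_insert_with(compute).clone()
    }
}
```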
Summary

This PR implements a snapshotting mechanism for `Database` and makes query cancellation a `Database` concept. The motivation for this change is the LSP, where the main loop uses `spawn` to schedule work on its thread pool. `spawn` requires that all arguments have a `'static` lifetime, which means passing a `&Program` as we've done in the CLI main loop isn't possible.

To solve this, this PR introduces a `program.snapshot` method that returns an owned but read-only database instance. The LSP can safely pass the snapshot to its worker threads because it satisfies the `'static` lifetime requirement.

The main challenge of the `snapshot` function is that we don't want to create a deep clone of the database, including the `jars` state, because that would be expensive and defeat the purpose of a cheap snapshot.

Getting cheap "clones" is achieved by wrapping the `jars` state in an `Arc`. However, this introduces a new problem: mutating is now expensive, because `Arc`s are read-only by default. Solving this requires introducing cancellation.

The implementation makes use of the fact that mutating an `Arc` in place is possible if the `Arc` has exactly one reference. The `Database` now stores its own cancellation token, and calling a mutation method on the storage (or database) automatically requests cancellation of all queries and waits until all other references to `jars` are dropped. It is then possible to safely take the mutable reference from the `Arc`.
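A minimal sketch of that mutation path, with the cancellation and waiting steps reduced to comments (`Jars` and `Database` stand in for the real types):

```rust
use std::sync::Arc;

struct Jars {
    revision: u64, // placeholder for the actual per-jar query storage
}

struct Database {
    jars: Arc<Jars>,
}

impl Database {
    fn jars_mut(&mut self) -> &mut Jars {
        // 1. Request cancellation of all in-flight queries (elided).
        // 2. Wait until every snapshot has dropped its `Arc` clone (elided).
        // 3. The strong count is now 1, so `Arc::get_mut` returns `Some`
        //    and we can mutate in place without deep-cloning the storage.
        Arc::get_mut(&mut self.jars).expect("all snapshots to have been dropped")
    }
}

fn main() {
    let mut db = Database {
        jars: Arc::new(Jars { revision: 0 }),
    };
    db.jars_mut().revision += 1;
}
```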
Waiting is implemented using a `WaitGroup`. The `WaitGroup` is stored in the `Jars` storage, and its counter is incremented whenever a `Snapshot` is created and automatically decremented when a `Snapshot` drops.
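A hand-rolled sketch of that behavior (the PR may well use an existing `WaitGroup` implementation; `enter` and `WaitGuard` are names invented for this illustration):

```rust
use std::sync::{Arc, Condvar, Mutex};

#[derive(Clone, Default)]
struct WaitGroup {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

struct WaitGuard {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl WaitGroup {
    /// Called when a `Snapshot` is created: increments the counter and
    /// returns a guard that the snapshot holds on to.
    fn enter(&self) -> WaitGuard {
        *self.inner.0.lock().unwrap() += 1;
        WaitGuard {
            inner: self.inner.clone(),
        }
    }

    /// Called by the mutation path: blocks until every guard (and thus
    /// every snapshot) has been dropped.
    fn wait(&self) {
        let (count, condvar) = &*self.inner;
        let mut count = count.lock().unwrap();
        while *count > 0 {
            count = condvar.wait(count).unwrap();
        }
    }
}

impl Drop for WaitGuard {
    fn drop(&mut self) {
        let (count, condvar) = &*self.inner;
        *count.lock().unwrap() -= 1;
        condvar.notify_all();
    }
}
```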
The last missing piece is to ensure that queries stop "soonish" when cancellation is requested. This is achieved by changing the `db.jar()` method to return a `QueryResult`. It returns `Err(QueryError::Cancelled)` in case cancellation has been requested. This requires changing the return type of each query to `QueryResult`, because queries won't have a result when they're cancelled.

Checking cancellation in the `jars` method has the advantage that it is automatically exercised by each query that uses caching, because the method is needed to retrieve the query storage. Queries can manually test for cancellation if needed by calling `db.cancelled()?`, but that should only be necessary for long operations that never issue a new query.
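A sketch of how these pieces can fit together. `QueryError`, `QueryResult`, `jars`, and `cancelled` follow the names used above; the exact bodies are assumptions for illustration:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug)]
enum QueryError {
    Cancelled,
}

type QueryResult<T> = Result<T, QueryError>;

struct Jars; // placeholder for the per-jar query storage

struct Database {
    jars: Arc<Jars>,
    cancellation: Arc<AtomicBool>,
}

impl Database {
    /// Every cached query retrieves its storage through here, so every
    /// such query observes cancellation "for free".
    fn jars(&self) -> QueryResult<&Jars> {
        self.cancelled()?;
        Ok(&self.jars)
    }

    /// Long-running queries that never issue a sub-query can check
    /// manually, e.g. `db.cancelled()?` inside a hot loop.
    fn cancelled(&self) -> QueryResult<()> {
        if self.cancellation.load(Ordering::Acquire) {
            Err(QueryError::Cancelled)
        } else {
            Ok(())
        }
    }
}
```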
Catch unwind

An alternative to `QueryResult` is to panic with a specific error and catch it at a catch-unwind boundary. This is what salsa does. I decided to use a `Result` instead to make cancellation more explicit. I think it is important to be aware that a request can be cancelled, because it means we must ensure we never write "partial" results into the cache, and if we do, we need a way to undo the change when the query gets cancelled, to leave the cache in a clean state.
Test plan

I ran the linter and used `touch` to trigger a re-run. This PR adds a new `RED_KNOT_SLOW_LINT` environment variable that adds an artificial slowdown to `lint_syntax` for testing cancellation.

Attribution
The ideas here are heavily inspired by Salsa and sometimes applied 1:1.