Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor mypy to use query-based architecture #12911

Open
pranavrajpal opened this issue May 31, 2022 · 2 comments
Open

Refactor mypy to use query-based architecture #12911

pranavrajpal opened this issue May 31, 2022 · 2 comments
Labels
feature refactoring Changing mypy's internals topic-developer Issues relevant to mypy developers

Comments

@pranavrajpal
Copy link
Contributor

Feature

I think mypy would benefit from refactoring to use a query-based architecture similar to the one used by Rust's compiler or rust-analyzer because it would:

  • make some relatively common types of bugs nearly impossible
  • make mypy internals easier to understand
  • allow us to implement certain performance optimizations much more easily

Pitch

The general idea behind a query-based architecture is that, instead of organizing code based on which pass of the type checker it runs during, you would organize your code into pure functions that each let you query information about the inputted code (such as by computing the AST given the text of a file, or getting the type of a given symbol, or type checking a function to get the resulting errors). You'd also have a few non-pure input queries that you can control that represent the inputs to the type checker (such as the files to type check or the configuration options)

After that, there would be separate code for executing these queries which would be responsible for:

  • Executing your main query and all other queries it makes
  • Keeping track of what other queries each query depends on (i.e. what other queries it makes while running)
  • Caching the output of any queries as much or as little as it wants (depending on whether caching the result is worth it for that query)
  • (Optionally) saving those caches to a file and selectively invalidating caches on the next run based on new inputs to get incremental type checking

Using pure functions for queries is what allows you to get most of the benefits of this system because pure functions, by definition, always return the same result for the same arguments no matter how many times they're called. That means that the query runner is free to reorder and cache queries as much as possible, which has a lot of benefits (see below).

One other restriction that's necessary to make this work is that queries can't have cyclic dependencies because that would make it impossible to determine the dependencies of any given query. That's not really a new restriction, because a cyclic dependency essentially means infinite recursion, but it somewhat limits what we can do with this.

Benefits of Query Architecture

Code readability

Because, as mentioned earlier, queries can be run in any order and as many times as necessary, the actual logic doesn't need to concern itself with specifying either of those things. That means that, instead of having one large class that runs all the code for a single pass, updating all the results as it goes, you would have separate functions with clearly defined inputs and outputs that compute different information as a part of that pass.

Since we're organizing code just as functions that each compute some information about the passed in code, this gives you the opportunity to place code from different passes together if that makes it easier to understand when organizing the code base, instead of needing to organize based on what pass it happens to run during. Regardless of how you organize the code, the query runner will automatically figure out what order to run any queries in.

Correctness

There is a lot of code in mypy that has subtle dependencies on other parts of the codebase being run, which in practice leads to hard to track down bugs.

The most obvious example of that is the strategy we use in semantic analysis of deferring nodes whenever we don't have enough information about that node. Deferring nodes means that different parts of semanal assume that the analysis needed for progress to continue will be run in some other part of semanal. When those assumptions turn out to be wrong, that usually leads to a "Must not defer during final iteration" error or some other error about reaching the maximum iteration count.

With a query-based architecture, you would completely avoid that problem. Instead of checking if the relevant information is there and deferring if not, you would call a function to directly query that information, and the query runner would run whatever queries are necessary to compute that information, even if that meant processing code that you haven't reached yet.

That problem gets worse when you have other subtle dependencies on the order in which different functions are run. An incomplete list of examples of this in mypy:

An added correctness-related benefit of a query-based system is that, as I mentioned earlier, supporting incremental checking requires no changes to the core logic of type checking, so it would end up avoiding all of the incremental-related issues and discrepancies between non-incremental and the different types of incremental checking we support. Judging from comments like this and this, fine-grained incremental checking is rather tricky to get right, so a query-based system would be helpful there.

Performance

There's been some recent effort to improve mypy's performance both by caching more intermediate results (#12707, #12659, #12526) and by avoiding doing unnecessary work (#10922, #12854).

A benefit of a query-based system is that both of those are built in to the system. Since all the core logic is written using pure functions, caching any query can be done without affecting the correctness of the code it's running.

Additionally, avoiding doing unnecessary work is also straightforward because queries are computed on demand. Instead of having a pass that computes all the information that you might need in later passes (like we do currently), and finding specific opportunities to avoid computing something, you would just write a function that computes what you want when it's called. That means that, if you never call that function, it never gets computed, and you avoid that work completely.

Unresolved questions

I'm not really sure what the best place to implement a proof of concept for this query-based architecture. While I'm hoping that we eventually use it for all of mypy, I don't want to rewrite the entire code base at once, so finding a decent place to try this out and see how it looks in practice seems useful.

Ideally, whatever code that we use for a proof of concept would:

  • be something that takes advantage of the benefits of a query-based architecture (either by letting us avoid unnecessary work, simplify the existing code significantly, or something else)
  • not be dependent on too much other code in mypy (so we can avoid rewriting most of the code base at once)
  • not be a helper function used by plugins (in order to avoid dealing with breaking plugins)
  • be useful on its own (in case we don't continue past the proof of concept)

I don't know what parts of mypy fit that description or even part of that description, so does anyone else have any ideas for that?

@AlexWaygood AlexWaygood added refactoring Changing mypy's internals topic-developer Issues relevant to mypy developers labels May 31, 2022
@JukkaL
Copy link
Collaborator

JukkaL commented May 31, 2022

This could definitely make it easier to avoid various tricky issues. The current semantic analyzer and fine-grained incremental approaches are error-prone, in part because the original design of mypy didn't really anticipate many of the complexities involved, and retrofitting these without a big redesign left a number of rough edges.

Originally mypy targeted a Python subset with a syntax that was much easier to support. As mypy was migrated to use regular (and full) Python syntax, and various new features such as dataclasses were added, things got a lot trickier, and the original architecture had to be stretched quite thin.

I see resourcing as a big risk with the proposal. The smaller-scale redesigns we've had (fine-grained incremental checking, new semantic analyzer) took multiple months of full-time effort to complete, and this sounds larger in scope. The effort is hard to estimate, however. In my experience half the effort in a big project like this is easily spent on various complex edge cases that are hard to anticipate. The query-based architecture would be a big help with some edge cases, but there may be other things that may still be hard to support.

To avoid the risk of having a half-finished migration, it would be important to have at least two contributors committed to it (and with enough time to dedicate to it). Priorities and conditions change, and with a single contributor there is a real risk that the work is never finished if they don't have enough time to complete the project. Also it's necessary to have somebody else familiar with the work who can review PRs.

Starting with a proof of concept is a great idea. We wouldn't need to worry about resourcing yet. One possible initial target could be type checking expressions, though it depends on a lot of other machinery and interacts with plugins. However, many of our performance bottlenecks are related to type checking expressions, so just finishing this could provide some incremental value. On the other hand, perhaps this is "too easy" and doesn't give any indication of how difficult the full project would be.

@pranavrajpal
Copy link
Contributor Author

I see resourcing as a big risk with the proposal. The smaller-scale redesigns we've had (fine-grained incremental checking, new semantic analyzer) took multiple months of full-time effort to complete, and this sounds larger in scope.

That's definitely a good point, but I think this proposal has the benefit that it's easier to do it in smaller pieces. Effort spent implementing fine-grained incremental checking or a new semantic analyzer is essentially wasted if those features don't reach a point where they can be turned on, but switching to a query-based architecture, on the other hand, can be done by changing some parts of the code to use queries without changing the rest. After all, a function can use a query-based architecture internally to compute something and then return it normally without any of its callers needing to know that queries were used. Using queries in only part of the codebase limits how much we can take advantage of the architecture, but it still results in working usable code.

One possible initial target could be type checking expressions

That seems like an interesting idea, especially since it would let us see if the performance benefits are actually worthwhile. Do you happen to have any benchmarks or specific performance bottlenecks that could be used as a target to optimize?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature refactoring Changing mypy's internals topic-developer Issues relevant to mypy developers
Projects
None yet
Development

No branches or pull requests

3 participants