Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create migration tool for call-by-name syntax #20572

Closed
stuhood opened this issue Feb 17, 2024 · 31 comments
Closed

Create migration tool for call-by-name syntax #20572

stuhood opened this issue Feb 17, 2024 · 31 comments
Assignees
Labels
external-plugins Issues related to plugins outside this repository

Comments

@stuhood
Copy link
Sponsor Member

stuhood commented Feb 17, 2024

To complete #19730 by migrating Pants itself and any in-repo plugins to call-by-name syntax, we should create a built-in goal which would:

  1. Take as argument a list of modules / files to migrate
  2. Load the rule graph
  3. For each @rule in the module, look up its solution in the rule graph.
  4. Replace each Get in a @rule, with its corresponding call-by-name syntax (using whatever code-rewriting API is best maintained currently)

EDIT: #20574 implemented items 2 and 3.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Feb 17, 2024

cc @sureshjoshi , who mentioned being interested in this at the recent team meeting.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Feb 17, 2024

@sureshjoshi : One way to split this up might be for me to implement something that does steps 1 through 3 from the description, and then for you to do step 4.

I will probably have 1-3 done this weekend, but I don't really have any context on 4.

@sureshjoshi
Copy link
Member

@stuhood Sure, whatever you think the most optimal way of going through this is.

The essence of what I bought up was in the meeting was that, I want faster Pants startup times and for run_pants integration testing to be less painful. Benjy mentioned that your call by name syntax would be a big win for my underlying pain points.

I won't be touching/looking at anything until my current project reaches its conference, and then I'm allocating a sizeable chunk of time (60 hours starting in mid-March , 80ish in April, 80ish in May) for a few Pants-centric items that I finally want to get out of my fork, and this.

As far as (4) off the top of my head, I'm not sure. I've done this kinda thing in Typescript and in Swift (using SwiftSyntax), but not in Python. Without much thought, my idea was to use something like treesitter's AST, or maybe even ast-grep, but I haven't really looked into this yet.

@sureshjoshi
Copy link
Member

Oh, also @stuhood @benjyw - is call-by-name intended to be a full replacement for the existing pattern or an 'also we have this'?

I ask because I'm wondering if it's worth writing a lint warning rule for incoming PRs to gently recommend using call-by-name.

@benjyw
Copy link
Sponsor Contributor

benjyw commented Feb 18, 2024

Medium term we'd like to get rid of the solver entirely, since it's complex code, so we want to no longer support call by signature.

@sureshjoshi
Copy link
Member

Gotcha, so essentially a forced migration for in-repo plugins

@benjyw
Copy link
Sponsor Contributor

benjyw commented Feb 18, 2024

We may want to bump the major version to 3.x when we get rid of the solver, as this will require substantial rewrites of external plugins.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Feb 18, 2024

We may want to bump the major version to 3.x when we get rid of the solver, as this will require substantial rewrites of external plugins.

It shouldn't, assuming that they have executed the rewrite? Once the rewrite has been executed, the solver just (mostly) isn't used.

More generally, thinking that this change might be necessary (due to the complexity of the solver) was in the back of my mind ever since 2.0.0, and was one of the main reasons why I didn't want to stabilize the plugin system. I think that with this type of change behind us, it would still be possible to stabilize the plugin system within the 2.x.y series.

stuhood added a commit that referenced this issue Feb 23, 2024
…ation. (#20574)

As discussed in #20572, this change adds a built-in goal
`migrate-call-by-name` which emits all `Get` calls which should be
migrated to call-by-name syntax.

A followup change should use the resulting information (`@rule` function
pointers and `Get` signatures) to execute rewrites of the relevant
files.

Emits lines like:
```
<function infer_python_conftest_dependencies at 0x113633af0>
  Get(<class 'pants.engine.target.Targets'>, [<class 'pants.engine.addresses.Addresses'>]) -> <function resolve_targets at 0x112335d30>
  Get(<class 'pants.engine.internals.graph.Owners'>, [<class 'pants.engine.internals.graph.OwnersRequest'>]) -> <function find_owners at 0x112349160>
  Get(<class 'pants.backend.python.util_rules.ancestor_files.AncestorFiles'>, [<class 'pants.backend.python.util_rules.ancestor_files.AncestorFilesRequest'>]) -> <function find_ancestor_files at 0x1131e6790>
```
@stuhood stuhood removed their assignment Feb 24, 2024
@stuhood
Copy link
Sponsor Member Author

stuhood commented Feb 26, 2024

One other item here: MultiGet is still named MultiGet, but if we're going to rename it (I liked @benjyw's suggestion of await concurrently(...) here), we should likely do so before shipping this migration tool, so that we don't need to do multiple large scale rewrites.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 2, 2024

It looks we already use tokenize internally for the BUILD file rewriting tool.

It seems to be possible to preserve most formatting (importantly: comments) with tokenize, so I think that one way to do this would be to find the bounds of a @rule function, and then replace all Get-call tokens inside of it (tokenize.tokenize includes token positions) with the relevant name-calls. You'd also need to add imports, which could be slightly trickier, but not impossible.

@sureshjoshi
Copy link
Member

sureshjoshi commented Mar 8, 2024

That's pretty cool, I'd never used tokenize before - I was thinking ast-grep might be a good approach, but the fewer deps, the better (granted, I'd only used ast-grep for find/replace, not for anything more sophisticated).

Just doing some ad hoc hacking - seems to be pretty do-able so far, when I'm actually looking into this, I'll probably slice off a small plugin to do a prototype PR on

@sureshjoshi sureshjoshi self-assigned this Mar 12, 2024
@sureshjoshi
Copy link
Member

sureshjoshi commented Mar 15, 2024

@stuhood Now that the completions PR is being reviewed, will be focusing on this next

Before I jump into the re-writing part, what is the canonical single Get re-write code? I see implicitly and a lot of ellipses, but I don't know how many of those are literals and how many are "fill in the blanks" between the discussion and various tickets/code.

The test cases of implicitly are also a bit confusing:

    @rule
    def rule1(arg: int) -> int:
        return arg

    @rule
    async def rule2() -> int:
        return 2

    async def rule3() -> int:
        one_explicit = await rule1(1)
        one_implicit = await rule1(**implicitly(int(1)))
        two = await rule2()
        return one_explicit + one_implicit + two

And from the original proposal:

# Equivalent to `Get(ReturnType)` (i.e. no arguments to `Get`). All arguments would be
# computed from Params which were already in scope:
await the_rule_to_call(**implicitly())

# Using a positional arg. Roughly equivalent to `Get(ReturnType, Arg1())`: the remaining arguments would be
# computed from Params which were already in scope:
await the_rule_to_call(Arg1(), **implicitly())

# Two positional args. Roughly equivalent to `Get(ReturnType, {Arg1(): Arg1, Arg2(): Arg2)`. Note that
# `@rule` graph solving is still necessary for this case, because the called `@rule` may have `Get`s
# which have additional dependencies.
await the_rule_to_call(Arg1(), Arg2())

# Equivalent to `Get(ReturnType, Arg1())`:
await the_rule_to_call(**implicitly(Arg1()))

# Equivalent to `Get(ReturnType, {Arg1(): Arg1, Arg2})`:
await the_rule_to_call(**implicitly({Arg1(): Arg1, Arg2(): Arg2}))

As a practical example from the codebase:

# rules.py
@rule
async def get_graphql_uvicorn_setup(
    request: GraphQLUvicornServerSetupRequest, graphql: GraphQLSubsystem
) -> UvicornServerSetup:
    browser = await Get(Browser, BrowserRequest, request.browser_request())
    return UvicornServerSetup(graphql_uvicorn_setup(browser, graphql=graphql))

# browser.py
@rule
async def get_browser(request: BrowserRequest, open_binary: OpenBinary) -> Browser:
    return Browser(open_binary, request.protocol, request.server)

# breakdown
<function get_graphql_uvicorn_setup at 0x11776b700>
  Get(<class 'pants_explorer.server.browser.Browser'>, 
     [<class 'pants_explorer.server.browser.BrowserRequest'>]) 
     -> <function get_browser at 0x112b8c670>

would that turn into one of the following?

browser = await get_browser(request.browser_request(), **implicitly())

browser = await get_browser(**implicitly(request.browser_request()))

Also, would we want to explicitly type the browser: Browser on re-write? Not sure if that would net us any mypy/pyright wins

sureshjoshi added a commit to sureshjoshi/pants that referenced this issue Mar 15, 2024
sureshjoshi added a commit to sureshjoshi/pants that referenced this issue Mar 15, 2024
…ll comments and whitespace

- Could use the AST representation to replace lines, or jump out to tokenize to perform the replacement
@sureshjoshi
Copy link
Member

Alright, so having a bit too much fun with this. In my branch, I wrote a basic script to play around with how we can do this migration (outside of Pants, as iterating in Pants right now takes a long time).

It looks for @rule async functions, and then accounts for single-line, and multi-line GET calls, but doesn't replace them with anything meaningful just yet (waiting on Stu's reply to the previous comment before I spend time mapping code).

Single-line, single-GET

    # Before
    go_mod_addr = await Get(OwningGoMod, OwningGoModRequest(transitive_targets.roots[0].address))
    # After
    go_mod_addr = await call_by_some_name(TODO TODO TODO)

Multi-line, single-GET

    # Before
    addresses_for_thrift = await Get(
        PythonModuleOwners,
        PythonModuleOwnersRequest(
            "thrift",
            resolve=resolve,
            locality=locality,
        ),
    )

    # After
    addresses_for_thrift = await call_by_some_name(TODO TODO TODO)

@benjyw Do you recall where we landed on MultiGet? Are those to be named concurrently?

The script doesn't add imports, but that should be easy to brute force - and I have a more clever idea that I want to play around with later on.

sureshjoshi added a commit to sureshjoshi/pants that referenced this issue Mar 16, 2024
@sureshjoshi
Copy link
Member

@benjyw @stuhood Cheeky bump on this one re: concurrently? and what is the canonical single Get re-write code? above.

Not huge urgency, but just wanted to get some clean re-writes while it's all still fresh

@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 19, 2024

@benjyw @stuhood Cheeky bump on this one re: concurrently?

Yea, I think that we should go with concurrently.

and what is the canonical single Get re-write code? above.

I don't think that the resulting formatting is too important, because I think that anyone who cares about formatting will be using an auto-formatter. My guess is that many calls will be about the same length, so if you wanted bonus points, you could match the existing "multi-line-ness" of the call you are replacing. But just choosing one output wrapping strategy and then using it should be totally fine due to auto-formatters.

@sureshjoshi
Copy link
Member

sureshjoshi commented Mar 19, 2024

Thanks @stuhood but to clarify, I'm not entirely sure what I should replace the existing GET call to in the examples above.

Would it be the same as what's in the sample code above already? Or is there a new syntax?

As in, there are several ways in the examples above - any preference to which I use?

@sureshjoshi
Copy link
Member

Some notes to self while playing around:

# Where

@rule
async def get_browser(request: BrowserRequest, open_binary: OpenBinary) -> Browser:

# Transforming:

browser = await Get(Browser, BrowserRequest, request.browser_request())

# Into:

browser: Any = await get_browser(request.browser_request())
-> Pyright cannot infer type (becomes Any) 
-> Runtime TypeError: get_browser() got multiple values for argument 'request'

browser: Browser = await get_browser(request.browser_request(), **implicitly())
-> Pyright correctly infers type
-> Runtime TypeError: get_browser() got multiple values for argument 'request'

browser: Browser = await get_browser(**implicitly(request.browser_request()))
-> Success

browser: Browser = await get_browser(**implicitly({request.browser_request(): BrowserRequest}))
-> Success

@sureshjoshi
Copy link
Member

@stuhood Are Digests intended to remain Gets?

<function bsp_resolve_scala_metadata at 0x111ca4dc0>
  Get(<class 'pants.jvm.resolve.coursier_fetch.ToolClasspath'>, [<class 'pants.jvm.resolve.coursier_fetch.ToolClasspathRequest'>]) -> <function materialize_classpath_for_tool at 0x110cc50d0>
  Get(<class 'pants.jvm.jdk_rules.JdkEnvironment'>, [<class 'pants.jvm.jdk_rules.JdkRequest'>]) -> <function prepare_jdk_environment at 0x110cdfe50>
  Get(<class 'pants.engine.process.ProcessResult'>, [<class 'pants.engine.process.Process'>]) -> <function fallible_to_exec_result_or_raise at 0x10a417670>
  Get(<class 'pants.backend.scala.util_rules.versions.ScalaArtifactsForVersionResult'>, [<class 'pants.backend.scala.util_rules.versions.ScalaArtifactsForVersionRequest'>]) -> <function resolve_scala_artifacts_for_version at 0x111c479d0>

This CreateDigest isn't included:

    leak_sandbox_path_digest = await Get(
        Digest,
        CreateDigest(
            [
                FileContent(
                    cmd,
                    leak_jdk_sandbox_paths.encode("utf-8"),
                    is_executable=True,
                ),
            ]
        ),
    )

Also interesting to note that a couple of the replacements aren't in that rule, but rather a coroutine called in the function:

scala_runtime = await _materialize_scala_runtime_jars(scala_version)

async def _materialize_scala_runtime_jars(scala_version: ScalaVersion) -> Snapshot:
    scala_artifacts = await Get(
        ScalaArtifactsForVersionResult, ScalaArtifactsForVersionRequest(scala_version)
    )

    tool_classpath = await Get(
        ToolClasspath,
        ToolClasspathRequest(
            artifact_requirements=ArtifactRequirements.from_coordinates(
                scala_artifacts.all_coordinates
            ),
        ),
    )

    return await Get(
        Snapshot,
        AddPrefix(tool_classpath.content.digest, f"jvm/scala-runtime/{scala_version}"),
    )

That, in itself, isn't a big deal - since my AST code picks it up, but note to self: figure out how to lookup the mapper function

@sureshjoshi
Copy link
Member

I ran a subset of the migration (39 files converted) - Found a circular import bug, so need to fix that before I can dig into errors.

I ran into a lot of type and property errors, especially in the BSP/JDK area. In a few of the cases, I think the new syntax was underspecified, and I needed to use the explicit typing (hard to automate this, unless I have it run after each partial migration). Other errors, I don't even know where to start with - so I just reverted the code. It was like the new function just wasn't being called at all, which is strange.

Anyways, I was more curious if updating those 39 files netted any performance gains, but not yet it seems. Modding some code and re-running led to a 28 second wait with the current codebase, and 27.5 seconds with the migrated syntax.

Looking deeper into this, the current syntax I tested the migrations for is only a small fraction of what we need.

sureshjoshi added a commit to sureshjoshi/pants that referenced this issue Mar 20, 2024
@sureshjoshi
Copy link
Member

Another note to self. The ast.unparse function works differently for Python 3.12 and 3.11 🤦🏽

3.11 is "more correct"

@sureshjoshi
Copy link
Member

@stuhood @benjyw
Can I get a thumbs-up/thumbs-down on this first pass of the call-by-name on the python linters (selected these specifically, because they're easy to check if they're still working at runtime).

main...sureshjoshi:pants:20572-call-by-name-samples

I ran:

python3.11 call-by-name.py # from my other branch
pants --changed-since=HEAD fix fmt lint

By coincidence, we get examples of a few different Get cases, as well as some edge cases:

  • Shadowed imports handled by using asname in the import from pants.engine.internals.graph import coarsened_targets as coarsened_targets_get
    • _get can be bikeshedded all we want
  • Quotes in quotes falls over on Python 3.12 ast.unparse, even though it was supposedly fixed, which is why I used Python 3.11 for this: description=f"Run add-trailing-comma on {pluralize(len(request.files), 'file')}.",
    • Will look more into this later, and see what happens on Python 3.9
  • Missing Digest-related graph mappings, so I'm unable to map those gets: e.g. Get(Digest, RemovePrefix(result.output_digest, REPORT_DIR))
    • There are other missing graph mappings, but Digests are the easiest to call out right now

@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 21, 2024

That looks great, and thanks for dealing with the rough edges!


There is one simplification that it would be really good to implement. For the example:

resolved_dependencies = await resolve_parsed_dependencies(
    **implicitly(ResolvedParsedPythonDependenciesRequest(fs, parsed_dependencies, resolve))
)

... because the "implicit" argument already matches the type of one of the @rules arguments, you can directly pass it positionally, like so:

resolved_dependencies = await resolve_parsed_dependencies(
   ResolvedParsedPythonDependenciesRequest(fs, parsed_dependencies, resolve),
    **implicitly(),
)

We'd like to encourage this for a few reasons, so it would be really awesome if you were able to do it as part of the rewrite:

  1. It gives you type safety.
    • If the called method doesn't have a matching argument, type checkers will complain (unlike with implicit arguments, which might get type-converted first, or be passed to dependency @rules of the rule you are calling).
  2. It reduces the amount of solving that is necessary.
    • Positional arguments are completely ignored in @rule graph solving, which improves solving performance.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 21, 2024

Having said that, passing an argument as positional may not work in 100% of cases currently: for example if one of the other arguments to a @rule also requires the argument, it would need to be passed implicitly.

So you might try doing it only for single argument @rules, or perhaps only for @rules whose other arguments are all Subsystems (meaning that they are very likely to not need the positional argument).

@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 21, 2024

  • Missing Digest-related graph mappings, so I'm unable to map those gets: e.g. Get(Digest, RemovePrefix(result.output_digest, REPORT_DIR))
    • There are other missing graph mappings, but Digests are the easiest to call out right now

@benjyw gave all of the native rules names: if they're not already exposed from the native engine, he might be able to help with exposing them.

@sureshjoshi
Copy link
Member

@stuhood Thanks for reviewing!

Positional arguments are completely ignored in @rule graph solving, which improves solving performance.

Gotcha, I didn't know this was a perf improvement, I just went with consistency of calling code assuming all else was equal. I'll take a look at carving out well-typed, single-parameters cases as positional arguments (within reason).

@benjyw gave all of the native rules names: if they're not already exposed from the native engine, he might be able to help with exposing them.

Ah interesting, I thought that this was something happening based on which backends were enabled ("so the rewritten file must actually have been loaded" from the tracking issue), and maybe I needed to enable something. I intentionally haven't dug too far into the Rust side of this, as I'm just trying to get the AST-rewriter handled first, before I dig into missing mappings.

@sureshjoshi
Copy link
Member

sureshjoshi commented Mar 24, 2024

@stuhood Trying to write some integration tests to get the migrate goal up for a PR, but I'm running into trouble as the new backend I've generated isn't getting picked up for syntax migrations?

https://github.com/sureshjoshi/pants/blob/20572-call-by-name-migration/src/python/pants/goal/migrate_call_by_name_integration_test.py

Am I missing something here from the registration? Or is there a problem trying to register as an in-repo plugin?

EDIT: Looks like putting in a contrived @goal_rule seems to kick this off? I'm not sure if that's a built-in rule specific requirement though?

sureshjoshi added a commit to sureshjoshi/pants that referenced this issue Mar 24, 2024
@stuhood
Copy link
Sponsor Member Author

stuhood commented Mar 25, 2024

BuiltinGoals are ... built in. AFAIK, there isn't a way to add them from a plugin/backend.

But to be clear: the PR I already landed adds the goal, and is already integrated AFAICT.

@sureshjoshi
Copy link
Member

I think we're talking about different things. I'm writing tests for the migrate goal you added, with the syntax AST stuff I added - as a sanity check.

I was trying to get the session.scheduler_session.rule_graph_rule_gets() to see the contrived test code I added, so that it would emit the appropriate syntax migrations, but it required adding a goal_rule to the test code before the rule graph picked it up.

Specifically I added this for my new rules to get picked up: https://github.com/pantsbuild/pants/pull/20714/files#diff-e3b5d93ad8c721f83f88479f20ff9347b17932cf10659ddc278840791adbad8dR28-R33

@huonw huonw added the external-plugins Issues related to plugins outside this repository label Apr 1, 2024
sureshjoshi added a commit that referenced this issue Apr 12, 2024
This PR creates a native Python built-in goal which performs the call-by-name migration for passed in files/targets (covers items 1 and 4 in #20572).

Note: This is not a completed feature, as several bugs have been split out into sub-tickets for investigation.

The simplest use case is: `pants migrate-call-by-name --json ::`
@sureshjoshi
Copy link
Member

sureshjoshi commented Apr 12, 2024

I believe this ticket is complete and recently merged. The overall feature isn't complete until 2 outstanding tickets/investigations are completed, and we actually run it over the whole code-base - but I think 1 through 4 are done as pants migrate-call-by-name --json ::

See #20744 - Not entirely sure this is needed, haven't looked into it enough yet.
See #20745

@stuhood
Copy link
Sponsor Member Author

stuhood commented Apr 16, 2024

🎊

@benjyw
Copy link
Sponsor Contributor

benjyw commented May 28, 2024

Re MultiGet - I guess it should be called gather a la asyncio.gather ? In fact, could we just use asyncio.gather?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external-plugins Issues related to plugins outside this repository
Projects
None yet
Development

No branches or pull requests

4 participants