Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

notatallshaw · 2024-01-30T02:01:07Z

What's the problem this feature will solve?

Pip's dependency resolution process, particularly in complex backtracking scenarios, currently faces performance issues due to repetitive calculations, leading to O(n^2) complexity. This issue arises when checking if a "name" is part of a cause but also hinders the implementation of more complex checks, as they would slow down simple backtracking scenarios.

Describe the solution you'd like

The solution proposes adding a new abstract method, filter_unsatisfied_names, in resolvelib, to be implemented by Pip. This method is designed to work alongside the existing get_preference method, refining the backtracking process during dependency resolution.

filter_unsatisfied_names pre-filters unsatisfied names before they are processed by get_preference.
Unlike get_preference, which is called once for each name, filter_unsatisfied_names is called only once per backtracking round.
All unsatisfied names are provided to filter_unsatisfied_names, in contrast to get_preference, which receives one name at a time.
Hence filter_unsatisfied_names is able to short-circuiting checks.
And filter_unsatisfied_names can determine if multiple names are important to backtrack on simultaneously.

Alternative Solutions

Enhancing get_preference: Improvements to this method could offer some benefits, but it would struggle to enable complex backtracks without imposing performance penalties on simple backtracks.
Implementing Caching: While caching can reduce redundant work, it might lead to increased memory usage and added complexity.

Additional context

I am not tied to the name filter_unsatisfied_names or even to this specific approach, as long as it's possible to filter unsatisfied names more than one at a time.

A PR is open on the resolvelib side, but it is unlikely to ever be merged without buy-in from Pip maintainers: sarugaku/resolvelib#145

Code of Conduct

I agree to follow the PSF Code of Conduct.

The text was updated successfully, but these errors were encountered:

pfmoore · 2024-01-30T07:03:23Z

Two points here.

I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.
This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.

notatallshaw · 2024-01-30T07:19:28Z

I don’t see anything in the linked resolvelib discussion suggesting that they are waiting for “buy-in from pip maintainers”. But if it helps, I’m happy to say we should use the method if it’s implemented in resolvelib.

Pip is the main driver of the resolvelib. And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers: sarugaku/resolvelib#134. So I don't see how this would be any different.

This isn’t about naming, but I don’t know what it means for a name to be “unsatisfied”. I assume the resolvelib PR will be updated to document the meaning of this term before it’s approved.

I would gladly take another suggestion on a name, it's simply a reference to the existing variable unsatisfied_names: https://github.com/sarugaku/resolvelib/blob/main/src/resolvelib/resolvers.py#L422 Which is the name of all dependencies which do not yet have a satisfying solution.

Happy to fill out the docstring to explain that in more detail.

pfmoore · 2024-01-30T10:44:14Z

And this tagently related issue (different problem, possibly same solution), has been waiting on feedback from Pip maintainers

I added a comment on that issue.

As I said, it’s not about names. What I want is a user-facing explanation of the term “unsatisfied”. Something that will let future pip maintainers review and maintain this function without needing to read and understand the resolvelib code.

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

Which is the name of all dependencies which do not yet have a satisfying solution.

So, what? A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How? What is the client expected to achieve? What happens, for example, if I just return the first name? That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise the risk that they get it wrong in the first place.

notatallshaw · 2024-01-30T14:06:21Z

I added a comment on that issue.

Appreciate it!

I have to say, I’m struggling to understand why this is so hard to get across. The whole point of having a separate library for the resolver is so that pip only needs to interact with a clear, documented API. I’m not being obtuse here -I genuinely do not know what it means for a name to be “unsatisfied”.

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

As I said, I'm more than happy if someone thinks there is a more descriptive name, I am not a fan of it, but I thought anyone who had to deal directly would resolution steps would at least understand it.

A provider implementation’s filter_unsatisfied_names method will be called during the resolve by resolvelib, and it’s passed a list of names for which the algorithm hasn’t yet found a solution. The provider should return a filtered copy of this list. How?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

What is the client expected to achieve?

That the next step of the backtracking phase will only backtrack on those packages the client returned in the list.

What happens, for example, if I just return the first name?

Then the next step of the backtracking phase will only focus on that first name.

In real world performance for Pip it would affect the performance like so:

For complicated backtracks - Likely there will be a lot more resolution steps
In simple backtracks - Probably not too much difference, likely minor performance benefit from get_preference only being called once per resolution step, instead of upto n
In some random cases much much faster due to luck that being a good way to solve that set of requirements

That’s filtered, isn’t it?

Yup, if that's how the client wants to focus the resolve that is completely valid. Honestly there may well be real world use cases for this approach, something about the client wanting to force resolution in the exact order they provided it without resolvelib checking get_preference on each name,

That’s filtered, isn’t it? If the client “gets it wrong” will the result be an incorrect solution?

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

The most critical thing I want to understand is what the risk is, if a future pip maintainer makes a change that impacts this function (either directly or indirectly) and gets it wrong. And following on from that, what information is available to minimise 1the risk that they get it wrong in the first place.

What's new is if they fail to honor returning a non-empty subset of the name list then resolvelib will throw an exception. Otherwise the pitfalls are exactly the same as get_preference, it could cause resolution to take a potentially impractical amount of time to complete.

pfmoore · 2024-01-30T15:06:16Z

I'm struggling to understand what you don't understand sorry, this is the same term used in the new resolver since it was made default in late 2020, and means the same thing, I picked that name because it was already a known term in the resolver codebase.

That's partly my fault - I was focusing on the term "unsatisfied" when my confusion is actually more with the idea of "filtering". The rest of my comment was hopefully a little clearer. But having said that, my point here is that while it may be a known term in the resolver codebase (by which I presume you mean resolvelib) it's not well-known in the pip codebase (where, in fact, "satisfied" is typically used in the sense that a requirement is satisfied by an already-installed package). I'm trying to make sure that we don't end up in a situation where understanding the internal workings (and terminology) of resolvelib is a prerequisite for working on pip's resolution code. Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

It doesn't need to be a copy, the provider can just return the list straight back and they get the same current behavior. I don't know what you mean by "How?", any way the provider wants to filter, or not filter, that list.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference¹, but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

As long as the client returns a non-empty subset of the name list (including upto the list itself), and as long as there are no bugs in resolvelib, and ignoring arbitary limits like ResolutionTooDeep, then a correct solution will eventually be found.

Thank you. That information is important and should very definitely be made clear in the resolvelib documentation. It means that filter_unsatisfied_names is purely an optimisation, and will never change the result of the resolve() call². Until now, I wasn't aware that this was the case.

But if it's not allowed to return an empty subset, that should be explicitly noted in the documentation, as well. And I hope it will trigger an immediate exception in resolvelib, rather than just doing the wrong thing because the data is bad.

Which is also, IMO, not explained well enough in the resolvelib docs. ↩
Excepting, as you say, ResolutionTooDeep limits and bugs. ↩

notatallshaw · 2024-02-01T02:21:31Z

Maybe the issue is that because you are so familiar with the details of the resolution algorithm, you're not seeing the distinction I'm trying to make?

Almost certainly this is contributing the less than ideal communication, but this discussion has been very fruitful I think, it's given me specific updates I can go back and apply to the resolvelib PR so hopefully it will be easier for Pip to maintain interaction with this API in the future.

Let me be explicit then. If the list of unsatisfied names is ["foo", "bar", "baz"], how would pip know whether it's a good idea to remove "baz" from that list?

I think the problem I have answering this question is because the genuine answer is: using a deep understanding of Python packaging, it's ecosystem, the algorithm type resolvelib employs, and SAT-algorithms.

The API I propose in the PR is almost identical to get_preference, it provides the names and a bunch of objects that reflect the current state of the resolver algorithm, and in turn those objects are themselves derived from the objects that Pip provided to resolvelib to resolve. So the answers are the same as to the question "with get_preference how would Pip know if it's a good idea to not prefer baz?".

So as examples:

You may want to prefer a pinning requirement over a non-pinning requirement
It's important to know that visiting new candidates could be highly expensive, if not cached Pip may have to download gigantic wheels or build sdists
resolvelib employs it's important to understand it's DFS during backtracking and choices must be made in that context
It's helpful to know that Python packaging specifiers are unusual and doing static analysis varies significantly from any common literature on SAT-solvers
Having an understanding of CDCL SAT helps (I find these steps and diagrams particularly intuitive: https://en.wikipedia.org/wiki/Conflict-driven_clause_learning#Example) as ideas from it can most easily be applied to resolvelib's approach

Again, this is a genuine question - I really have no idea how I'd write a useful filter_unsatisfied_names method. It's similar to the problem with get_preference1, but worse, because at least with "preference" there's an intuitive idea of what it means to "prefer" looking at a particular candidate.

Agreed the name isn't intuitively unhelpful, any suggestions would be welcome: refine_resolution? narrow_backtrack_candidates?

pfmoore · 2024-02-01T07:52:22Z

Thanks. I guess my reservation is on the pip side then, as while I see that this can help a lot with resolution issues, it’s going to be critical that pip’s definition of this method is maintainable (without needing the sort of deep knowledge you possess - the pip code base is already far too inaccessible to new contributors). Let’s wait until there is a pip PR before discussing that in detail¹.

As to the name, narrow_backtrack_candidates is my preference out of the ones you suggested, but it’s only a mild preference. The real key is going to be having good documentation on the resolvelib side.

But for comparison, pip’s implementation of get_preference is a good example of a clear implementation that can be understood and maintained without in-depth resolver knowledge. ↩

notatallshaw · 2024-02-01T13:39:31Z

Let’s wait until there is a pip PR before discussing that in detail1

The PR exists, #12499, I will update it and the resolvelib PR some time over the next week with the feedback from here.

I raise this PR before the resolvelib one is merged, released, and vendored exactly to iron out any issues with Pip (as the main customer of resolvelib) before I push for the resolvelib one to be merged. That PR includes 3 commits: 1) Add new API with no behavior change, 2) Move some existing optimizations over to new API to get speed ups, 3) Add new optimizations (as outlined by #12498) to get massive speed up in complex backtracking cases.

pfmoore · 2024-02-01T17:33:02Z

Ah, OK. I'm afraid I don't really follow the logic in the new provider method in that PR. I'll try to take a proper look another day but my first reactions are:

The filter_unsatisfied_names method is documented in terms of "prefer backtracking on..." which is jarring to me. I'd much rather the docstring and comments be worded in terms of what the code is filtering out (and why it's doing so) and not leave it to the reader to try to work out the link from backtracking to filtering.
I can't understand the supporting functions. This is probably because they make heavy use of the PreferenceInformation data, which is in need of documentation. I'm going to hope for now that the resolvelib PR will address this, and properly review those functions in light of whatever additional documentation gets added.

Beyond that, I'll wait till I have some more time to do a proper review, rather than just a "quickly dash something off while I'm making food" disjointed ramble 🙂

notatallshaw · 2024-02-01T19:41:24Z

Yeah, based on the feedback here I'm going to take quite a bit of time updating the docstrings

notatallshaw added S: needs triage Issues/PRs that need to be triaged type: feature request Request for a new feature labels Jan 30, 2024

This was referenced Jan 30, 2024

Optimize Backtracking in Pip's Dependency Resolution By Prioritizing Direct Conflicts #12498

Open

Implement preferring closest conflicting causes for backtracking resolution #12318

Closed

notatallshaw mentioned this issue Feb 3, 2024

Allow provider to filter unsatisfied names, when backtracking sarugaku/resolvelib#145

Open

notatallshaw mentioned this issue Feb 11, 2024

Prefer direct conflicting causes when backtracking #12499

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

notatallshaw commented Jan 30, 2024 •

edited

pfmoore commented Jan 30, 2024

notatallshaw commented Jan 30, 2024 •

edited

pfmoore commented Jan 30, 2024

notatallshaw commented Jan 30, 2024 •

edited

pfmoore commented Jan 30, 2024

notatallshaw commented Feb 1, 2024 •

edited

pfmoore commented Feb 1, 2024

notatallshaw commented Feb 1, 2024 •

edited

pfmoore commented Feb 1, 2024

notatallshaw commented Feb 1, 2024

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

Introduce new API in resolvelib to Optimize Pip's Dependency Resolution #12497

Comments

notatallshaw commented Jan 30, 2024 • edited

What's the problem this feature will solve?

Describe the solution you'd like

Alternative Solutions

Additional context

Code of Conduct

pfmoore commented Jan 30, 2024

notatallshaw commented Jan 30, 2024 • edited

pfmoore commented Jan 30, 2024

notatallshaw commented Jan 30, 2024 • edited

pfmoore commented Jan 30, 2024

Footnotes

notatallshaw commented Feb 1, 2024 • edited

pfmoore commented Feb 1, 2024

Footnotes

notatallshaw commented Feb 1, 2024 • edited

pfmoore commented Feb 1, 2024

notatallshaw commented Feb 1, 2024

notatallshaw commented Jan 30, 2024 •

edited

notatallshaw commented Jan 30, 2024 •

edited

notatallshaw commented Jan 30, 2024 •

edited

notatallshaw commented Feb 1, 2024 •

edited

notatallshaw commented Feb 1, 2024 •

edited