Add maximum_common_ordered_subtree_embedding to algorithms.minors #4327

Open · wants to merge 9 commits into main

Conversation

@Erotemic (Contributor) commented Nov 9, 2020

New version of #4169, where the embedding code is reorganized under the minors package. Please see the original PR for discussion.

This PR no longer has any dependencies! (It previously depended on #4326, #4294, and #4349.)

This PR is depended on by #4350


EDIT: After starting the isomorphism PR I realized this PR could use some slight tweaks.


Summary

This PR adds a new algorithm that I don't believe exists in networkx:

  • Maximum common ordered subtree embedding: Given two ordered trees G and H, find the largest embedded subtree G' of G and H' of H such that G' is isomorphic to H'. A tree G' is an embedded subtree of G if G' can be obtained from G by a series of edge contractions.

The algorithm is restricted to ordered trees because the problem is APX-hard for any relaxation of the input types.
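
As a quick illustration of the edge-contraction definition, here is a minimal sketch (using networkx's existing `contracted_nodes`, not code from this PR):

```python
import networkx as nx

# G is a small ordered tree: a -> b -> c, plus a -> d.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "d")])

# Contracting the edge (a, b) merges b into a, so c becomes a child of a.
# The result (a -> c, a -> d) is an embedded subtree of G.
G_prime = nx.contracted_nodes(G, "a", "b", self_loops=False)
print(sorted(G_prime.edges()))  # [('a', 'c'), ('a', 'd')]
```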

Motivation

When working with PyTorch-based neural networks, I often encounter a problem where I have some custom module that uses resnet50 as a component, and I would like to simply start from existing pretrained resnet50 weights. However, to do that, the "state_dict" of the model must match the "state_dict" of the saved PyTorch weights file. But because my resnet50 model is a component, the keys don't line up exactly.

For instance, the keys in the resnet file may look like this:

        >>> # This means you can load an off-the-shelf unmodified pretrained resnet50
        >>> # where the keys might look something like this:
        >>> resnet_keys = {
        >>>     'conv1.weight',
        >>>     'layer1.0.conv1.weight',
        >>>     'layer1.0.conv2.weight',
        >>>     'layer1.0.conv3.weight',
        >>>     'layer1.0.downsample.0.weight',
        >>>     'layer2.0.conv1.weight',
        >>>     'layer2.0.conv2.weight',
        >>>     'layer2.0.conv3.weight',
        >>>     'layer3.0.conv1.weight',
        >>>     'layer4.0.conv1.weight',
        >>>     'fc.weight',
        >>>     'fc.bias',
        >>> }

And the "model-state" for the module may look like this:

        >>> # And perhaps you have a model that has a state dict where keys
        >>> # look like this:
        >>> model_keys = {
        >>>     'preproc.conv1.weight',
        >>>     'backbone.layer1.0.conv1.weight',
        >>>     'backbone.layer1.0.conv2.weight',
        >>>     'backbone.layer1.0.conv3.weight',
        >>>     'backbone.layer1.0.downsample.0.weight',
        >>>     'backbone.layer2.0.conv1.weight',
        >>>     'backbone.layer2.0.conv2.weight',
        >>>     'backbone.layer2.0.conv3.weight',
        >>>     'backbone.layer3.0.conv1.weight',
        >>>     'backbone.layer4.0.conv1.weight',
        >>>     'head.conv1',
        >>>     'head.conv2',
        >>>     'head.fc.weight',
        >>>     'head.fc.bias',
        >>> }

Now, yes, I could do (and have done) a hacky solution that tries removing common prefixes, but I wanted something more general, something that would "just work" in almost all cases. After thinking about it for a while, I realized that this was a graph problem: I can split the keys on "." and build a directory-like tree structure.
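
As a minimal sketch of that reduction (the helper name here is hypothetical, not code from this PR):

```python
import networkx as nx

def paths_to_tree(keys, sep=".", root="<root>"):
    """Build a directory-like tree from dotted keys (e.g. state_dict keys)."""
    tree = nx.DiGraph()
    tree.add_node(root)
    for key in sorted(keys):
        parts = key.split(sep)
        parent = root
        for i in range(1, len(parts) + 1):
            node = sep.join(parts[:i])  # e.g. "backbone", "backbone.layer1", ...
            tree.add_edge(parent, node)
            parent = node
    return tree

model_tree = paths_to_tree({"backbone.layer1.0.conv1.weight", "head.fc.bias"})
print(sorted(model_tree.edges()))
```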

Now I want to ask: what is the biggest subgraph that these two directory structures have in common? After searching for a while I found the maximum common induced subgraph problem, which is NP-hard for general graphs. But we have trees, so can we do better? I found "Maximum Common Subtree Isomorphism" and "Maximum Common Subtree", but I wasn't able to follow any of those links to a working algorithm (although perhaps the first one is worth revisiting).

Eventually I found the paper "On the Maximum Common Embedded Subtree Problem for Ordered Trees", which outlines a polynomial-time algorithm for finding maximum common embedded subtrees. The paper was written in a way that I could follow, but unfortunately I missed the detail that an embedding, although similar to an isomorphic subgraph, is not the same thing.

However, the algorithm for finding a common embedded subtree still works well for solving my problem. The paper also references external resources that tackle the actual common subtree isomorphism problem, but unfortunately they were all behind paywalls. This PR only addresses the embedded subtree problem.

However, I do believe I was able to modify the recurrence for the maximum common embedding into one that does produce a maximum common isomorphism, and I've empirically verified that it works on a few thousand randomly generated trees. The PR for that function is in #4350

Current Summary of Modifications

I'll maintain a top-level summary of modifications in this comment.

  1. networkx/algorithms/__init__.py - expose string and embedding modules

  2. networkx/algorithms/minors/__init__.py - new algorithm submodule for embedding problems

  3. networkx/algorithms/minors/tree_embedding.py - ⭐ implements the reduction from the graph problem to a string problem (see the sketch after this list). Defines the function that is the main API-level contribution of this PR: maximum_common_ordered_subtree_embedding.

  4. networkx/algorithms/minors/tests/test_tree_embedding.py - associated tests for tree embeddings

  5. networkx/algorithms/string/__init__.py - new algorithm submodule for string problems

  6. networkx/algorithms/string/balanced_sequence.py - Helpers for decomposing balanced sequences.

  7. networkx/algorithms/string/balanced_embedding.py - ⭐ core dynamic program to solve the maximum common balanced subsequence problem. This is the main algorithmic component of this PR.

  8. networkx/algorithms/string/balanced_embedding_cython.pyx - optional, but faster cython version of balanced_embedding.py

  9. networkx/algorithms/string/tests/test_balanced_sequence.py - associated tests for balanced sequences

  10. examples/applications/path_correspondence.py - demonstrates how to solve the path embedding problem using tree embedding. This file likely needs further reorganization, or possibly a separate PR; the rest of the PR stands alone without it.

  11. setup.py - registered string subpackages.
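
For context on the reduction mentioned in item 3: an ordered tree is encoded as a balanced sequence by emitting an opening token when a node is entered in DFS order and a closing token when it is exited, so the tree problem becomes a balanced-subsequence problem. A minimal sketch of that encoding (hypothetical helper name, not the PR's actual function):

```python
import networkx as nx

def tree_to_balanced_sequence(tree, root):
    """Encode an ordered tree as a balanced parenthesis sequence:
    '(' when a node is entered in DFS order, ')' when it is exited."""
    seq = []

    def dfs(node):
        seq.append("(")
        for child in tree.successors(node):  # children in insertion order
            dfs(child)
        seq.append(")")

    dfs(root)
    return "".join(seq)

T = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d")])
print(tree_to_balanced_sequence(T, "a"))  # ((())())
```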

@Erotemic Erotemic changed the title Dev/minor reorganization Minor reorganization (alternative to #4169) Nov 11, 2020
@Erotemic Erotemic force-pushed the dev/minor_reorganization branch 2 times, most recently from d7bd227 to 13591d2 Compare November 15, 2020 18:40
@Erotemic Erotemic changed the title Minor reorganization (alternative to #4169) Add maximum_common_ordered_subtree_embedding to algorithms.minors Nov 15, 2020
@Erotemic (Contributor, Author) commented:

Rebased this on master and updated based on comments from #4294. This still depends on #4349, so that should get reviewed first.

I'll probably also remove the example in the applications folder and put that into a separate PR to make this that much easier to review.

There is also the issue of the Cython backend and the `_autojit` module I wrote in this PR. I may split that into a separate PR as well, but I'm somewhat hesitant to do so; I'm leaning towards forcing that issue here, because this algorithm gets reasonable gains from the Cython implementation.

Base automatically changed from master to main March 4, 2021 18:20
@Erotemic (Contributor, Author) commented:

@rossbar @dschult Now that #4349 is merged, I've rebased on main, so now the meat and potatoes contribution is finally ready for review!

I anticipate that the alternative Cython implementation and how that should or should not hook into the main API is going to be the most hotly contested issue.

I really do want to push for its inclusion, as I think a reasonably fast backend is necessary for this algorithm to have any real-world usage. The pure Python version is just too slow for moderately sized graphs. I would love to keep the "autojit" logic I included, but I'm open to forcing the user to compile the C extension themselves if they want to use it. I think the current state is the best tradeoff: it attempts to compile and use the C extension if possible, but falls back to the Python implementation if it can't.
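
For reference, the fallback described above amounts to a compile-or-fallback import, roughly like this sketch (illustrative only; the PR's actual autojit logic is more involved):

```python
def _get_backend():
    """Prefer the compiled backend; fall back to pure Python."""
    try:
        # Importing the compiled module; in the PR this step may also
        # trigger an on-the-fly build of balanced_embedding_cython.pyx.
        from networkx.algorithms.string import balanced_embedding_cython
        return balanced_embedding_cython
    except ImportError:
        from networkx.algorithms.string import balanced_embedding
        return balanced_embedding
```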

There is also the issue of my use of "ubelt" in the autojit logic (which I use, among other things, to make sure the autojit is only attempted once). For now I haven't factored it out because I want to hear opinions on it. If we want to keep some form of the autojit functionality (I would vote for this), I can factor out ubelt. But if we decide to remove autojit, then ubelt would be removed with it.

@Erotemic (Contributor, Author) commented Aug 7, 2021

@rossbar @dschult I took some time to remove the Cython backend to make this easier to review and less controversial to merge. So now there are only two implementation backends: "iter" and "recurse". I noted that in the future a "cython-iter" backend may be added, but that's the extent to which I mentioned it. The patch to add it back is fairly small, so it probably makes sense to discuss that in a separate thread, and this can be a more streamlined review of the pure Python implementations of the algorithm.

EDIT: I also fixed the doc errors that were causing the CI to fail as well as some misc inconsistencies I noticed when reviewing the PR.

@Erotemic (Contributor, Author) commented Mar 7, 2022

@rossbar pinging to see if there is still interest in merging in this algorithm.

Note that I recently won 3rd place in the PyTorch 2021 hackathon for an application of this algorithm: https://devpost.com/software/torchliberator-partial-weight-loading

@rossbar (Contributor) commented Mar 7, 2022

Honestly I'm not sure - this PR is huge so it's tough to get any sense of what's in it. Some things that jumped out to me:

  • There's mention of Cython in the PR discussion, but there doesn't seem to be any cythonized/autojit code in the diff (aside from a mention in a module docstring)...
  • There's a significant expansion of the dependency footprint, e.g. for the example, which also uses a pattern that none of the other examples use (e.g. calling the examples with a flag). There's nothing wrong with this, but because it's wrapped up with everything else, it's hard for me to get a sense of how all the parts fit together and which set of changes is necessary for which feature.
  • My knee-jerk reaction is that the API/examples are pretty specialized - are the algorithm(s) really only relevant for file paths and weights of PyTorch networks? I think the case for inclusion in NX would be compelling if the functions were designed to work on NX data structures (e.g. ordered trees represented as graphs) rather than lists of strings.

These are the impressions I got from just skimming the diff & PR discussion - there's so much here that I really struggle to get a handle on the proposed changes. If you were amenable to it, I think one thing that might really help w/ review would be to pare this down as much as possible to the minimal proposed change, i.e. the absolute minimum set of changes necessary to add the maximum_common_ordered_subtree_embedding function (assuming that's the main payload of the PR) - just the function + docstring and tests. All of the supporting material could be moved to followup PRs, which will (hopefully) make it easier to get a sense of the proposal. I'd also encourage opening a new PR, as I think the discussion here has become pretty overloaded (see e.g. the comments about Cython, which no longer appear to be relevant?)

@Erotemic (Contributor, Author) commented:

I've been thinking a lot about this, and I think this algorithm might be better served as an extension to networkx rather than as a core part of it. This is mainly because it relies on the C implementation to be efficient.

To this end, I published a standalone version of the algorithms on PyPI with both the pure Python and Cython implementations: https://github.com/Erotemic/networkx_algo_common_subtree

I think the algorithm is about as specialized as some of the other algorithms that come bundled with networkx; the key thing it does is, given two trees, find some common substructure. In this case that substructure is a graph minor common to both trees. The new library I made also contains the variant of this problem where the common substructure is a subtree isomorphism. The algorithm is definitely not only relevant to PyTorch weights; that's just my motivating use case.
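
For a sense of the intended API, here is a usage sketch based on the standalone package linked above (the signature and the `(subtree1, subtree2, value)` return convention are assumptions drawn from that package, not verified against this PR's diff):

```python
import networkx as nx
from networkx_algo_common_subtree import maximum_common_ordered_subtree_embedding

# Two small ordered trees with some shared structure.
tree1 = nx.DiGraph([("root", "a"), ("a", "b"), ("root", "c")])
tree2 = nx.DiGraph([("root", "a"), ("a", "b"), ("a", "d")])

# Assumed return convention: the two embedded subtrees and a similarity value.
subtree1, subtree2, value = maximum_common_ordered_subtree_embedding(tree1, tree2)
print(sorted(subtree1.edges()), value)
```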

The reason the algorithm isn't designed to work directly on the networkx data structures is that the implementation is based on a reduction from an ordered tree to a "balanced sequence", hence all the futzing with strings/lists.

Another reason I'm inclined to close this is to prevent bloat in networkx itself. While Lozano's 2003 algorithm solves a fairly fundamental graph problem, there are so many fundamental graph algorithms that it might not make sense for networkx to contain them all. However, whenever I come across a graph problem I always scan the networkx docs to make sure an algorithm for it isn't already implemented, because networkx does have a wide array of core algorithms. I'm wondering if there should be a mechanism for third-party implementations to be listed in the core networkx docs (perhaps in an extensions/contrib section) that would make it easier for someone searching for a specific graph algorithm to find an existing implementation.

I guess my thought is that it would be nice if there was a centralized location that documents and references a distributed set of implementations that compute solutions to the zoo of graph problems that exist.

@rossbar (Contributor) commented Jul 22, 2022

> I'm wondering if there should be a mechanism for third-party implementations to be listed in the core networkx docs (perhaps in an extensions/contrib section) that would make it easier for someone searching for a specific graph algorithm to find an existing implementation.

I'm of two minds here. On the one hand, it would be great if users had a resource that collated the full atlas of available graph algorithms across libraries! However, I think this would be extremely challenging to maintain. I'm hesitant to link to external projects for many reasons, most notably that a) we don't have the bandwidth to keep track of the maintenance status of external projects, and b) API differences between projects might make references to them less useful for users.

The lightest-weight way I can think of to do this would be to add links to external Python projects in the See Also sections of docstrings via intersphinx. Even in this case, though, I'd be hesitant to do so for the above reasons.

Now, if someone were to make a knowledge graph (ha) of graph-theoretic/network analysis algorithms and tag the nodes with examples of concrete implementations across libraries, I think that would be an awesome project. We've got an open issue about doing this within networkx (see #5824), but if someone wanted to take that idea and expand it to include cross-library links, that'd be great.
