Algos for max ordered common subtree embedding #4169
Conversation
Hello @Erotemic! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-10-12 20:36:26 UTC
Nice! There's a lot here. We would be interested. There is recent interest in tree isomorphism, so that provides a place to put it too. Note: we already have monomorphisms and embeddings in the isomorphisms folder -- we may have to refactor some day, but today is not that day. One obvious place to put it would be in … Alternatively, if it makes more sense to you or is easier (we can refactor if needed later) you can make a module … Either way, put a hook into … If you have questions, just ask. :)
Awesome, I'll put some work into cleaning the code up. Now that I'm more familiar with this problem I keep running into more and more places where this algorithm will be useful. Unfortunately, the current implementation I have seems too slow. It seems to only scale to graphs with < 500 nodes. I've spent a good deal of time profiling the code and optimizing where I can, but it's unsatisfactory. The running time of the algorithm is reported as … For reference, the fit coefficients are:
I'm not sure if the second-order fit is good enough. It does seem to imply that I'm getting the correct running time in terms of graph sizes, so perhaps my constant overhead is too big. Currently the algorithm uses a recursive call with a memoization dictionary. I don't see an obvious way to transform it into an iterative scheme, but if there is a way to do that, then it should help the runtime a lot by reducing the number of Python function calls that need to be made. For this to be really useful it should be able to scale to graphs with tens of thousands of nodes. Current projections estimate that for 10,000 nodes it could take 5, 39, or 111 hours, which is far too slow. What is the policy on Cython backend implementations for this repo? I think if the algo gets a 100x speedup from a compiled language it will be real-world useful (assuming it's the Python calls that are the real problem). If anyone has other ideas on how to speed it up, I'd be happy to hear them.
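The scaling projection described above can be reproduced with a simple polynomial fit; here is a minimal sketch using numpy on synthetic timings (the coefficients are made up for illustration, not the PR's measurements):

```python
import numpy as np

# Synthetic timings that grow quadratically, standing in for benchmark data.
sizes = np.array([50, 100, 200, 400, 800], dtype=float)
times = 3e-6 * sizes**2 + 2e-4 * sizes  # pretend these were measured

# Fit a degree-2 polynomial: coeffs are [a, b, c] for a*n^2 + b*n + c.
coeffs = np.polyfit(sizes, times, deg=2)

# Extrapolate to a large graph size to project the running time.
projected = np.polyval(coeffs, 10_000)
print(coeffs, projected)
```

Comparing the residuals of degree-2 and degree-3 fits is one way to judge whether "second order" actually describes the data.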
Recursive code is good for proofs, not so good for performance. All recursive code can be written without recursion -- the idea is to replace the recursive function call with a stack to hold the state and then make a loop that runs the function and moves things on or off the stack when the recursive function would have been called or returned from. Easy to say, often quite head-bending to implement. But it does speed things up quite a bit. You say you've done some profiling... What are the bottlenecks? Finally, run-time asymptotics is 1) valid in a limit, so sometimes not accurate for real-world sized "n", and 2) sometimes easy to mix up when coding -- putting in a seemingly innocent idiom can bump up the exponent of the polynomial. To me, the data looks faster than quadratic... but that's just using my "eye-ball norm" :} I haven't looked through the code in detail, but a first pass suggests that you spend a lot of code packing and unpacking tuples of values... In Python calling functions is often a bottleneck (in my experience) and packing and unpacking tuples is similar (the args to a function are packed to make the call and then unpacked to start the function).
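The transformation described above (replace the call stack with an explicit stack plus a loop) can be sketched on a toy problem; this is a hypothetical example, unrelated to the PR's actual algorithm:

```python
def tree_sum_recursive(node):
    # node = (value, [children]); the natural recursive formulation.
    value, children = node
    return value + sum(tree_sum_recursive(c) for c in children)

def tree_sum_iterative(node):
    # Same computation with an explicit stack of pending nodes,
    # so no Python call-stack frames are consumed per node.
    total = 0
    stack = [node]
    while stack:
        value, children = stack.pop()
        total += value
        stack.extend(children)
    return total

tree = (1, [(2, []), (3, [(4, [])])])
```

The iterative form also sidesteps the interpreter's recursion limit on very deep trees, which becomes relevant later in this thread.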
For some reason I thought there was a case where it was only possible to perform a task with recursion, but now that I'm refreshing myself on the details, I think I was confusing that with the result that tail-recursive programs can be rewritten as loops without stacks. I always struggled with the bit of dynamic programming where you transform the recursive function into an iterative one, but I'm sure I can work through it and it will be a good exercise for me. I've seen the speed benefit of doing so, so it will be interesting to see how much of an impact it makes on this problem. Given the above, I'm going to spend most of my effort working on the translation to an iterative algorithm, but I'll answer the other questions as well: The algorithm works by translating the tree problem into a string/sequence problem. The conversion to and from these sequences is very fast. The tokens in the sequence don't actually need to be the same as the nodes in the tree, they just need to (1) represent them, (2) have a way to denote whether the token is an opening or closing token, and (3) have a way to map back to the original Python node value. So, it's actually reasonably amenable to translation into Cython. The only caveat would be if the user wanted to supply a "node_affinity" function, in which case we would need to jump back into a Python call. But for normal cases all of the Python objects can technically be abstracted away. Instead of using a Python string I represent the sequence as a tuple of items (either strings, integers, or 2-tuples; they all profile about the same). The reason for using tuples instead of lists to represent the sequences is that the sequences themselves need to be hashed for the memoization (I believe this will have to happen in the iterative algorithm as well).
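The reduction described above (a DFS that emits an opening token on entering a node and a closing token on leaving it, kept as a hashable tuple for memoization) might look roughly like this; the helper name is made up and this is not the PR's actual conversion code:

```python
def tree_to_balanced_sequence(root, children):
    # `children` maps each node to its ordered list of children.
    # Emit (node, 'open') on entry and (node, 'close') on exit of a DFS,
    # so every token satisfies requirements (1)-(3) described above.
    seq = []
    def dfs(u):
        seq.append((u, 'open'))
        for v in children.get(u, []):
            dfs(v)
        seq.append((u, 'close'))
    dfs(root)
    return tuple(seq)  # a tuple, so the sequence is hashable for the memo dict

children = {'a': ['b', 'c'], 'b': ['d']}
seq = tree_to_balanced_sequence('a', children)
```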
At one time I thought I could enumerate all possible sequences that the algorithm would end up needing (which was actually my first attempt at turning this into an iterative algorithm), but it turned out I missed a case and I gave up on that route; perhaps it's worth taking a second look at it. Here is the output of the profiler:
Function: _lcs at line 647
Line # Hits Time Per Hit % Time Line Contents
==============================================================
647 @profile
648 def _lcs(seq1, seq2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo):
649 80481 55118.0 0.7 3.6 if not seq1:
650 5577 3929.0 0.7 0.3 return (seq1, seq1), 0
651 74904 47199.0 0.6 3.1 elif not seq2:
652 1201 886.0 0.7 0.1 return (seq2, seq2), 0
653 else:
654 # if len(seq2) < len(seq1):
655 # seq1, seq2 = seq2, seq1
656 # key = (seq1, seq2)
657 73703 276552.0 3.8 18.2 key1 = hash(seq1) # using hash(seq) is faster than seq itself
658 73703 304711.0 4.1 20.0 key2 = hash(seq2)
659 73703 60020.0 0.8 3.9 key = hash((key1, key2))
660 73703 58327.0 0.8 3.8 if key in _memo:
661 37960 28577.0 0.8 1.9 return _memo[key]
662
666 35743 25909.0 0.7 1.7 if key1 in _seq_memo:
667 35342 28331.0 0.8 1.9 a1, b1, head1, tail1, head1_tail1 = _seq_memo[key1]
668 else:
669 401 12308.0 30.7 0.8 a1, b1, head1, tail1 = balanced_decomp_unsafe2(seq1, open_to_close)
670 401 390.0 1.0 0.0 head1_tail1 = head1 + tail1
671 401 398.0 1.0 0.0 _seq_memo[key1] = a1, b1, head1, tail1, head1_tail1
672
673 35743 26558.0 0.7 1.7 if key2 in _seq_memo:
674 35340 28794.0 0.8 1.9 a2, b2, head2, tail2, head2_tail2 = _seq_memo[key2]
675 else:
676 403 13014.0 32.3 0.9 a2, b2, head2, tail2 = balanced_decomp_unsafe2(seq2, open_to_close)
677 403 402.0 1.0 0.0 head2_tail2 = head2 + tail2
678 403 382.0 0.9 0.0 _seq_memo[key2] = a2, b2, head2, tail2, head2_tail2
679
680 # Case 2: The current edge in sequence1 is deleted
681 35743 61089.0 1.7 4.0 best, val = _lcs(head1_tail1, seq2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
682
683 # Case 3: The current edge in sequence2 is deleted
684 35743 55926.0 1.6 3.7 cand, val_alt = _lcs(seq1, head2_tail2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
685 35743 28498.0 0.8 1.9 if val_alt > val:
686 17386 12013.0 0.7 0.8 best = cand
687 17386 11350.0 0.7 0.7 val = val_alt
688
689 # Case 1: The LCS involves this edge
690 35743 30563.0 0.9 2.0 t1 = open_to_tok[a1[0]]
691 35743 30056.0 0.8 2.0 t2 = open_to_tok[a2[0]]
693 35743 174729.0 4.9 11.5 affinity = node_affinity(t1, t2)
694 35743 26802.0 0.7 1.8 if affinity:
695 4497 7392.0 1.6 0.5 new_heads, pval_h = _lcs(head1, head2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
696 4497 6953.0 1.5 0.5 new_tails, pval_t = _lcs(tail1, tail2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
697
698 4497 3597.0 0.8 0.2 new_head1, new_head2 = new_heads
699 4497 3188.0 0.7 0.2 new_tail1, new_tail2 = new_tails
700
701 4497 5169.0 1.1 0.3 subseq1 = a1 + new_head1 + b1 + new_tail1
702 4497 4671.0 1.0 0.3 subseq2 = a2 + new_head2 + b2 + new_tail2
703
704 4497 3249.0 0.7 0.2 cand = (subseq1, subseq2)
705 4497 3469.0 0.8 0.2 val_alt = pval_h + pval_t + affinity
706 4497 3380.0 0.8 0.2 if val_alt > val:
707 919 667.0 0.7 0.0 best = cand
708 919 606.0 0.7 0.0 val = val_alt
709
710 35743 26093.0 0.7 1.7 found = (best, val)
711 35743 29843.0 0.8 2.0 _memo[key] = found
712 35743 22348.0 0.6 1.5 return found

I'm pretty sure I've squeezed all the performance possible out of the recursive definition, so let's see what an iterative implementation looks like. I'm changing the title of this PR to [WIP].
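The memoization pattern visible in the profile above (keying the cache on hash(seq1)/hash(seq2) rather than on the sequences themselves) can be illustrated on the ordinary longest-common-subsequence problem. This is a simplified, hypothetical sketch, not the PR's `_lcs`:

```python
def lcs_len(seq1, seq2, _memo=None):
    """Length of the longest common subsequence, memoized on sequence hashes."""
    if _memo is None:
        _memo = {}
    if not seq1 or not seq2:
        return 0
    # Mirror the key1/key2 trick from the profile: storing precomputed hashes
    # is cheaper than repeatedly hashing the full tuples as dict keys.
    key = hash((hash(seq1), hash(seq2)))
    if key in _memo:
        return _memo[key]
    if seq1[-1] == seq2[-1]:
        val = 1 + lcs_len(seq1[:-1], seq2[:-1], _memo)
    else:
        val = max(lcs_len(seq1[:-1], seq2, _memo),
                  lcs_len(seq1, seq2[:-1], _memo))
    _memo[key] = val
    return val
```

As in the profiled code, keying on hashes trades a vanishingly small risk of hash collisions for speed.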
I marked this as a draft and removed WIP, which makes the WIP status easier for us to track. Since WIP isn't hooked into GitHub, we used to merge and forget to remove WIP.
I created a basic iterative algorithm, which was less difficult than I thought it was going to be. I'm not sure if there is a better iteration strategy, but I basically simulated a simplified version of the recursive call stack. Depending on the test I run, it is actually slower than the recursive algorithm (for random graphs), but the iterative version seems faster on larger shallow graphs. I did get one nice result where the recursive algorithm takes 100 seconds and the iterative algorithm takes 6 seconds on two fairly complex depth-10 200-node trees, so it does look like the change was worth it. I've tried several variants of the iterative algorithm; the speed of each seems to depend on the input type. I'll need to run a few more tests to characterize when each algorithm is slow. Still a bit more work to do, but it's coming along.
Spent a little time working on characterizing when the algorithm is slow / fast, but just as a quick update: Cython helps a lot. I'm getting 40-50x speedups just by using almost the exact same Python code in Cython. Removing the Python overhead for simple loop iteration / continue statements / variable assignment / etc. seems to go a long way. Here are a few timing comparisons: With one type of graph:
With another:
In a lot of cases the recursive algorithm is still winning over the iterative algorithm, but the Cython-iterative algorithm blows them all away.
I think most of the engineering work on the algorithm is done. I've started a cleanup of the code. I've moved relevant files to a subfolder. (A side effect of this process is that I wrote a nifty script for converting google-style docstrings to numpy-style docstrings.) The next major step is to write all the unit tests (many of which can be created from existing doctests), but at this stage the algorithm is technically usable. Any input or guidance on code structure, as time allows, would be appreciated. Main questions are:
The current state of this branch is getting close to completion. I've decided to restrict this to only the embedding variant of the problem, and I'll submit the isomorphism part in a separate PR; this one is big enough already, but I think we can slim it down some. However, there are parts that are nearly ready or ready for review, so it might be a good time for me to ask for some feedback. I'll post specific questions later, but for now here's roughly the state of things. I've put everything into an … The other files are as follows:

- balanced_sequence.py - core python implementations for the longest common …
- balanced_sequence_cython.pyx - …
- tree_embedding.py - defines reduction from tree problem to balanced sequence
- path_embedding.py - defines reduction from path problem to tree problem (not …)
- demodata.py - contains data for docstrings, benchmarks, and synthetic problems

One of the first decisions I need help making is which variants of the algorithm are worth keeping. Currently I have 8 (or 8 * 2 if you count the way I'm doing the reduction as a variable). Tests on reasonably sized "large datasets" showed the following times:
where the second item in the "key" tuple is the implementation codename. One of the cython algorithms performs the best, and the recursive algorithm is the next best. However, while fastest, the recursive algorithm cannot handle trees beyond a certain depth, so the "iter-alt2" implementation seems like the one to keep in this case. In a second test I vary my sequence-encoding strategy:
chr uses a string-based sequence of utf8 chars, whereas number uses a list of numbers to represent the sequence (I also tried a list of tuples but it was far worse than these methods). The chr method does seem a good deal faster, but it can only handle a maximum of 556,032 distinct nodes; maybe that's good enough, but number would be able to scale to any number of nodes, so I don't know which one to use as the default. (For reference, we want to encode a balanced sequence, so we need to map each node to an opening and closing "token"; the number strategy uses ….) @dschult @jarrodmillman: The list of things to decide is:
There are still small improvements I need to take care of, but having answers to these questions would help me complete that work.
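The chr strategy discussed above can be sketched as follows: assign the i-th node the character pair (chr(2i), chr(2i+1)), skipping the unencodable UTF-16 surrogate range, which is where a ceiling of (0x110000 - 0x800) // 2 = 556,032 pairs comes from. The helper below is hypothetical, not the PR's code:

```python
def chr_tokens(index):
    # Assign the i-th node a pair of characters: one opens, the next closes.
    # chr() accepts surrogate code points (0xD800-0xDFFF) but they cannot be
    # encoded to utf8, so skip that range entirely.
    open_cp = 2 * index
    if open_cp >= 0xD800:
        open_cp += 0x800
    return chr(open_cp), chr(open_cp + 1)
```

Because the open code point is always even (the 0x800 shift preserves parity), the open/close pair never straddles the surrogate range.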
We currently don't use Cython and will need to make a decision whether we will start before merging this. Historically, our focus has been on simple implementations over performance. But now is a good time for us to revisit this. We have plenty of time before the next release, so let's leave it in the PR until we revisit whether or not to change our Cython policy. We may need to create a NXEP to decide what we are going to do about optimizing code for performance. Thanks for your patience.
This is a really big PR. It's likely to take a long time to review. We can proceed in a couple of ways: either split up the PR into many (ugh) or split up the review into parts (maybe also ugh). I'd like to consider the heart of the new algorithm, but it's hard to know where the boundary of that code is. Which parts are interlinked and which could stand separately? It doesn't have to be split into multiple GitHub PRs, but it might be helpful to know where the interdependent algorithm code is. My guess is that … What do you think about putting benchmarks, and maybe demodata, into the …? Maybe we can start with …
It is bigger than I had hoped it would be, but it was tricky to implement efficiently, as the algorithm is based on a reduction to a string problem, which means there has to be a conversion layer to and from the nx.Graph to some "string" (i.e. indexable) Python data structure. My original version was much smaller, but also orders of magnitude slower and prone to stack overflow. Your understanding of the dependencies is correct. The core of the algorithmic component is balanced_sequence; this handles the string problem the tree problem is reduced to. There are several different implementations of the same core logic (which are the protected …).

On the core balanced_sequence module

The core groups of functions (which may have different implementations) in balanced_sequence are:
There is also a Cython version of balanced_sequence (named balanced_sequence_cython), which implements more variations of solving the "_lcs" problem. I wrote the code in such a way that the public API's default behavior is to use Cython if available, but fall back to pure Python. It is the fastest implementation by a significant margin, so I do recommend considering including it.

On the prehash behavior

There is a lot of redundant behavior in this file, and it might be possible to reduce it, depending on how we want to proceed. The "prehash" stuff above is mainly useful when balanced sequences are stored as a list of tuples. The main advantage of this representation is the quick and easy-to-see access to the internals. However, when the data structure is a python str or a list[int], then prehashing does not seem to benchmark better than the other methods. Thus we could restrict the balanced sequence to only be … I would have done that already, but I had to benchmark it to know what combination of data structure + logic was going to work best, and I wanted to ensure that some hash in the git history had the code so the experiment I did can be reproduced. So I wanted to note that it exists, but now that I have, I'd advocate for removing it because it won't be useful in practice; however, I also wouldn't object to keeping it as an option, as it may aid future development.

On non-core components
path_embedding … demodata enables both benchmarks and tests. I think some of the functions in demodata may be suited elsewhere: I was surprised there wasn't already a function like … I think putting benchmarks in an examples folder is a good idea. I think it might also be a good idea to put path_embedding there (because a path_embedding is really just a use-case of the graph algorithm), as it doesn't make much sense to offer that functionality in a graph library.

On Cython

For reasonably stable algorithms it makes sense to have a faster Cython implementation if networkx would like to improve its scalability. However, I understand the maintenance and build-time costs. A module becomes much harder to maintain once you need to worry about binaries. I've had a lot of experience with this via the kwimage, pyhesaff, and pyflann packages. Fortunately scikit-build does make it somewhat simpler to manage, but I know there is an outstanding pypy issue. I think it never makes sense to rely on a Cython implementation, but it should be possible to use one if it exists. Thus a Cython algorithm should always reproduce some functionality available in pure Python, as I've done here. My recommendation is to include the Cython implementation in this PR, but do nothing else for the time being. If the user wishes to use it, they can compile it themselves, as it will be distributed with networkx. When the scikit-build pypy issue gets fixed we can talk about building manylinux, osx, and win32 wheels on CI (we can also talk about auto-publishing and auto-GPG signing on CI if you are interested; not sure what your current release workflows are). In the unlikely case where it falls out of sync with main Python it can be removed.
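The "use Cython if available, fall back to pure Python" dispatch described above is typically done with a guarded import. A hedged sketch, with made-up module and function names (the trivial placeholder body is not the real algorithm):

```python
def _lcs_python(seq1, seq2):
    # Stand-in for the pure-Python implementation (placeholder logic only).
    return len(set(seq1) & set(seq2))

try:
    # Hypothetical compiled module; only importable if the .pyx was built.
    from balanced_sequence_cython import _lcs_cython as _lcs_backend
except ImportError:
    # No compiled extension available: degrade gracefully to pure Python.
    _lcs_backend = _lcs_python

def longest_common_balanced_sequence(seq1, seq2, impl='auto'):
    # Public API dispatches to the fastest available backend by default.
    func = _lcs_backend if impl == 'auto' else _lcs_python
    return func(seq1, seq2)
```

This keeps the Cython module strictly optional: the public function behaves identically whether or not the extension was compiled.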
@dschult @jarrodmillman: In the absence of feedback (which is understandable; I understand time constraints wrt working on OSS), I made a few decisions which will hopefully make review easier. I tended to err on the side of removing things.

Current Summary of Modifications
Details

More details on the changes I made to arrive here:
Questions I have about the current state of the code:
I almost wonder if … Also, this PR uses xdoctest for doctests. Can you make it work with …?
I like having a separate module for … I've edited the main post to include the summary of all module changes, which I will keep up to date as this PR continues. As for xdoctest, I would really prefer to continue to use it. However, I do understand the desire to minimize dependencies, so my most recent commit does rework my doctests to be backwards-compatible (at the cost of some readability and soon-to-be-mentioned issues). That being said, I do hope you consider including xdoctest in the future. I do think it would be an overall improvement, and it would make writing / maintaining doctests much easier.

The case for xdoctest in networkx

The main feature I'm making use of is "new-style got/want" tests, where print statements don't need to be broken up. For instance, the following works with xdoctest:

Example
-------
>>> open_to_close = {'{': '}', '(': ')', '[': ']'}
>>> seq = '({[[]]})[[][]]{{}}'
>>> all_decomp = generate_all_decomp(seq, open_to_close)
>>> node, *decomp = all_decomp[seq]
>>> pop_open, pop_close, head, tail, head_tail = decomp
>>> print('node = {!r}'.format(node))
>>> print('pop_open = {!r}'.format(pop_open))
>>> print('pop_close = {!r}'.format(pop_close))
>>> print('head = {!r}'.format(head))
>>> print('tail = {!r}'.format(tail))
>>> print('head_tail = {!r}'.format(head_tail))
node = '('
pop_open = '('
pop_close = ')'
head = '{[[]]}'
tail = '[[][]]{{}}'
head_tail = '{[[]]}[[][]]{{}}'

However, to refactor that to work with the builtin doctest module would require following each print statement with the text it produced:

Example
-------
>>> open_to_close = {'{': '}', '(': ')', '[': ']'}
>>> seq = '({[[]]})[[][]]{{}}'
>>> all_decomp = generate_all_decomp(seq, open_to_close)
>>> node, *decomp = all_decomp[seq]
>>> pop_open, pop_close, head, tail, head_tail = decomp
>>> print('node = {!r}'.format(node))
node = '('
>>> print('pop_open = {!r}'.format(pop_open))
pop_open = '('
>>> print('pop_close = {!r}'.format(pop_close))
pop_close = ')'
>>> print('head = {!r}'.format(head))
head = '{[[]]}'
>>> print('tail = {!r}'.format(tail))
tail = '[[][]]{{}}'
>>> print('head_tail = {!r}'.format(head_tail))
head_tail = '{[[]]}[[][]]{{}}'

Personally, I think the former is far more readable (you see a block of code that produces something and then you see the output as one single chunk, versus forcing humans to work like a REPL). There are also minor issues with trailing whitespace causing got/want errors in the original doctest, the fact that any non-captured variable must be provided with a "want" string, and the general issue of forcing the programmer to distinguish between lines that start a statement versus continuations of previous statements. For reference, xdoctest is small, has minimal dependencies, and is 100% compatible with the current structure of networkx (in fact networkx is one of the main test cases I used to ensure backwards compatibility when I wrote xdoctest); running …
Thanks for that description of xdoctest. And I think …
I am not sure I like having the doctests behave differently than Python. Is this a widely used package? What other projects use it?
@jarrodmillman In case I haven't mentioned, I am the author of xdoctest. I've used … On the issue of doctests behaving differently than Python's builtin doctest module: the module is almost entirely backwards compatible (I recently patched a few more corner cases that popped up after I gave the PyCon talk, and I'll patch more if I find them). It's really more of an extension of Python's builtin system that allows for more flexibility when writing tests (the biggest benefit being that you don't have to manually distinguish between lines starting with …
I am not so concerned with it not behaving like the doctest module, but that it isn't how the Python interpreter works. These are supposed to be little snippets of how the code behaves at the command line. In your example above, the output of the print statements doesn't show up after each command, but after all the print statements. I am also -1 on using …
I also like the name …
I'm confused here. The Python interpreter --- and more generally the Python language --- doesn't require or even say anything about … If you input
or
to the IPython interpreter, both of them work (passing to the regular python interpreter will actually fail because neither is actually valid python syntax). There is nothing special about the PS1 and PS2 prefixes from Python's perspective. What would be more accurate is to say … Really, the only reason that the builtin doctest forces use of the …
I don't see how any of xdoctest's features disagree with that premise. Imagine you see this snippet online:

if 1:
print('hi')
if 1:
print('hi')
print('hi')
if 1:
print('hi') What you get when you paste it into an IPython terminal is So doesn't it make more sense to use the xdoctest allowed way of specifying the doctest-syntax: >>> if 1:
>>> print('hi')
>>> if 1:
>>> print('hi')
>>> print('hi')
>>> if 1:
>>> print('hi')
hi
hi
hi
hi

instead of writing it the only way you can write this with the builtin doctest module?

>>> if 1:
... print('hi')
hi
>>> if 1:
... print('hi')
... print('hi')
hi
hi
>>> if 1:
... print('hi')
hi

While only the previous case works in the original doctest, both cases work in xdoctest. By forcing you into the latter, it actually makes it harder for users to copy-paste snippets from autodoc-generated docs. In the former I can copy one block and compare all my output versus what I see in the docs. In the second case I have to copy each expression over bit by bit. When copying a snippet, I generally want to be able to paste the whole thing. However, this example does open up valid criticism: "Isn't there an ambiguity? How can you ensure that the second block did emit two 'hi's?" My response is: yes, there are rare corner cases where the checker is ambiguous. However, it is rare for these cases to emerge in the wild, and the large majority of failing outputs will be caught. Furthermore, if you are relying on a doctest for exact string matching, I would recommend you move that test to a proper unit test, because doctests are often messy and largely benefit from fuzzy matching. And if that doesn't convince you, then you can always put xdoctest into REPL mode and disable fuzzy matching to lose any ambiguity. Perhaps I'm misunderstanding your comment, but I don't see how the arguments follow. Anyway, this seems like a discussion for another thread or issue. The changes in this PR no longer require xdoctest, so the point is moot here. However, I do think networkx stands to gain from including xdoctest, so I started an issue here: #4295. In news more relevant to this PR, I opened #4294, which ports the … Also, I noticed in the mission statement:
So maybe that is justification for the recursive implementation of LCS to be included as an option? You can demonstrate how memoization can speed up recursive dynamic programs, but ultimately you hit a program stack limit error. It also provides an easier-to-understand algorithmic reference, and by comparing the similarities between it and the iterative counterpart, it can be used to verify the iterative counterpart's correctness. Perhaps I should add that implementation back in as an available option? I'm going to have some time off work in the next two weeks, so I want to spend a bit of time finishing up this PR --- ensuring docs exist, splitting any other part off as needed, etc... So be on the lookout for that in the next two weeks.
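The stack-limit failure mode mentioned above (a recursive formulation dying with RecursionError where the iterative one keeps going) is easy to demonstrate; this is a toy sketch, not the PR's code:

```python
import sys

def depth_recursive(n):
    # Recursive counter: consumes one interpreter stack frame per level,
    # so it overflows for n beyond sys.getrecursionlimit().
    return 0 if n == 0 else 1 + depth_recursive(n - 1)

def depth_iterative(n):
    # Same computation as a loop: bounded only by memory, not stack depth.
    total = 0
    while n > 0:
        total += 1
        n -= 1
    return total

big = sys.getrecursionlimit() * 2
try:
    depth_recursive(big)
    overflowed = False
except RecursionError:
    overflowed = True
```

This is why a recursive LCS works as a readable reference implementation but cannot be the only implementation for deep trees.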
I'm +1 for including a recursive version as well as the non-recursive version. Perhaps called …
I agree that it is better to have this discussion separately. It isn't essential to this PR, which has a lot of good stuff. I also appreciate that you split this PR up to make it easier to review. It is great that you will have time soon to work on finishing this up. I am looking forward to it. FYI, I am going to be pretty busy for the rest of the year; I have a bunch of small items I want to make sure get included in the upcoming 2.6 release, and I will be spending most of my time working on my dissertation. So, if I am not able to keep up with your work, please don't take it as an indication that I am not interested. Our goal for the 2.6 and 3.0 releases is to remove technical debt and improve the PR review process. The goal is to start focusing more on exciting new features and algorithms after 3.0, which will be released in early 2021 (hopefully January). You've probably noticed that we've been adding things like the mission statement and other development guide stuff. Our hope is to regrow the core developer community over the next year. Thanks for bearing with us.
I just squashed, rebased on #4294, and reintroduced the recursive version of the LCS algorithm. Once that gets merged, the number of modified files in this PR will decrease from 16 to 12. @dschult My preferred way to handle different implementations of the same underlying algorithm is to have the core implementations named as protected functions, in much the same way as you described. For this I have …
New work includes cleaning up the API, removing dead code, and addressing outstanding issues. Something I didn't expect to do, but it just sort of happened: I wrote an "auto-jit" utility for Cython that should simply autocompile the pyx file if that's possible; otherwise it will fall back on pure Python. Essentially, all that would need to change is including the pyx file as a module resource in … I also expanded the docs in several places with an emphasis on teaching about the algorithm. There are two more items I'd like to address before a full review:
Which brings me to a question: Is there any advice on where / how the rst docs should be updated for this new algorithm? Is there a changelog I need to fill out (or is that generated via a git script)?
I think I was able to figure out the docs well enough. I was able to compile them and look at them locally, and I think I handled most of the visual formatting issues. (I encountered more issues where xdoctest could help with that, but it has to be at the sphinx level, so I think the next step is to integrate there.) This PR now depends on two others: #4294 and #4326. They are included in the git history here for dashboard purposes. I think they should be pretty easy to review, but this is the more interesting PR, and comments/reviews on it would allow me to continue work on this while the other PRs finish a full review. I'm marking this PR as ready for review as the dashboards are passing and all components are in a state where I think they could go live. There is an outstanding question though: based on my ever-growing understanding, embeddings are minors. Should we change the name of the package to minors/tree_minor.py? How should this integrate with the existing algorithms/minors.py? My thought is that I could make … Aside from that, I think this PR is in the done state. I don't have any plans to change anything pending review. Pinging @jarrodmillman @dschult just to make them aware of the status change. I understand this probably won't get fully reviewed/integrated until 2021.
I made a version #4327 where I reorganized the minors subpackage. We can move forward with either this PR or the other one.
I'm closing this in favor of #4327. I've made a few function-name and signature tweaks such that it will just be simpler to work off the new PR than to backport those changes to this branch.
Closed in favor of #4327
Summary
This PR is for two new (related) algorithms which I don't believe exist in networkx:
(will do this in a separate PR) Maximum ordered common subtree isomorphism: Given two ordered trees: G and H, find the largest subtree G' of G and H' of H where H' is isomorphic to G'.
Maximum ordered common subtree embedding: Given two ordered trees: G and H, find the largest embedding G' of G and H' of H where H' is isomorphic to G'. A tree G' is an embedded subtree of G if G' can be obtained from G by a series of edge contractions.
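The edge-contraction operation in the embedding definition above can be illustrated on a simple parent-to-children mapping. The helper below is hypothetical: contracting edge (u, v) deletes v and splices v's children into u's child list at v's position, preserving left-to-right order:

```python
def contract_edge(children, u, v):
    # Contract edge (u, v): v disappears and its children are spliced into
    # u's child list where v used to be, preserving the ordered-tree order.
    new = {k: list(c) for k, c in children.items() if k != v}
    i = new[u].index(v)
    new[u][i:i + 1] = children.get(v, [])
    return new

# Tree: r -> [a, b], a -> [x, y]. Contracting (r, a) reattaches x and y to r.
children = {'r': ['a', 'b'], 'a': ['x', 'y']}
contracted = contract_edge(children, 'r', 'a')
```

A tree reachable from G by repeating this operation is an embedded subtree of G in the sense used above.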
These algorithms are restricted to ordered trees because --- at least for the embedding one --- the problem is APX-hard for any other relaxation of the input types. The subtree isomorphism variant probably is too, but I'd need to check.
Motivation
When working with pytorch-based neural networks I found that I often encounter a problem where I have some custom module that uses resnet50 as a component, and I would like to simply start from existing resnet50 pretrained weights. However, to do that the "state_dict" of the model must match the "state_dict" of the saved pytorch weights file. But because my resnet50 model is a component, the keys don't exactly line up.
For instance, the keys in the resnet file may look like this:
And the "model-state" for the module may look like this:
Now, yes, I could do (and have done) a hacky solution that tries removing common prefixes, but I wanted something more general; something that would "just work" in almost all cases. After thinking about it for a while I realized that this was a graph problem. I can break up the components by the "." and make a directory-like tree structure.
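The "split on '.' to get a directory-like tree" step can be sketched with a nested-dict trie; the keys below are made-up examples in the style of a torchvision state_dict, not ones taken from this PR:

```python
def keys_to_tree(keys):
    # Build a nested-dict trie: each "." component becomes one tree level,
    # so two state_dicts can be compared structurally rather than by string.
    root = {}
    for key in keys:
        node = root
        for part in key.split('.'):
            node = node.setdefault(part, {})
    return root

# Hypothetical resnet-style keys.
keys = ['conv1.weight', 'layer1.0.conv1.weight', 'layer1.0.bn1.bias']
tree = keys_to_tree(keys)
```

Running this on both state_dicts yields the two ordered trees whose largest common part the PR's algorithms then search for.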
Now, I want to ask the question: what is the biggest subgraph that these two directory structures have in common? After searching for a while I found the Maximum Common Induced Subgraph problem, which is NP-hard for graphs. But we have trees, so can we do better? I found Maximum Common Subtree Isomorphism and Maximum Common Subtree, but I wasn't able to follow any of these links to get an algorithm working (although perhaps the first one is worth revisiting).
Eventually I found the paper On the Maximum Common Embedded Subtree Problem for Ordered Trees, which outlines a polynomial-time algorithm for finding maximum common embedded subtrees. The paper was written in a way that I could follow, but unfortunately I missed the detail that an embedding --- although similar to --- is not an isomorphic subgraph.
However, the algorithm for finding a common embedded subtree does still work well for solving my problem. The paper also links to external resources that do tackle the actual common subtree isomorphism problem, but unfortunately they were all behind paywalls.
However, I do believe I was able to modify the recurrence for the maximum common embedding into one that does produce a maximum common isomorphism, and I've empirically verified that it works on a few thousand randomly generated trees.
Remaining Work
So, I have these two networkx algorithms, `maximum_common_ordered_tree_embedding` and `maximum_common_ordered_subtree_isomorphism`, and I'd like to contribute them to `networkx` itself. In their current state they are very messy and unpolished, but I'd want to know if the maintainers are interested in these algorithms before I do any code cleanup. If there is interest, any guidance on where these algorithms should be located would be appreciated (do these go in isomorphisms, or somewhere else?).

EDIT:
Current Summary of Modifications
I'll maintain a top-level summary of modifications in this comment.
- `networkx/algorithms/__init__.py` - expose string and embedding modules
- `networkx/algorithms/embedding/__init__.py` - new algorithm submodule for embedding problems
- `networkx/algorithms/embedding/tree_embedding.py` - ⭐ implements reduction from graph problem to string problem. Defines the function which is the main API-level contribution of this PR: `maximum_common_ordered_tree_embedding`.
- `networkx/algorithms/embedding/tests/test_tree_embedding.py` - associated tests for tree embeddings
- `networkx/algorithms/string/__init__.py` - new algorithm submodule for string problems
- `networkx/algorithms/string/balanced_sequence.py` - ⭐ core dynamic program to solve the maximum common balanced subsequence problem. This is the main algorithmic component of this PR.
- `networkx/algorithms/string/balanced_sequence_cython.pyx` - optional, but faster cython version of balanced_sequence.py
- `networkx/algorithms/string/tests/test_balanced_sequence.py` - associated tests for balanced sequences
- `examples/applications/filesystem_embedding.py` - demonstrates how to solve the path embedding problem using tree embedding. This file likely needs further reorganization, or possibly a separate PR; the rest of the PR does stand alone without it.
- `setup.py` - registered embedding and string subpackages.