
Algos for max ordered common subtree embedding #4169

Closed

Conversation

Erotemic
Contributor

@Erotemic Erotemic commented Aug 17, 2020

Closed in favor of #4327

Summary

This PR is for two new (related) algorithms which I don't believe exist in networkx:

  • (will do this in a separate PR) Maximum ordered common subtree isomorphism: Given two ordered trees: G and H, find the largest subtree G' of G and H' of H where H' is isomorphic to G'.

  • Maximum ordered common subtree embedding: Given two ordered trees G and H, find the largest embedded subtree G' of G and H' of H where H' is isomorphic to G'. A tree G' is an embedded subtree of G if G' can be obtained from G by a series of edge contractions (illustrated below).

These algorithms are restricted to ordered trees because the problem --- at least the embedding one --- is APX-hard for any other relaxation of the input types. The subtree isomorphism variant probably is too, but I'd need to check.
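For concreteness, here is a small illustration of a single edge contraction using the existing nx.contracted_edge helper (this snippet is not part of the PR); an embedded subtree is anything obtainable by repeating this operation:

import networkx as nx

# Contracting the edge (1, 2) merges node 2 into node 1 and reattaches 2's
# children to node 1.
G = nx.DiGraph([(0, 1), (1, 2), (2, 3), (2, 4)])
G2 = nx.contracted_edge(G, (1, 2), self_loops=False)
print(sorted(G2.edges()))  # [(0, 1), (1, 3), (1, 4)]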

Motivation

When working with pytorch-based neural networks, I often encounter a problem where I have some custom module that uses resnet50 as a component, and I would like to simply start from existing pretrained resnet50 weights. However, to do that, the "state_dict" of the model must match the "state_dict" of the saved pytorch weights file. But because my resnet50 model is only a component, the keys don't exactly line up.

For instance, the keys in the resnet file may look like this:

        >>> # This means you can load an off-the-shelf unmodified pretrained resnet50
        >>> # where the keys might look something like this:
        >>> resnet_keys = {
        >>>     'conv1.weight',
        >>>     'layer1.0.conv1.weight',
        >>>     'layer1.0.conv2.weight',
        >>>     'layer1.0.conv3.weight',
        >>>     'layer1.0.downsample.0.weight',
        >>>     'layer2.0.conv1.weight',
        >>>     'layer2.0.conv2.weight',
        >>>     'layer2.0.conv3.weight',
        >>>     'layer3.0.conv1.weight',
        >>>     'layer4.0.conv1.weight',
        >>>     'fc.weight',
        >>>     'fc.bias',
        >>> }

And the "model-state" for the module may look like this:

        >>> # And perhaps you have a model that has a state dict where keys
        >>> # look like this:
        >>> model_keys = {
        >>>     'preproc.conv1.weight'
        >>>     'backbone.layer1.0.conv1.weight',
        >>>     'backbone.layer1.0.conv2.weight',
        >>>     'backbone.layer1.0.conv3.weight',
        >>>     'backbone.layer1.0.downsample.0.weight',
        >>>     'backbone.layer2.0.conv1.weight',
        >>>     'backbone.layer2.0.conv2.weight',
        >>>     'backbone.layer2.0.conv3.weight',
        >>>     'backbone.layer3.0.conv1.weight',
        >>>     'backbone.layer4.0.conv1.weight',
        >>>     'head.conv1'
        >>>     'head.conv2'
        >>>     'head.fc.weight'
        >>>     'head.fc.bias'
        >>> }

Now, yes, I could use (and have used) a hacky solution that tries removing common prefixes, but I wanted something more general, something that would "just work" in almost all cases. After thinking about it for a while I realized that this was a graph problem: I can break the keys up at each "." and build a directory-like tree structure.
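For concreteness, a rough sketch (not the PR code; the helper name here is made up) of how a set of keys can be turned into such a tree:

import networkx as nx

def keys_to_tree(keys, root='<root>'):
    # Split each dotted key into path components and insert the resulting path
    # into a directory-like tree, e.g. 'layer1.0.conv1.weight' becomes
    # <root> -> layer1 -> layer1.0 -> layer1.0.conv1 -> layer1.0.conv1.weight
    tree = nx.DiGraph()
    tree.add_node(root)
    for key in sorted(keys):
        parts = key.split('.')
        parent = root
        for i in range(len(parts)):
            node = '.'.join(parts[:i + 1])
            tree.add_edge(parent, node)
            parent = node
    return tree

resnet_tree = keys_to_tree({'conv1.weight', 'fc.weight', 'layer1.0.conv1.weight'})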

Now I want to ask: what is the biggest subgraph that these two directory structures have in common? After searching for a while I found the Maximum common induced subgraph problem, which is NP-hard for general graphs. But we have trees, so can we do better? I found Maximum Common Subtree Isomorphism and Maximum Common Subtree, but I wasn't able to follow any of those links to get an algorithm working (although perhaps the first one is worth revisiting).

Eventually I found the paper On the Maximum Common Embedded Subtree Problem for Ordered Trees, which outlines a polynomial time algorithm for finding maximum common embedded subtrees. The paper was written in a way that I could follow, but unfortunately I missed the detail that an embedding --- although similar --- is not an isomorphic subgraph.

However, the algorithm for finding a Common Embedded Subtree does still work well for solving my problem. The paper also links to external resources where they do tackle the actual common subtree isomorphism problem, but unfortunately they were all behind paywalls.

However, I do believe I was able to modify the recurrence for the maximum common embedding into one that does produce a maximum common isomorphism, and I've empirically verified that it works on a few thousand randomly generated trees.

Remaining Work

So, I have these two networkx algorithms for maximum_common_ordered_tree_embedding and maximum_common_ordered_subtree_isomorphism, and I'd like to contribute them to networkx itself. In their current state they are very messy and unpolished, but I'd like to know if the maintainers are interested in these algorithms before I do any code cleanup. If there is interest, any guidance on where these algorithms should be located would be appreciated (do these go in isomorphisms, or somewhere else?).


EDIT:

Current Summary of Modifications

I'll maintain a top-level summary of modifications in this comment.

  1. networkx/algorithms/__init__.py - expose string and embedding modules

  2. networkx/algorithms/embedding/__init__.py - new algorithm submodule for embedding problems

  3. networkx/algorithms/embedding/tree_embedding.py - ⭐ implements reduction from graph problem to string problem. Defines the function which is the main API-level contribution of this PR: maximum_common_ordered_tree_embedding.

  4. networkx/algorithms/embedding/tests/test_tree_embedding.py - associated tests for tree embeddings

  5. networkx/algorithms/string/__init__.py - new algorithm submodule for string problems

  6. networkx/algorithms/string/balanced_sequence.py - ⭐ core dynamic program to solve the maximum common balanced subsequence problem. This is the main algorithmic component of this PR.

  7. networkx/algorithms/string/balanced_sequence_cython.pyx - optional, but faster cython version of balanced_sequence.py

  8. networkx/algorithms/string/tests/test_balanced_sequence.py - associated tests for balanced sequences

  9. examples/applications/filesystem_embedding.py - demonstrates how to solve the path embedding problem using tree embedding. This file likely needs further reorganization, or possibly a separate PR; the rest of the PR stands alone without it.

  10. setup.py - registered embedding and string subpackages.

@pep8speaks

pep8speaks commented Aug 17, 2020

Hello @Erotemic! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 14:80: E501 line too long (93 > 79 characters)
Line 25:80: E501 line too long (85 > 79 characters)
Line 96:80: E501 line too long (83 > 79 characters)
Line 99:80: E501 line too long (83 > 79 characters)
Line 325:80: E501 line too long (97 > 79 characters)
Line 425:80: E501 line too long (87 > 79 characters)
Line 428:80: E501 line too long (85 > 79 characters)
Line 446:80: E501 line too long (80 > 79 characters)
Line 453:80: E501 line too long (80 > 79 characters)
Line 490:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (80 > 79 characters)
Line 540:80: E501 line too long (85 > 79 characters)
Line 554:80: E501 line too long (81 > 79 characters)
Line 737:80: E501 line too long (84 > 79 characters)
Line 746:80: E501 line too long (83 > 79 characters)
Line 747:29: E203 whitespace before ':'
Line 785:80: E501 line too long (83 > 79 characters)
Line 792:80: E501 line too long (80 > 79 characters)

Line 36:80: E501 line too long (82 > 79 characters)
Line 46:80: E501 line too long (80 > 79 characters)
Line 85:80: E501 line too long (84 > 79 characters)
Line 91:80: E501 line too long (80 > 79 characters)
Line 99:80: E501 line too long (82 > 79 characters)

Line 127:80: E501 line too long (86 > 79 characters)
Line 129:80: E501 line too long (86 > 79 characters)

Line 16:80: E501 line too long (83 > 79 characters)
Line 53:80: E501 line too long (81 > 79 characters)
Line 59:80: E501 line too long (88 > 79 characters)
Line 63:80: E501 line too long (81 > 79 characters)
Line 73:80: E501 line too long (81 > 79 characters)
Line 146:80: E501 line too long (80 > 79 characters)
Line 373:80: E501 line too long (80 > 79 characters)
Line 414:80: E501 line too long (80 > 79 characters)
Line 424:80: E501 line too long (80 > 79 characters)
Line 441:80: E501 line too long (80 > 79 characters)
Line 475:43: E203 whitespace before ':'
Line 478:34: E203 whitespace before ':'
Line 564:43: E203 whitespace before ':'
Line 567:34: E203 whitespace before ':'

Line 36:80: E501 line too long (83 > 79 characters)
Line 39:80: E501 line too long (84 > 79 characters)

Line 75:80: E501 line too long (88 > 79 characters)

Comment last updated at 2020-10-12 20:36:26 UTC

@dschult
Member

dschult commented Aug 17, 2020

Nice! There's a lot here. We would be interested. There is recent interest in tree isomorphism so that provides a place to put it too. Note: we already have monomorphisms and embeddings in the isomorphisms folder -- we may have to refactor some day, but today is not that day.

One obvious place to put it would be in algorithms/isomorphisms/tree_isomorphism.py. That module will need its docstring and __all__ variable changed at the top, and then your code added at the bottom. Tests would then be added to algorithms/isomorphisms/tests/test_tree_isomorphism.py.

Alternatively, if it makes more sense to you or is easier (we can refactor if needed later) you can make a module algorithms/isomorphisms/subtree.py. Then a separate module for tests. Actually this way may be easier for us for now -- but slightly harder for you adding the hooks below.

Either way, put a hook into isomorphisms/__init__.py to add your functions to the namespace, and a hook in doc/reference/algorithms/isomorphism.rst to add your functions to the docs. It would be really cool if you have a use case to create an "example" in examples/advanced/XXX.py. These appear in the NetworkX Gallery part of the docs. Short example usage is good to have in your function docstring. The "examples" folder is good for slightly longer or more specific examples.

If you have questions, just ask. :)

@Erotemic
Contributor Author

Awesome, I'll put some work into cleaning the code up.

Now that I'm more familiar with this problem I keep running into more and more places where this algorithm will be useful. Unfortunately, the current implementation I have seems too slow. It seems to only scale to graphs with < 500 nodes. I've spent a good deal of time profiling the code and optimizing where I can, but it's still unsatisfactory.

The running time of the algorithm is reported as O(n1 * n2 * min(d1, l1) * min(d2, l2)) for trees with n1 and n2 nodes, depths d1 and d2, and l1 and l2 leaves, respectively. I ran tests on graphs with d1=1, d2=2, and n1=n2=n for n in [1, 10, 20, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500], and plotted the timing results. Ideally, because d1 and d2 are constant, we should see a second-order pattern emerge. I also fitted 2nd, 3rd, and 4th degree polynomials to the data to check how good each fit is (obviously the higher orders will always fit better, so we have to keep that in mind). I've graphed the results below.

[timing plot: runtime vs. number of nodes, with the fitted 2nd, 3rd, and 4th degree polynomials overlaid]

For reference the fit coefficients are:

coeff2 = np.array([ 1.969e-04, -6.714e-02,  3.647e+00], dtype=np.float64)
coeff3 = np.array([ 1.418e-07, -4.869e-06,  2.730e-03,  4.408e-02], dtype=np.float64)
coeff4 = np.array([ 3.167e-11,  7.989e-08,  3.273e-05, -4.573e-03,  2.655e-01], dtype=np.float64)
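For reference, fits like these can be reproduced with np.polyfit along the following lines (the timing values below are placeholders regenerated from the quoted quadratic fit so the snippet runs stand-alone; they are not the measured data):

import numpy as np

sizes = np.array([1, 10, 20, 50, 75, 100, 150, 200, 250, 300, 350, 400, 450, 500])
# Placeholder timings regenerated from the quoted quadratic coefficients; in
# the real experiment these are the measured runtimes for each graph size.
times = np.polyval([1.969e-04, -6.714e-02, 3.647e+00], sizes)

coeff2 = np.polyfit(sizes, times, deg=2)  # highest-degree coefficient first
coeff3 = np.polyfit(sizes, times, deg=3)
coeff4 = np.polyfit(sizes, times, deg=4)
print(np.polyval(coeff2, 10000) / 3600)   # extrapolated hours for 10,000 nodes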

I'm not sure if the second-order fit is good enough. It does seem to imply that I'm getting the correct running time in terms of graph sizes, so perhaps my constant overhead is too big. Currently the algorithm uses a recursive call with a memoization dictionary. I don't see an obvious way to transform it into an iterative scheme, but if there is a way to do that, it should help the runtime a lot by reducing the number of python function calls that need to be made.

For this to be really useful it should be able to scale to graphs with 10s of thousands of nodes. Current projections estimate that 10,000 nodes could take 5, 39, or 111 hours (depending on which fit you believe), which is far too slow.

What is the policy on cython backend implementations for this repo? I think if the algo gets a 100x speedup from a compiled language it will be real-world useful (assuming it's the Python calls that are the real problem).

If anyone has other ideas on how to speed it up, I'd be happy to hear them.

@dschult
Member

dschult commented Aug 19, 2020

Recursive code is good for proofs, not so good for performance. All recursive code can be written without recursion -- the idea is to replace the recursive function call with a stack to hold the state and then make a loop that runs the function and moves things on or off the stack when the recursive function would have been called or returned from. Easy to say, often quite head-bending to implement. But it does speed things up quite a bit.
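A toy illustration of the pattern (nothing to do with the LCS code itself): a recursive tree-size count rewritten with an explicit stack.

def tree_size_recursive(node, children):
    # children: dict mapping node -> list of child nodes
    return 1 + sum(tree_size_recursive(c, children) for c in children[node])

def tree_size_iterative(root, children):
    # Push children onto a stack instead of recursing into them, and keep the
    # partial result in an accumulator instead of on the call stack.
    size = 0
    stack = [root]
    while stack:
        node = stack.pop()
        size += 1
        stack.extend(children[node])
    return size

children = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}
assert tree_size_recursive('a', children) == tree_size_iterative('a', children) == 4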

You say you've done some profiling... What are the bottlenecks?
I think that cython code will work well in cases where the type of the variables is known and low-level and there isn't much translation between low-level representation and higher level representation.

Finally, run-time asymptotics are 1) valid in a limit, so sometimes not accurate for real-world sized "n", and 2) sometimes easy to mix up when coding -- a seemingly innocent idiom can bump up the exponent in the polynomial. To me, the data looks faster than quadratic... but that's just using my "eye-ball norm" :}

I haven't looked through the code in detail, but a first pass suggests that you spend a lot of code packing and unpacking tuples of values... In Python calling functions is often a bottleneck (in my experience) and packing and unpacking tuples is similar (the args to a function are packed to make the call and then unpacked to start the function).

@Erotemic
Contributor Author

For some reason I thought there was a case where it was only possible to perform a task with recursion, but now that I'm refreshing myself on the details, I think I was confusing that with the result that tail-recursive programs can be rewritten as loops without stacks.

I always struggled with the bit of dynamic programming where you transform the recursive function into an iterative one, but I'm sure I can work through it and it will be a good exercise for me. I've seen the speed benefit of doing so, so it will be interesting to see how much of an impact it makes on this problem.


Given the above, I'm going to spend most of my effort working on the translation to an iterative algorithm, but I'll answer the other questions as well:

The algorithm works by translating the tree problem into a string/sequence problem. The conversion to and from these sequences is very fast. The tokens in the sequence don't actually need to be the same as the nodes in the tree; they just need to (1) represent them, (2) have a way to denote whether the token is an opening or closing token, and (3) have a way to map back to the original Python node value. So, it's actually reasonably amenable to translation into cython. The only caveat would be if the user wanted to supply a "node_affinity" function, in which case we would need to jump back into a python call. But for normal cases all of the python objects can technically be abstracted away.

Instead of using a python string, I represent the sequence as a tuple of items (either strings, integers, or 2-tuples; they all profile about the same). The reason for using tuples instead of lists is that the sequences themselves need to be hashed for the memoization (I believe this will have to happen in the iterative algorithm as well). At one time I thought I could enumerate all possible sequences that the algorithm would end up needing (which was actually my first attempt at turning this into an iterative algorithm), but it turned out I missed a case and gave up on that route; perhaps it's worth taking a second look at it.
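To make the reduction concrete, here is a simplified sketch (not the PR's exact encoding) of how a tree becomes a balanced sequence of hashable open/close tokens:

import networkx as nx

def tree_to_balanced_sequence(tree, root):
    # Emit an "open" token when a node is entered and a matching "close" token
    # when it is left; tuples keep the tokens hashable and record the node.
    seq = [('open', root)]
    stack = [(root, iter(tree[root]))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            seq.append(('close', node))
            stack.pop()
        else:
            seq.append(('open', child))
            stack.append((child, iter(tree[child])))
    return tuple(seq)

T = nx.DiGraph([('a', 'b'), ('a', 'c'), ('b', 'd')])
print(tree_to_balanced_sequence(T, 'a'))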

Here is the output of line_profiler for the recursive call. The main bottleneck is the hashing of seq1 and seq2. Originally the dictionary lookups used seq1 and seq2 directly as keys, but calling hash first and then using that integer as the key is measurably faster.

Function: _lcs at line 647

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   647                                           @profile
   648                                           def _lcs(seq1, seq2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo):
   649     80481      55118.0      0.7      3.6      if not seq1:
   650      5577       3929.0      0.7      0.3          return (seq1, seq1), 0
   651     74904      47199.0      0.6      3.1      elif not seq2:
   652      1201        886.0      0.7      0.1          return (seq2, seq2), 0
   653                                               else:
   654                                                   # if len(seq2) < len(seq1):
   655                                                   #     seq1, seq2 = seq2, seq1
   656                                                   # key = (seq1, seq2)
   657     73703     276552.0      3.8     18.2          key1 = hash(seq1)  # using hash(seq) is faster than seq itself
   658     73703     304711.0      4.1     20.0          key2 = hash(seq2)
   659     73703      60020.0      0.8      3.9          key = hash((key1, key2))
   660     73703      58327.0      0.8      3.8          if key in _memo:
   661     37960      28577.0      0.8      1.9              return _memo[key]
   662                                           
   666     35743      25909.0      0.7      1.7          if key1 in _seq_memo:
   667     35342      28331.0      0.8      1.9              a1, b1, head1, tail1, head1_tail1 = _seq_memo[key1]
   668                                                   else:
   669       401      12308.0     30.7      0.8              a1, b1, head1, tail1 = balanced_decomp_unsafe2(seq1, open_to_close)
   670       401        390.0      1.0      0.0              head1_tail1 = head1 + tail1
   671       401        398.0      1.0      0.0              _seq_memo[key1] = a1, b1, head1, tail1, head1_tail1
   672                                           
   673     35743      26558.0      0.7      1.7          if key2 in _seq_memo:
   674     35340      28794.0      0.8      1.9              a2, b2, head2, tail2, head2_tail2 = _seq_memo[key2]
   675                                                   else:
   676       403      13014.0     32.3      0.9              a2, b2, head2, tail2 = balanced_decomp_unsafe2(seq2, open_to_close)
   677       403        402.0      1.0      0.0              head2_tail2 = head2 + tail2
   678       403        382.0      0.9      0.0              _seq_memo[key2] = a2, b2, head2, tail2, head2_tail2
   679                                           
   680                                                   # Case 2: The current edge in sequence1 is deleted
   681     35743      61089.0      1.7      4.0          best, val = _lcs(head1_tail1, seq2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
   682                                           
   683                                                   # Case 3: The current edge in sequence2 is deleted
   684     35743      55926.0      1.6      3.7          cand, val_alt = _lcs(seq1, head2_tail2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
   685     35743      28498.0      0.8      1.9          if val_alt > val:
   686     17386      12013.0      0.7      0.8              best = cand
   687     17386      11350.0      0.7      0.7              val = val_alt
   688                                           
   689                                                   # Case 1: The LCS involves this edge
   690     35743      30563.0      0.9      2.0          t1 = open_to_tok[a1[0]]
   691     35743      30056.0      0.8      2.0          t2 = open_to_tok[a2[0]]
   693     35743     174729.0      4.9     11.5          affinity = node_affinity(t1, t2)
   694     35743      26802.0      0.7      1.8          if affinity:
   695      4497       7392.0      1.6      0.5              new_heads, pval_h = _lcs(head1, head2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
   696      4497       6953.0      1.5      0.5              new_tails, pval_t = _lcs(tail1, tail2, open_to_close, node_affinity, open_to_tok, _memo, _seq_memo)
   697                                           
   698      4497       3597.0      0.8      0.2              new_head1, new_head2 = new_heads
   699      4497       3188.0      0.7      0.2              new_tail1, new_tail2 = new_tails
   700                                           
   701      4497       5169.0      1.1      0.3              subseq1 = a1 + new_head1 + b1 + new_tail1
   702      4497       4671.0      1.0      0.3              subseq2 = a2 + new_head2 + b2 + new_tail2
   703                                           
   704      4497       3249.0      0.7      0.2              cand = (subseq1, subseq2)
   705      4497       3469.0      0.8      0.2              val_alt = pval_h + pval_t + affinity
   706      4497       3380.0      0.8      0.2              if val_alt > val:
   707       919        667.0      0.7      0.0                  best = cand
   708       919        606.0      0.7      0.0                  val = val_alt
   709                                           
   710     35743      26093.0      0.7      1.7          found = (best, val)
   711     35743      29843.0      0.8      2.0          _memo[key] = found
   712     35743      22348.0      0.6      1.5          return found
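For reference, the memo-key trick in the listing above boils down to the following pattern, shown here on a plain longest-common-subsequence length recursion rather than the balanced-sequence version (a simplified sketch, not the PR code):

def lcs_len(seq1, seq2, _memo=None):
    # Hash each (hashable) sequence once and key the memo on the pair of
    # hashes, as in the profile above, instead of on the sequences themselves.
    if _memo is None:
        _memo = {}
    if not seq1 or not seq2:
        return 0
    key = (hash(seq1), hash(seq2))
    if key in _memo:
        return _memo[key]
    if seq1[-1] == seq2[-1]:
        val = 1 + lcs_len(seq1[:-1], seq2[:-1], _memo)
    else:
        val = max(lcs_len(seq1[:-1], seq2, _memo),
                  lcs_len(seq1, seq2[:-1], _memo))
    _memo[key] = val
    return val

assert lcs_len('xmjyauz', 'mzjawxu') == 4  # common subsequence 'mjau'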

I'm pretty sure I've squeezed all the performance possible out of the recursive definition, so let's see what an iterative implementation looks like. I'm changing the title of this PR to [WIP].

@Erotemic Erotemic changed the title Algos for max ordered common subtree embedding/isomorphism [WIP] Algos for max ordered common subtree embedding/isomorphism Aug 19, 2020
@jarrodmillman jarrodmillman marked this pull request as draft August 19, 2020 22:28
@jarrodmillman
Member

I marked this as a draft and removed WIP from the title, which makes it easier for us to track the WIP status. Since WIP isn't hooked into GitHub, we used to merge PRs and forget to remove WIP.

@jarrodmillman jarrodmillman changed the title [WIP] Algos for max ordered common subtree embedding/isomorphism Algos for max ordered common subtree embedding/isomorphism Aug 19, 2020
@Erotemic
Contributor Author

Erotemic commented Aug 21, 2020

I created a basic iterative algorithm, which was less difficult than I thought it was going to be. I'm not sure if there is a better iteration strategy, but I basically simulated a simplified version of the recursive callstack.

Depending on the test I run, it is actually slower than the recursive algorithm (for random graphs), but the iterative version seems faster on larger shallow graphs. I did get one nice result where the recursive algorithm takes 100 seconds and the iterative algorithm takes 6 seconds on two fairly complex depth-10, 200-node trees, so it does look like the change was worth it.

I've tried several variants of the iterative algorithm; the speed of each seems to depend on the input type. I'll need to run a few more tests to characterize when each algorithm is slow. Still a bit more work to do, but it's coming along.

@Erotemic
Contributor Author

Spent a little time working on characterizing when the algorithm is slow / fast, but just as a quick update: Cython helps a lot. I'm getting 40-50x speedups just by using almost exactly the same python code in Cython. Removing the python overhead for simple loop iteration / continue statements / variable assignment / etc. seems to go a long way.

Here are a few timing comparisons:

With one type of graph:

Timed best=0.257 s, mean=0.257 ± 0.0 s for size=50,impl='iter-prehash2-cython',mode='number'
Timed best=9.653 s, mean=9.653 ± 0.0 s for size=50,impl='iter-prehash2',mode='number'
---
Timed best=3.098 s, mean=3.098 ± 0.0 s for size=100,impl='iter-prehash2-cython',mode='number'
Timed best=122.768 s, mean=122.768 ± 0.0 s for size=100,impl='iter-prehash2',mode='number'

With another:

Timed best=0.035 s, mean=0.035 ± 0.0 s for max_depth=5,size=50,impl='iter-prehash2-cython',mode='number'
Timed best=1.115 s, mean=1.115 ± 0.0 s for max_depth=5,size=50,impl='iter-prehash2',mode='number'
Timed best=0.547 s, mean=0.547 ± 0.0 s for max_depth=5,size=50,impl='recurse',mode='number'
---
Timed best=0.093 s, mean=0.093 ± 0.0 s for max_depth=7,size=50,impl='iter-prehash2-cython',mode='number'
Timed best=3.145 s, mean=3.145 ± 0.0 s for max_depth=7,size=50,impl='iter-prehash2',mode='number'
Timed best=1.603 s, mean=1.603 ± 0.0 s for max_depth=7,size=50,impl='recurse',mode='number'
---
Timed best=0.193 s, mean=0.193 ± 0.0 s for max_depth=9,size=50,impl='iter-prehash2-cython',mode='number'
Timed best=6.831 s, mean=6.831 ± 0.0 s for max_depth=9,size=50,impl='iter-prehash2',mode='number'
Timed best=3.476 s, mean=3.476 ± 0.0 s for max_depth=9,size=50,impl='recurse',mode='number'
---
Timed best=0.004 s, mean=0.004 ± 0.0 s for max_depth=1,size=100,impl='iter-prehash2-cython',mode='number'
Timed best=0.064 s, mean=0.064 ± 0.0 s for max_depth=1,size=100,impl='iter-prehash2',mode='number'
Timed best=0.032 s, mean=0.032 ± 0.0 s for max_depth=1,size=100,impl='recurse',mode='number'
---
Timed best=0.041 s, mean=0.041 ± 0.0 s for max_depth=3,size=100,impl='iter-prehash2-cython',mode='number'
Timed best=1.339 s, mean=1.339 ± 0.0 s for max_depth=3,size=100,impl='iter-prehash2',mode='number'
Timed best=0.680 s, mean=0.680 ± 0.0 s for max_depth=3,size=100,impl='recurse',mode='number'
---
Timed best=0.143 s, mean=0.143 ± 0.0 s for max_depth=5,size=100,impl='iter-prehash2-cython',mode='number'
Timed best=5.102 s, mean=5.102 ± 0.0 s for max_depth=5,size=100,impl='iter-prehash2',mode='number'
Timed best=2.697 s, mean=2.697 ± 0.0 s for max_depth=5,size=100,impl='recurse',mode='number'
---
Timed best=0.348 s, mean=0.348 ± 0.0 s for max_depth=7,size=100,impl='iter-prehash2-cython',mode='number'
Timed best=12.108 s, mean=12.108 ± 0.0 s for max_depth=7,size=100,impl='iter-prehash2',mode='number'
Timed best=6.395 s, mean=6.395 ± 0.0 s for max_depth=7,size=100,impl='recurse',mode='number'
---
Timed best=0.574 s, mean=0.574 ± 0.0 s for max_depth=9,size=100,impl='iter-prehash2-cython',mode='number'
Timed best=21.354 s, mean=21.354 ± 0.0 s for max_depth=9,size=100,impl='iter-prehash2',mode='number'
Timed best=11.408 s, mean=11.408 ± 0.0 s for max_depth=9,size=100,impl='recurse',mode='number'

In a lot of cases the recursive algorithm is still winning over the iterative algorithm, but the Cython-iterative algorithm blows them all away.

@Erotemic
Contributor Author

I think most of the engineering work on the algorithm is done. I've started a cleanup of the code. I've moved the relevant files to a subfolder _embeddinghelpers to keep the code contained while it's being reorganized.

(A side effect of this process is that I wrote a nifty script for converting google-style docstrings to numpy-style docstrings.)

The next major step is to write all the unit-tests (many of which can be created from existing doctests), but at this stage the algorithm is technically usable.

Any input or guidance on code structure as time allows would be appreciated.

Main questions are:

  • I currently have a module for demodata; should that be consolidated into another package?
  • I have a module for the core tree_embedding problem and a module for the balanced_sequence subproblem. How should those files be organized?
  • In balanced_sequence there are several pure-python backends for the iterative and recursive variants, as well as an iterative cython implementation (which gives a 40-50x speed boost; I imagine a pure-C implementation would be even better). Which of those should be exposed / kept / removed / refactored?
  • I have a file path_embedding, which actually uses the networkx algorithm as a method to solve a related path problem. While this isn't a graph-based function, it's still useful, and it does make writing and visualizing tests much easier. Does it belong in networkx at all? Could it be included for testing?
  • There is a benchmarks.py file where I want to consolidate benchmarks. Does that live in the networkx repo? Or does that go in my misc git repo?

@Erotemic
Contributor Author

Erotemic commented Aug 29, 2020

The current state of this branch is getting close to completion. I've decided to restrict this to only the embedding variant of the problem, and I'll submit the isomorphism part in a separate PR; this one is big enough already, but I think we can slim it down some. However, there are parts that are nearly ready or ready for review, so it might be a good time for me to ask for some feedback. I'll post specific questions later, but for now here's roughly the state of things.

I've put everything into an _embedding submodule that we can move to the appropriate place later. In that there is an __init__.py that gives an overview of the module layout, notes outstanding issues, and defines the user-facing API.

The other files are as follows:

balanced_sequence.py - core python implementations for the longest common balanced sequence subproblem.

balanced_sequence_cython.pyx - faster alternative implementations for balanced_sequence.py.

tree_embedding.py - defines the reduction from the tree problem to balanced sequence problems.

path_embedding.py - defines the reduction from the path problem to the tree problem (not core; this is just useful for testing among other things).

demodata.py - contains data for docstrings, benchmarks, and synthetic problems.

One of the first decisions I need help making is which variants of the algorithm are worth keeping. Currently I have 8 (or 8 * 2 if you count the way I'm doing the reduction as a variable).

Tests on reasonably-sized "large datasets" showed the following times:

        # runparam_to_time = {
        #     ('chr', 'iter-alt2-cython')    : {'mean': 0.282, 'max': 0.923},
        #     ('chr', 'recurse')             : {'mean': 0.397, 'max': 1.297},
        #     ('chr', 'iter-alt2')           : {'mean': 0.409, 'max': 1.328},
        #     ('chr', 'iter-alt1')           : {'mean': 0.438, 'max': 1.428},
        #     ('chr', 'iter-prehash2-cython'): {'mean': 0.511, 'max': 1.668},
        #     ('chr', 'iter')                : {'mean': 0.580, 'max': 1.915},
        #     ('chr', 'iter-prehash2')       : {'mean': 0.605, 'max': 1.962},
        #     ('chr', 'iter-prehash')        : {'mean': 0.679, 'max': 2.211},
        # }

where the second item in the "key" tuple is the implementation codename. One of the cython algorithms performs the best, and the recursive algorithm is the next best. However, while it is the fastest pure-python option, the recursive algorithm cannot handle trees beyond a certain depth, so the "iter-alt2" implementation seems like the one to keep in this case.

In a second test I also vary my sequence-encoding strategy:

        # runparam_to_time = {
        #     ('chr', 'iter-alt2-cython')       : {'mean': 0.036, 'max': 0.094},
        #     ('chr', 'iter-alt2')              : {'mean': 0.049, 'max': 0.125},
        #     ('chr', 'iter-alt1')              : {'mean': 0.050, 'max': 0.129},
        #     ('chr', 'iter-prehash2-cython')   : {'mean': 0.057, 'max': 0.146},
        #     ('number', 'iter-prehash2-cython'): {'mean': 0.057, 'max': 0.146},
        #     ('chr', 'iter')                   : {'mean': 0.064, 'max': 0.167},
        #     ('chr', 'iter-prehash2')          : {'mean': 0.066, 'max': 0.170},
        #     ('number', 'iter-prehash2')       : {'mean': 0.067, 'max': 0.176},
        #     ('chr', 'iter-prehash')           : {'mean': 0.073, 'max': 0.187},
        #     ('number', 'iter-prehash')        : {'mean': 0.074, 'max': 0.196},
        #     ('number', 'iter-alt1')           : {'mean': 0.126, 'max': 0.344},
        #     ('number', 'iter-alt2-cython')    : {'mean': 0.133, 'max': 0.363},
        #     ('number', 'iter')                : {'mean': 0.140, 'max': 0.386},
        #     ('number', 'iter-alt2')           : {'mean': 0.149, 'max': 0.408},
        # }

chr uses a string-based sequence of utf8 chars, whereas number uses a list of numbers to represent the sequence (I also tried a list of tuples, but it was far worse than these methods). The chr method does seem a good deal faster, but it can only handle a maximum of 556,032 distinct nodes; number would be able to scale to any number of nodes, so I don't know which one to use as the default. (For reference, we want to encode a balanced sequence, so we need to map each node to an opening and a closing "token"; the number strategy uses n and -n and the chr strategy uses chr(n * 2) and chr(n * 2 + 1), where n >= 1.) Should networkx assume it can never handle graphs with > 5e5 nodes, which means just use chr, or should it always "do the right thing no matter the cost" and use number? Or I suppose I could just do a check on the size, use chr in 99.9% of cases, and allow the user to shoot themselves in the foot if they ever call this algo with graphs that have over half a million nodes.
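Concretely, the two encodings map a node index n >= 1 to open/close tokens roughly like this (a sketch, not the exact PR code):

def number_tokens(n):
    # 'number' mode: n opens and -n closes; scales to any number of nodes
    return n, -n

def chr_tokens(n):
    # 'chr' mode: adjacent code points open and close; the whole sequence can
    # then live in a single python str, but the number of distinct nodes is
    # bounded by the usable code-point range.
    return chr(n * 2), chr(n * 2 + 1)

print(number_tokens(3))  # (3, -3)
print(chr_tokens(3))     # ('\x06', '\x07')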

@dschult @jarrodmillman: List of things to decide are:

  • Cython algorithm: remove / keep+always use / keep+optionally use # I would vote for the last one here
  • What is the default backend "impl", which of the existing options do we keep?
  • What is the default sequence encoding "mode", which of the existing options do we keep?

There are still small improvements I need to take care of, but having answers to these questions would help me complete that work.

@jarrodmillman
Member

We currently don't use Cython and will need to make a decision whether we will start before merging this. Historically, our focus has been on simple implementations over performance. But, now is a good time for us to revisit this.

We have plenty of time before the next release, so let's leave it in the PR until we revisit whether or not to change our Cython policy. We may need to create a NXEP to decide what we are going to do about optimizing code for performance. Thanks for your patience.

@dschult
Member

dschult commented Sep 12, 2020

This is a really big PR. It's likely to take a long time to review. We can proceed in a couple of ways. Either split up the PR into many (ugh) or split up review into parts (maybe also ugh). I'd like to consider the heart of the new algorithm, but it's hard to know where the boundary of that code is. Which parts are interlinked and which could stand separately? It doesn't have to be split into multiple github PRs, but it might be helpful to know where the interdependent algorithm code is.

My guess is that balanced_sequence is the base code of this PR.
It looks like balanced_sequence is needed for tree_embedding but not the other direction.
Same for path_embedding. Also, path_embedding and tree_embedding are independent...???
The demodata is used in some of the tests, but isn't crucial to make simple tests.
The benchmarks are not depended upon elsewhere.

What do you think about putting benchmarks, and maybe demodata into the /examples/ folder?
It wouldn't take much to make them into an example, right? Mostly adding documentation explaining what they do. But I might be missing something -- I didn't look into that closely.

Maybe we can start with balanced_sequence. There are many similar functions in there... Are they doing the same thing different ways? Or is it similar things in similar ways? What's the story there?

@Erotemic
Contributor Author

It is bigger than I had hoped it would be, but it was tricky to implement efficiently: the algorithm is based on a reduction to a string problem, which means there has to be a conversion layer between the nx.Graph and some "string" (i.e. indexable) Python data structure. My original version was much smaller, but also orders of magnitude slower and prone to stack overflow.

Your understanding of the dependencies is correct. The core of the algorithmic component is balanced_sequence, which handles the string problem that the tree problem is reduced to. There are several different implementations of the same core logic (the protected _lcs_* functions), and there is a public member longest_common_balanced_sequence that provides parameterized access to these different implementations (via the impl argument). This is the file that needs algorithmic review; all other files are just interfaces to this problem and only need an API-level review.

On the core balanced_sequence module

The core groups of functions (which may have different implementations) in balanced_sequence are:

  • Helper data structures

    • UnbalancedException - raised if input is not balanced
    • IdentityDict - syntactic sugar for obj[x] = x
  • public API

    • longest_common_balanced_sequence - access algorithm
    • available_impls_longest_common_balanced_sequence - check which impls are available
  • Core Algorithm

    • Different LCS functions (core-dynamic-program)

      • _cython_lcs_backend - returns the cython module if available
      • _lcs_recurse - original recursive implementation (remove? lose ability to benchmark, faster in some cases)
      • _lcs_iter_simple - original iterative implementation (remove? - not as good as alt2)
      • _lcs_iter_simple_alt1 (remove? - not as good as alt2)
      • _lcs_iter_simple_alt2 (keep? - best simple python implementation - uses a depth-first-like iteration pattern)
      • _lcs_iter_prehash - (remove? see prehash discussion)
      • _lcs_iter_prehash2 - (remove? see prehash discussion)
    • Balanced Decomposition - Decompose a balanced sequence into components: head, tail, head|tail

      • balanced_decomp - checks that the input is a balanced sequence
      • balanced_decomp_prehash - faster method when balanced sequence data structure is a list of tuples (*see prehash discussion)
      • balanced_decomp_unsafe - assumes the input is balanced to avoid checks for speed
    • Generate Balance: Iterates through a balanced sequence and marks locations of balance (used in the decomposition)

      • generate_balance : function
      • generate_balance_unsafe : function
    • Generate All Decomp - recursively divides a balanced sequence into components.

      • generate_all_decomp -
      • generate_all_decomp_prehash - prehash variant

There is also a Cython version of balanced_sequence (named balanced_sequence_cython), which implements more variations for solving the "_lcs" problem. I wrote the code in such a way that the public API's default behavior is to use cython if available, but fall back to pure python otherwise. It is the fastest algorithm by a significant margin, so I do recommend considering including it.

On the prehash behavior

There is a lot of redundant behavior in this file, and it might be possible to reduce it, depending on how we want to proceed. The "prehash" stuff above is mainly useful when balanced sequences are stored as a list of tuples. The main advantage of this representation is the quick and easy-to-see access to the internals. However, when the data structure is a python str or a list[int], prehashing does not seem to benchmark better than the other methods. Thus we could restrict the balanced sequence to only be str | List[int], not include List[Tuple], and remove the prehash methods.

I would have done that already, but I had to benchmark it to know which combination of data structure + logic was going to work best, and I wanted to ensure that some hash in the git history contains the code so that the experiment I did can be reproduced. So I wanted to note that it exists, but now that I have, I'd advocate for removing it because it won't be useful in practice; I also wouldn't object to keeping it in as an option, as it may aid future development.

On non-core components

-> means depends on

path_embedding -> tree_embedding -> balanced_sequence

demodata enables both the benchmarks and the tests; I think some of the functions in demodata may be better suited elsewhere. I was surprised there wasn't already a function like random_ordered_tree with a create_using arg that I could use to make a random OrderedDiGraph. The random_balanced_sequence and random_paths functions may be able to find homes elsewhere too.
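Roughly what I have in mind for random_ordered_tree, as a sketch rather than the PR's implementation:

import networkx as nx

def random_ordered_tree(n, seed=None):
    # Sketch only (not the PR code): build an undirected random tree, then
    # orient it away from node 0 into an OrderedDiGraph so that the order of
    # children is fixed.
    utree = nx.random_tree(n, seed=seed)
    otree = nx.OrderedDiGraph()
    otree.add_nodes_from(utree.nodes)
    otree.add_edges_from(nx.bfs_edges(utree, 0))
    return otree

tree = random_ordered_tree(10, seed=0)
print(nx.is_arborescence(tree))  # True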

I think putting benchmarks in an examples folder is a good idea. I think it might also be a good idea to put path_embedding there (because a path_embedding is really just a use-case of the graph algorithm) as it doesn't make much sense to offer that functionality in a graph library.

On Cython

For reasonably stable algorithms it makes sense to have a faster Cython implementation if networkx would like to improve its scalability. However, I understand the maintenance and build-time costs. A module becomes much harder to maintain once you need to worry about binaries. I've had a lot of experience with it via the kwimage, pyhesaff, and pyflann packages. Fortunately scikit-build does make it somewhat simpler to manage, but I know there is an outstanding pypy issue.

I think it makes sense to never rely on a Cython implementation, but it should be possible to use one if it exists. Thus a Cython algorithm should always reproduce functionality available in pure python, as I've done here.

My recommendation is to include the Cython implementation in this PR, but do nothing else for the time being. If users wish to use it, they can compile it themselves, as it will be distributed with networkx. When the scikit-build pypy issue gets fixed we can talk about building manylinux, osx, and win32 wheels on CI (we can also talk about auto-publishing and auto-GPG signing on CI if you are interested; I'm not sure what your current release workflows are). In the unlikely case where it falls out of sync with the pure-python implementation, it can be removed.

@Erotemic
Contributor Author

Erotemic commented Oct 11, 2020

@dschult @jarrodmillman: In the absence of feedback (which is understandable; I understand the time constraints of working on OSS), I made a few decisions that will hopefully make review easier. I tended to err on the side of removing things.

Current Summary of Modifications

  1. networkx/algorithms/__init__.py - expose string and embedding modules

  2. networkx/algorithms/embedding/__init__.py - new algorithm submodule for embedding problems

  3. networkx/algorithms/embedding/tree_embedding.py - ⭐ implements reduction from graph problem to string problem. Defines the function which is the main API-level contribution of this PR: maximum_common_ordered_tree_embedding.

  4. networkx/algorithms/embedding/tests/test_tree_embedding.py - associated tests for tree embeddings

  5. networkx/algorithms/string/__init__.py - new algorithm submodule for string problems

  6. networkx/algorithms/string/balanced_sequence.py - ⭐ core dynamic program to solve the maximum common balanced subsequence problem. This is the main algorithmic component of this PR.

  7. networkx/algorithms/string/balanced_sequence_cython.pyx - optional, but faster cython version of balanced_sequence.py

  8. networkx/algorithms/string/tests/test_balanced_sequence.py - associated tests for balanced sequences

  9. networkx/generators/random_graphs.py - minor change to add a new function: random_ordered_tree

  10. examples/applications/filesystem_embedding.py - demonstrates how to solve the path embedding problem using tree embedding. This file likely needs further reorganization, or possibly a separate PR; the rest of the PR stands alone without it.

Details

More details on the changes I made to arrive here:

  • I removed all but two algorithm backend implementations for lcs: the best iterative Python and Cython implementations. This means the "recursive" and other alternative implementations are no longer options in the benchmarks. However, the API is simpler and the only algorithm duplication is between python and cython. If we do want any of these alternative implementations, I can revert these changes.

  • I kept the cython algorithm as an optional component. I can remove it if necessary. The pure-python algorithm does now operate at reasonable speeds, but Cython is noticeably faster.

  • I removed the "tuple" option for representing balanced sequences. Now only chr and number are available. I also cleaned up the signature for related functions.

  • Based on @dschult's suggestion I moved all of the path embedding logic into an example file: examples/applications/filesystem_embedding.py. I moved all the functions, benchmarks, and associated tests into that file as well. I'm open to further refactoring ideas for this.

  • I also split the contribution into two separate algorithm modules: string and embedding. This means:

    • networkx.algorithms.string contains the core balanced sequence problems, and tests, which should make it simpler to review.
    • networkx.algorithms.embedding has the tree_embedding module, which contains the main purpose of this PR: the new maximum_common_ordered_tree_embedding function.
  • I moved the random_ordered_tree function I use in my tests and examples into networkx.generators.random_graphs. I think that location makes the most sense for that function.

Questions I have about the current state of the code:

  • does the function forest_str in networkx/algorithms/embedding/tree_embedding.py belong anywhere else? Thoughts on that function?

@Erotemic Erotemic force-pushed the dev/ordered_subtree_embeddings branch from c86a231 to 1277475 on October 11, 2020 01:26
@dschult
Member

dschult commented Oct 12, 2020

I almost wonder if forest_str should start a module called networkx/drawing/ascii.py (or utf8; I guess ascii is technically incorrect, but it's the common name for these kinds of drawings). :)

Also, this PR uses xdoctest for doctests. Can you make it work with pytest --doctest-modules?

@Erotemic
Contributor Author

Erotemic commented Oct 12, 2020

I like having a separate module for forest_str. How about networkx/drawing/text.py? If not, utf8.py is fine.

I've edited the main post to include the summary of all module changes, which I will keep up-to-date as this PR continues.

As for xdoctest, I would really prefer to continue to use it. However, I do understand the desire to minimize dependencies, so my most recent commit reworks my doctests to be backwards-compatible (at the cost of some readability and the soon-to-be-mentioned issues). That being said, I do hope you consider including xdoctest in the future. I do think it would be an overall improvement, and it would make writing / maintaining doctests much easier.

The case for xdoctest in networkx

The main feature I'm making use of is "new-style got/want" tests, where print statements don't need to be broken up. For instance, the following works with xdoctest:

    Example
    -------
    >>> open_to_close = {'{': '}', '(': ')', '[': ']'}
    >>> seq = '({[[]]})[[][]]{{}}'
    >>> all_decomp = generate_all_decomp(seq, open_to_close)
    >>> node, *decomp = all_decomp[seq]
    >>> pop_open, pop_close, head, tail, head_tail = decomp
    >>> print('node = {!r}'.format(node))
    >>> print('pop_open = {!r}'.format(pop_open))
    >>> print('pop_close = {!r}'.format(pop_close))
    >>> print('head = {!r}'.format(head))
    >>> print('tail = {!r}'.format(tail))
    >>> print('head_tail = {!r}'.format(head_tail))
    node = '('
    pop_open = '('
    pop_close = ')'
    head = '{[[]]}'
    tail = '[[][]]{{}}'
    head_tail = '{[[]]}[[][]]{{}}'

However, to refactor that to work with the builtin doctest module would require following each print statement with the text it produced:

    Example
    -------
    >>> open_to_close = {'{': '}', '(': ')', '[': ']'}
    >>> seq = '({[[]]})[[][]]{{}}'
    >>> all_decomp = generate_all_decomp(seq, open_to_close)
    >>> node, *decomp = all_decomp[seq]
    >>> pop_open, pop_close, head, tail, head_tail = decomp
    >>> print('node = {!r}'.format(node))
    node = '('
    >>> print('pop_open = {!r}'.format(pop_open))
    pop_open = '('
    >>> print('pop_close = {!r}'.format(pop_close))
    pop_close = ')'
    >>> print('head = {!r}'.format(head))
    head = '{[[]]}'
    >>> print('tail = {!r}'.format(tail))
    tail = '[[][]]{{}}'
    >>> print('head_tail = {!r}'.format(head_tail))
    head_tail = '{[[]]}[[][]]{{}}'

Personally, I think the former is far more readable (you see a block of code that produces something and then you see the output as one single chunk, versus forcing humans to work like a REPL).

There are also minor issues with trailing whitespace causing got/want errors in the original doctest, the fact that any output not captured in a variable must be provided with a "want" string, and the general issue of forcing the programmer to distinguish between lines that start a statement and lines that continue a previous statement.

For reference, xdoctest is small, has minimal dependencies, and is 100% compatible with the current structure of networkx (in fact networkx was one of the main test cases I used to ensure backwards compatibility when I wrote xdoctest); running pytest --xdoctest-modules --xdoctest-global-exec "import networkx as nx" will run all of the existing tests correctly.

@dschult
Member

dschult commented Oct 12, 2020

Thanks for that description of xdoctest.

And I think readwrite/text.py is a good name.

@Erotemic Erotemic changed the title Algos for max ordered common subtree embedding/isomorphism Algos for max ordered common subtree embedding Oct 12, 2020
@jarrodmillman
Member

I am not sure I like having the doctests behave differently than Python. Is this a widely used package? What other projects use it?

@Erotemic
Contributor Author

Erotemic commented Oct 27, 2020

@jarrodmillman In case I haven't mentioned it, I am the author of xdoctest. I've used xdoctest in every project I've worked on since I originally wrote it, including ubelt (10k downloads / month) and ibeis (fewer users, but a highly domain-specific library). For projects where I'm not the author, the largest package xdoctest is used by is mmdetection, and most of the 526 packages listed as github dependents are forks of mmdet. Other notable users are imgaug, xclim, and pgmpy. I think there are others, but I'd have to sift through that list. Xdoctest is also featured in Chapter 5 of hypermodern python, and I presented a talk about it at PyCon2020.

On the issue of doctests behaving differently than Python's builtin doctest module: the module is almost entirely backwards compatible (I recently patched a few more corner cases that popped up after I gave the PyCon talk, and I'll patch more if I find them). It's really more of an extension of Python's builtin system that allows for more flexibility when writing tests (the biggest benefit being that you don't have to manually distinguish between lines starting with >>> and ...; just prefix everything with >>> and xdoctest will determine whether there is a continuation or not).

@jarrodmillman
Member

I am not so concerned with it not behaving like the doctest module, but that it isn't how the Python interpreter works. These are supposed to be little snippets of how the code behaves at the command line. In your example above, the output of the print statements doesn't show up after each command, but after all the print statements. I am also -1 on using >>> when Python should have ...

@jarrodmillman
Member

I also like the name readwrite/text.py.

@Erotemic
Contributor Author

Erotemic commented Oct 28, 2020

that it isn't how the Python interpreter works

I'm confused here. The Python interpreter --- and more generally the Python language --- doesn't require or even say anything about ... versus >>> . They are simply prefix characters that the interactive interpreter uses to denote when a new outer-most statement begins.

If you input

>>> if 1:
>>>     print('hi')

or

>>> if 1:
...     print('hi')

to the IPython interpreter, both of them work (passing them to the regular python interpreter will actually fail because neither is actually valid python syntax). There is nothing special about the PS1 and PS2 prefixes from Python's perspective.

What would be more accurate is to say xdoctest does not implement a REPL (read eval print loop) by default (although you can force it to use REPL mode where the output must come after the expression that generated it). Instead, xdoctest behaves more like Jupyter cells, and checks that all of the output produced by a cell matches the expected output.

Really, the only reason that the builtin doctest forces use of the ... is because it relies on regular expressions (the re module) to parse Python code (which actually doesn't make any mathematical sense). The xdoctest module instead uses the ast module and can actually understand the structure of the language, so it can determine when a statement starts and ends (which is actually critical for REPL mode).

These are supposed to be little snippets of how the code behaves at that commandline. In your example above, the output of the print statements doesn't show up after the command, but after all the print statements.

I don't see how any of xdoctest's features disagree with that premise. Imagine you see this snippet online:

if 1:
    print('hi')
if 1:
    print('hi')
    print('hi')
if 1:
    print('hi')

What you get when you paste it into an IPython terminal is

[screenshot of an IPython terminal: pasting the whole snippet prints four lines of "hi"]

So doesn't it make more sense to use the xdoctest-allowed way of specifying the doctest syntax:

>>> if 1:
>>>     print('hi')
>>> if 1:
>>>     print('hi')
>>>     print('hi')
>>> if 1:
>>>     print('hi')
hi
hi
hi
hi

instead of the only way you can write this with the builtin doctest module?

>>> if 1:
...     print('hi')
hi
>>> if 1:
...     print('hi')
...     print('hi')
hi
hi
>>> if 1:
...     print('hi')
hi

While only the latter case works in the original doctest, both cases work in xdoctest. By forcing you into the latter, it actually becomes harder for users to copy-paste snippets from autodoc-generated docs. In the former I can copy one block and compare all my output against what I see in the docs. In the latter case I have to copy each expression over bit by bit. When copying a snippet, I generally want to be able to paste the whole thing.

However, this example does open up a valid criticism: "Isn't there an ambiguity? How can you ensure that the second block did emit two 'hi's?". My response is: yes, there are rare corner cases where the checker is ambiguous. However, it is rare for these cases to emerge in the wild, and the large majority of failing outputs will be caught. Furthermore, if you are relying on a doctest for exact string matching, I would recommend you move that test to a proper unit test, because doctests are often messy and largely benefit from fuzzy matching. And if that doesn't convince you, you can always put xdoctest into REPL mode and disable fuzzy matching to remove any ambiguity.

Perhaps I'm misunderstanding your comment, but I don't see how the arguments follow. Anyway, this seems like a discussion for another thread or issue. The changes in this PR no longer require xdoctest, so the point is moot here. However, I do think networkx stands to gain from including xdoctest, so I started an issue here #4295.


In news more relevant to this PR, I opened #4294, which moves forest_str and random_ordered_tree into a new PR; this should help divide and conquer the complexity of reviewing this one.

Also, I noticed in the mission statement:

Sometimes we provide two algorithms for the same result, one using each data structure, when pedagogy or space/time trade-offs justify such multiplicity

So maybe that is justification for including the recursive implementation of LCS as an option? It can demonstrate how memoization speeds up recursive dynamic programs (until you ultimately hit the program stack limit). It also provides an easier-to-understand algorithmic reference, and by comparing it with the iterative counterpart it can be used to verify the iterative version's correctness. Perhaps I should add that implementation back in as an available option?

I'm going to have some time off work in the next two weeks, so I want to spend a bit of time finishing up this PR --- ensuring docs exist, splitting any other part off as needed, etc... So be on the lookout for that in the next two weeks.

@dschult
Copy link
Member

dschult commented Oct 28, 2020

I'm +1 for including a recursive version as well as the non-recursive version. Perhaps called recursive_restofname or similar to easily identify it as the same function with a different approach.

@jarrodmillman
Copy link
Member

jarrodmillman commented Oct 28, 2020

I agree that it is better to have this discussion separately. It isn't essential to this PR, which has a lot of good stuff. I also appreciate that you split this PR up to make it easier to review.

It is great that you will have time soon to work on finishing this up. I am looking forward to it.

FYI, I am going to be pretty busy for the rest of the year: I have a bunch of small items I want to make sure get included in the upcoming 2.6 release, and I will be spending most of my time working on my dissertation. So, if I am not able to keep up with your work, please don't take it as an indication that I am not interested. Our goal for the 2.6 and 3.0 releases is to remove technical debt and improve the PR review process. The plan is to start focusing more on exciting new features and algorithms after 3.0, which will be released in early 2021 (hopefully January). You've probably noticed that we've been adding things like the mission statement and other development guide material. Our hope is to regrow the core developer community over the next year. Thanks for bearing with us.

@Erotemic Erotemic force-pushed the dev/ordered_subtree_embeddings branch from 83eab5b to 5dcf251 on November 1, 2020 21:30
@Erotemic
Copy link
Contributor Author

Erotemic commented Nov 1, 2020

I just squashed, rebased on #4294, and reintroduced the recursive version of the LCS algorithm. Once that gets merged the number of modified files in this PR will decrease from 16 to 12.

@dschult My preferred way to handle different implementations of the same underlying algorithm is to keep the core implementations as protected functions, named in much the same way as you described; here I have _lcs_recurse and _lcs_iter. However, when it comes to exposing them in the API, I think it adds too much clutter to have different variants of the same algorithm take up more than one function in the main API. Instead I prefer a driver function (in this case longest_common_balanced_sequence) that takes a parameter impl (or something similar) and calls one of the protected functions based on that string setting. This lets you define a list of all available implementations and loop over them to ensure they agree with one another (which is something I do in the tests). I also like defaulting impl to "auto" and including a heuristic that chooses the "best" implementation depending on availability and the given inputs (which is also how I propose to address the Cython integration).
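
A minimal sketch of that dispatch pattern, under the assumption that _lcs_iter and _lcs_recurse exist with matching signatures; the stub bodies, the simplified signature, and the size threshold used for "auto" are placeholders, not the code in this PR.

def _lcs_iter(seq1, seq2):
    ...  # stand-in for the iterative dynamic-programming backend


def _lcs_recurse(seq1, seq2):
    ...  # stand-in for the memoized recursive backend


_IMPLEMENTATIONS = {
    'iter': _lcs_iter,
    'recurse': _lcs_recurse,
}


def longest_common_balanced_sequence(seq1, seq2, impl='auto'):
    # Single public entry point; the backend is chosen by a string setting.
    if impl == 'auto':
        # Placeholder heuristic: avoid deep recursion on large inputs.
        impl = 'iter' if max(len(seq1), len(seq2)) > 500 else 'recurse'
    try:
        backend = _IMPLEMENTATIONS[impl]
    except KeyError:
        raise KeyError(
            f'Unknown impl={impl!r}; valid choices are {sorted(_IMPLEMENTATIONS)}'
        )
    return backend(seq1, seq2)

The tests can then loop over _IMPLEMENTATIONS.values() and assert that every backend returns the same answer on the same inputs.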

@Erotemic
Copy link
Contributor Author

Erotemic commented Nov 7, 2020

New work includes cleaning up the API, removing dead code, and addressing outstanding issues.

Something I didn't expect to do, but it just sort of happened: I wrote an "auto-jit" utility for Cython that autocompiles the pyx file when that is possible and otherwise falls back on pure Python. Essentially, all that would need to change is including the pyx file as a module resource in setup.py. If the cython executable is available, the utility will run it to compile the shared-object (SO) file. It also writes a cache stamp file, so it won't recompile unless the source pyx has changed or the SO is missing. We can always disable it by default (currently it's enabled by default) or use an environment variable to control it. I think it's a pretty neat idea --- cythonize if you have the right tools installed, otherwise don't. Longer term it would be nice to use skbuild to package networkx wheels if networkx is going to contain binaries.
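
As a rough sketch of the idea (this is not the utility in this PR; the helper name and stamp layout are invented, and it shells out to the cythonize CLI rather than invoking cython directly):

import hashlib
import importlib
import pathlib
import shutil
import subprocess


def import_with_autojit(modname, pyx_path):
    # Try to build and import the Cython extension for ``pyx_path``.
    # Returns the compiled module, or None so the caller can fall back
    # on the pure-Python implementation.
    pyx_path = pathlib.Path(pyx_path)
    if shutil.which('cythonize') is None:
        return None  # no Cython toolchain available

    # A stamp file records the hash of the pyx source that was last built,
    # so we only recompile when the source changes or nothing was built yet.
    stamp_path = pyx_path.with_suffix('.build_stamp')
    source_hash = hashlib.sha256(pyx_path.read_bytes()).hexdigest()
    if not stamp_path.exists() or stamp_path.read_text() != source_hash:
        subprocess.run(['cythonize', '--inplace', str(pyx_path)], check=True)
        stamp_path.write_text(source_hash)

    try:
        return importlib.import_module(modname)
    except ImportError:
        return None  # e.g. the SO was deleted; fall back to pure Python

The caller would then use the returned module when it is not None and the pure-Python functions otherwise.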

I also expanded the docs in several places with an emphasis on teaching about the algorithm.

There are two more items I'd like to address before a full review:

  • Cleanup examples/applications/filesytem_embedding.py
  • Add relevant non-docstring docs, changelog, etc...

Which brings me to a question: is there any advice on where / how the rst docs should be updated for this new algorithm? And is there a changelog I need to fill out, or is that generated via a git script?

@Erotemic Erotemic force-pushed the dev/ordered_subtree_embeddings branch 3 times, most recently from 213a5de to 3f21748 on November 8, 2020 18:18
@Erotemic Erotemic force-pushed the dev/ordered_subtree_embeddings branch from 4f75b5e to b1ed6b2 on November 8, 2020 21:42
@Erotemic Erotemic force-pushed the dev/ordered_subtree_embeddings branch from b1ed6b2 to a5c3299 on November 8, 2020 21:52
@Erotemic Erotemic marked this pull request as ready for review November 8, 2020 21:55
@Erotemic
Copy link
Contributor Author

Erotemic commented Nov 8, 2020

I think I was able to figure out the docs well enough. I compiled them and looked at them locally, and I think I handled most of the visual formatting issues. (I encountered more issues where xdoctest could help, but that has to happen at the Sphinx level, so I think the next step is to integrate it there.)

This PR now depends on two others, #4294 and #4326; they are included in the git history here for dashboard purposes. I think they should be pretty easy to review, but this is the more interesting PR, and comments/reviews on it would let me continue working on this while the other PRs go through full review.

I'm marking this PR as ready for review as the dashboards are passing and all components are in a state where I think they could go live.

There is an outstanding question though: based on my ever-growing understanding, embeddings are minors. Should we rename the module to minors/tree_minor.py? How should this integrate with the existing algorithms/minors.py? My thought is that I could make an algorithms/minors package, expose the existing contents via __init__.py, and then put the embedding code under minors.
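
A minimal sketch of that layout, purely to illustrate the proposal (the module split and the embedding function name below are placeholders, not anything merged):

# Hypothetical package layout:
#
#   networkx/algorithms/minors/
#       __init__.py        # re-exports the existing public names
#       contraction.py     # current contents of algorithms/minors.py
#       tree_embedding.py  # the new ordered-subtree-embedding code
#
# networkx/algorithms/minors/__init__.py
from networkx.algorithms.minors.contraction import (
    contracted_edge,
    contracted_nodes,
    quotient_graph,
    # ... and the other existing public names
)
from networkx.algorithms.minors.tree_embedding import (
    maximum_common_ordered_tree_embedding,
)

__all__ = [
    'contracted_edge',
    'contracted_nodes',
    'quotient_graph',
    'maximum_common_ordered_tree_embedding',
]

Existing imports such as networkx.algorithms.minors.quotient_graph would keep working because the package __init__ re-exports the same names.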

Aside from that, I think this PR is in the done state. I don't have any plans to change anything pending review.

Pinging @jarrodmillman @dschult just to make them aware of the status change. I understand this probably won't get fully reviewed/integrated until 2021.

@Erotemic
Copy link
Contributor Author

Erotemic commented Nov 9, 2020

I made a version #4327 where I reorganized the minors subpackage. We can move forward with either this PR or the other one.

@Erotemic
Copy link
Contributor Author

I'm closing this in favor of #4327

I've made a few function name and signature tweaks such that it will just be simpler to work off the new PR than to backport those changes to this branch.

@Erotemic Erotemic closed this Nov 15, 2020