Status of Tree.cache and TreeModifier is unclear #1771

EliahKagan · 2023-12-13T09:22:29Z

Note: This bug seems either mitigated or fixed in #1799, and the writeup here does not reflect those changes.

In b295c13, Tree.cache was documented in CHANGES.rst as removed, based on the discovery in #369 that it could not serve its originally intended purpose. But Tree.cache, and its supporting code such as the TreeModifier class, were not removed. They remain part of the public interface, are not indicated as deprecated, and nothing in any docstrings counsels against their use.

I am unsure, based on this, if there is any reasonable use of them. My guess is that they may have been retained so that code that calls them without relying on the effect would not be broken, but I am not sure. If that is the case, it could be documented, DeprecationWarnings could be raised, and some of the non-public code may be possible to remove.

For example, even if TreeModifier.set_done needs to perform the sort, and do so with exactly the current effect, both the git_cmp and merge_sort functions could probably be eliminated entirely, since they were introduced to do something that was cumbersome (at least at the time, supporting Python 2 and 3) with the list.sort method. But this is no longer the case ever since 4f1110b. If I understand, this change made it no longer sort longer sequences first:

-    return len_b - len_a
+    return len_a - len_b

The text was updated successfully, but these errors were encountered:

Byron · 2023-12-13T10:20:04Z

Thanks for bringing this up.

Since I don't have any knowledge anymore about this implementation except that it is most likely incorrect and should rather not be used for anything serious, what's left to discuss, I presume, is if code should be deprecated so it can eventually be removed.

Maybe it would be worth trying to find it if it is usable for something after all (are there no tests for this?), or one could pre-emptively deprecate it to see if an outcry follows.

In general, I'd always be for removing code, especially if its effectiveness is unclear, but I'd also not want to risk breakage in case we are missing something.

As always, I'd rely on your expertise and analysis, with a strong slant towards following your suggestions.

EliahKagan · 2023-12-13T20:02:34Z

Maybe it would be worth trying to find it if it is usable for something after all (are there no tests for this?)

Currently there are no tests, as they were removed in b295c13, which possibly was also intended to remove the code under test, I am not sure. Some of that test code could potentially be brought back in some capacity (or just experimented with on a feature branch to try and figure out what, if anything, Tree.cache and TreeModifier might be used for). I don't know that they were able to pass: they appear to have been expanded as part of a change made to verify that Tree.cache and TreeModifier couldn't achieve their once hoped-for goal.

This is not necessarily the most important aspect of the situation, but if people are using this, and using it with a lot of data, then a possible benefit of replacing the hand-rolled mergesort implementation with a call to list.sort (which should be feasible without complicated code or big overhead, ever since the behavior of sorting longer names before their prefixes was abandoned) is that list.sort in CPython is Timsort, which is highly adaptive: sorting data that are already almost sorted takes only linear time. The overhead of re-sorting seems to have been a concern in guiding usage:

GitPython/git/objects/tree.py

Lines 126 to 135 in f55d194

    
               def set_done(self) -> "TreeModifier": 
        
                   """Call this method once you are done modifying the tree information. 
        
                   This may be called several times, but be aware that each call will cause 
        
                   a sort operation. 
        
                   :return self: 
        
                   """ 
        
                   merge_sort(self._cache, git_cmp) 
        
                   return self

Anyway, I don't propose that any such change be rushed into.

EliahKagan · 2024-01-24T05:31:58Z

In b295c13, this was added to the changelog:

CRITICAL: Tree.cache was removed without replacement. It is technically impossible to change individual trees and expect their serialization results to be consistent with what git expects. Instead, use the IndexFile facilities to adjust the content of the staging area, and write it out to the respective tree objects using IndexFile.write_tree() instead.

That commit was referenced in closing #369. Recently, work picked back up on this in #1799, which changed the sorting to use a custom key selector function that performs the comparison in a different way. Since then, this issue is out of date, but I am unsure if it is fully fixed.

I think the the point for this issue, and for improving documentation of TreeModifier and what advice should be given about whether and when to use it, is:

It is technically impossible to change individual trees and expect their serialization results to be consistent with what git expects.

What is the reason that was the case, and is it not still the case?

Byron · 2024-01-24T07:50:43Z

Unfortunately, I don't know and assume the git commit log also doesn't clarify it. I don't believe it's impossible either, but also don't believe that these facilities should be used - I think they can't be trusted and using them may proliferate subtle or (hopefully) less subtle breakage in git data structures when written to the object database.

If they were to be fixed, I'd probably focus on what GitPython was originally meant to be, that is a wrapper around the git executable, and implement all slightly more complex functionality by invoking git instead. That way it's probably slower particularly for many small trees, but at least it will definitely be correct.

With all that said, it's probably still unclear what to do here, apologies :/.

et-repositories · 2024-03-15T09:58:39Z

I've read the previous codes and docs before 1.0.2 - Fixes
But I'm still wondering what usage you expect originally meant to be in TreeModifier
mayberepo.head.commit.tree[0].cache.add(sha, mode, name)
or repo.head.commit.tree[0].cache.add([file_name]) just like repo.index.add([file_name]) but it's in subtree
can you provide a simple usage you expect originally ?

Byron · 2024-03-15T17:54:20Z

I am sorry, my memory has eroded over the last decade. However, the code in question should have tests which might make matters clearer. If there are no tests, then I think it's best to deprecate and discourage usage entirely, as I also think it's not really worth spending time on.

Byron added the acknowledged label Dec 13, 2023

EliahKagan mentioned this issue Jan 12, 2024

fix: treeNotSorted issue #1799

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Status of Tree.cache and TreeModifier is unclear #1771

Status of Tree.cache and TreeModifier is unclear #1771

EliahKagan commented Dec 13, 2023 •

edited

Byron commented Dec 13, 2023

EliahKagan commented Dec 13, 2023

EliahKagan commented Jan 24, 2024

Byron commented Jan 24, 2024

et-repositories commented Mar 15, 2024

Byron commented Mar 15, 2024

Status of Tree.cache and TreeModifier is unclear #1771

Status of Tree.cache and TreeModifier is unclear #1771

Comments

EliahKagan commented Dec 13, 2023 • edited

Byron commented Dec 13, 2023

EliahKagan commented Dec 13, 2023

EliahKagan commented Jan 24, 2024

Byron commented Jan 24, 2024

et-repositories commented Mar 15, 2024

Byron commented Mar 15, 2024

EliahKagan commented Dec 13, 2023 •

edited