Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make separate sub-tree-view vs just-git-tree classes or git.Tree runtime modes #1505

Open
castedo opened this issue Oct 28, 2022 · 11 comments
Open
Assignees

Comments

@castedo
Copy link

castedo commented Oct 28, 2022

This enhancement proposal is motivated by confusion in the behavior of subtrees. Issue #851 is one example. This stackoverflow question is another example.

As further example consider the following bash+python script behavior:

mkdir testrepo
cd testrepo
git init
mkdir 1
mkdir 2
echo hi > 1/hi.txt
cp 1/hi.txt 2/.
git add .
git commit -m 'test'

and then

import git

repo = git.Repo()
one = repo.head.commit.tree['1']
two = repo.head.commit.tree['2']
print("Same object?", one is two)
print("Same Python hash?", hash(one) == hash(two))
print("Same git hash?", one.hexsha == two.hexsha)
print("Equal?", one == two)
print("Same paths?", one.path == two.path)
print("hi.txt in one and two?", "hi.txt" in one, "hi.txt" in two)
print("1/hi.txt in one and two?", "1/hi.txt" in one, "1/hi.txt" in two)
print("2/hi.txt in one and two?", "2/hi.txt" in one, "2/hi.txt" in two)

which prints out:

Same object? False
Same Python hash? False
Same git hash? True
Equal? True
Same paths? False
hi.txt in one and two? False False
1/hi.txt in one and two? True False
2/hi.txt in one and two? False True

What I think most people would expect is that "hi.txt" is contained in both one and two and 1/hi.txt and 2/hi.txt are contained in neither one nor two. This is essentially the same issue noted in #851.

I think most people would also not expect these Tree objects to be ==-equal but then also have a property path that has different values.

It seems to me the fundamental problem is that the Tree class is trying to behave like two things that are different things:

  • a git tree, in the git object database
  • a subdirectory/subtree in a git index

In the sample above, one and two are the same git tree but they are not the same subdirectory added to a git index. A git tree does not have a path but a subdirectory added to a git index does. The current __contain__ logic is kind-of-OK if the objects are subdirectories added to a git index, but make no sense for a git tree.

Due to backwards compatibility, this seems a challenging issue to address. I don't know the code and backwards compatibility situation well enough to make any recommendations. But I'm happy to help. I can suggest some different approaches, but I figure it's best to seek your @Byron input on this first.

@Byron
Copy link
Member

Byron commented Oct 28, 2022

Thanks a lot for the detailed description of the problem and for making it reproducible, it's much appreciated.

My stance since the comment in #851 didn't change and improving the situation is appreciated.

In the linked comment I called it a bug which would allow a fix without a major version bump, and probably that's something to review as I now tend to play it safe and put a fix behind a major version bump. Maybe you have some ideas about this as well that you would like to share to flesh out the path forward a little more.

@castedo
Copy link
Author

castedo commented Oct 28, 2022

Here's some rough thoughts I can throw out here.

One decision fork is whether to represent git-tree-in-git-db and subdir-in-git-index as separate classes or some binary runtime state that git.Tree can be in. I'm guessing here that keeping git.Tree as one class, not introducing more classes, and just letting git.Tree flip between two possible runtime types is the more practical path given our reality. So I'll proceed assuming this. But maybe this is the wrong approach so feel free to question this assumption.

Assuming dual runtime state for git.Tree is the way to go, one idea that comes to mind is whether git.Tree.path == None might be convenient way to represent at runtime that the git.Tree is really strictly a git-tree-in-git-db and not a subdir-in-git-index.

An additional idea is that there be a git.Tree.detach method which creates a new strict git-tree-in-git-db from a mildly ambiguous git.Tree instance. This feels like it could be a safe way to not break any backwards compat. With this approach I think the fix could be that in my example above a users must do

one = repo.head.commit.tree.detach('1')
two = repo.head.commit.tree.detach('2')

rather than use [] and then one and two will be "pure" git-tree-in-git-db objects and everything works as expected.

Lots of possible directions and further implications. But I'll leave you with this idea. Let me know what you think. I'm mostly just brainstorming here.

@Byron
Copy link
Member

Byron commented Oct 29, 2022

Thanks a lot for sharing!

In gitoxide, accessing an entry of a tree yields exactly that, an Entry, which provides access to the 'git-object-in-db', it's name and its file mode.

I love the idea of adding another way of accessing tree objects (or entries) for that matter without breaking what's there necessarily.

one = repo.head.commit.tree.entry('1')  # detach -> entry
two = repo.head.commit.tree.entry('2')

Maybe the above can be sugared into

one = repo.head.commit.tree('1') # difference between ['1'] and ('1')
two = repo.head.commit.tree('2')

if that's possible in python at all (which it might). And even if it was possible, maybe tree.entry('1') is the more explicit choice.

What do you think?

@castedo
Copy link
Author

castedo commented Oct 29, 2022

Good point about the name and file mode in the git db. That makes more sense to return that extra entry information in addition to the git db tree/blob.

You can achieve repo.head.commit.tree('1') by overridding a __call__ method. But to me it seems better to take the more explicit choice. At least in the context of Python this feels like it should be more explicit and readable whereas use of __call__ would be unexpected/intuitive in Python.

entry seems a good method name.

So overall, sounds like a good idea. Are you thinking a new class? something like git.objects.Entry or git.objects.____.Entry?

And then within Entry would there be a method that returns either a git.Blob or git.Tree that would be in this "in-git-db-not-index" mode with path == None ?

@Byron
Copy link
Member

Byron commented Oct 30, 2022

Good to hear that tree.entry(<path>) seems intuitive, let's go with that then.

So overall, sounds like a good idea. Are you thinking a new class? something like git.objects.Entry or git.objects.____.Entry?

I'd think of it as git.objects.tree.Entry, in any case it must be clear that it's an entry of a Tree. And as for how to achieve that, I would love to leave the decision to you as I don't know what's idiomatic anymore.

And then within Entry would there be a method that returns either a git.Blob or git.Tree that would be in this "in-git-db-not-index" mode with path == None ?

I don't think I have a clear answer, but rather some intuition on how I'd like to use such object.

a = repo.head.commit.tree.entry('1/a')  # access to any entry with implicit tree traversal
a.name == "a" # true
a.id == <hash> # I don't know what the object id type is in GitPython, but probably that :D.
a.mode == 0o<octet> # probably a 32 bit number, I think that's used in GitPython already. Could be something that's easier to use as well if needed if that's already present in GitPython

# now, `a` could have all kinds of utility methods, the most obvious might be…
a.object() # returns the object obtained with `.id`

# possibly, it could support returning its own `path()` being `1/a`, but that would require it to keep an additional field to know it's parent path, and I feel it's not needed as the caller already knows that.

Looking at it, the above example does not try to identify an entry by its repository-relative path anymore, maybe there are some desirable properties that are missing.

Maybe you can take the above and put it into a shape you find more acceptable? I am not past ideation here, and more options and suggestions would definitely be welcome.

@castedo
Copy link
Author

castedo commented Oct 30, 2022

I'd think of it as git.objects.tree.Entry

👍

a = repo.head.commit.tree.entry('1/a')  # access to any entry with implicit tree traversal
a.name == "a" # true
a.mode == 0o<octet> # probably a 32 bit number, I think that's used in GitPython already. Could be something that's easier to use as well if needed if that's already present in GitPython

👍

@castedo
Copy link
Author

castedo commented Oct 30, 2022

# now, `a` could have all kinds of utility methods, the most obvious might be…
a.object() # returns the object obtained with `.id`
# possibly, it could support returning its own `path()` being `1/a`, but that would require it to keep an additional field to know it's parent path, and I feel it's not needed as the caller already knows that.

Regarding a path() method in Entry, I think this depends on whether you want the following to be true or false.

repo.head.commit.tree.entry('1/hi.txt') == repo.head.commit.tree.entry('2/hi.txt')

with the example at the start of this gitub issue. If it is true then they should not return a path. If it is false then Entry should provide different paths.

I don't have a strong opinion one way or the other on Entry.__eq__ and Entry.path. Probably more inclined to make them equal and not have path since that sounds simpler and easier to implement.

I have more of an opinion on Entry.object().__eq__ and Entry.object().path() which I'll write about shortly.

@castedo
Copy link
Author

castedo commented Oct 30, 2022

To accomplish the objective of this git issue/enhancement, my thought is that the following runs without assert fails (given the example at the start of this github issue/enhancement):

old_one = repo.head.commit.tree['1']
old_two = repo.head.commit.tree['2']
new_one = repo.head.commit.tree.entry('1').object()
new_two = repo.head.commit.tree.entry('2').object()
assert type(old_one) == git.Tree
assert type(old_two) == git.Tree
assert type(new_one) == git.Tree
assert type(new_two) == git.Tree
assert old_one == new_one
assert old_two == new_two
assert new_one == new_two
assert old_one.path == '1'
assert old_two.path == '2'
assert old_one.path != old_two.path
assert new_one.path == new_two.path
assert bool(new_one.path) == False
assert bool(new_two.path) == False

I don't have a good sense of whether it should be new_one.path is None or new_one.path == "".

To me it makes sense that there should not be a path attribute in git.Tree at all. The object represents a tree in the git db and equality is based on the git hash. Path takes no part in this so to me it should not be part of that object.

But for backwards compatibility I assume path must be retained and behave as it has been. I'm guessing that at the moment path never returns None. So perhaps it is safest to have it return "". On the other hand, maybe it is useful for path == None to indicate that a git.Tree instance is truly a "pure" git tree and does not correspond to a subdirectory/subtree inside an index. Then a "pure" git.Tree can have _[_] behave the same as _.entry(_).object() and perhaps in the distant future the default is for get.Tree instances to be "pure".

@castedo
Copy link
Author

castedo commented Oct 30, 2022

Also, I'm not so sure a_tree.entry(p) should work for p where p not in a_tree. If asked "How many entries are in the root tree?" I would say two. What are they? They are "1" and "2" and that the blob for hi.txt is not an entry in the root tree but rather is an entry in a subtree.

@Byron
Copy link
Member

Byron commented Oct 30, 2022

I don't have a strong opinion one way or the other on Entry.eq and Entry.path. Probably more inclined to make them equal and not have path since that sounds simpler and easier to implement.

Great catch, it's definitely an argument for not having path() in Entry, and I'd be happy to keep it simpler as well.

But for backwards compatibility I assume path must be retained and behave as it has been. I'm guessing that at the moment path never returns None. So perhaps it is safest to have it return "". On the other hand, maybe it is useful for path == None to indicate that a git.Tree instance is truly a "pure" git tree and does not correspond to a subdirectory/subtree inside an index. Then a "pure" git.Tree can have [] behave the same as .entry().object() and perhaps in the distant future the default is for get.Tree instances to be "pure".

I love the idea of letting a Tree created with _.entry.object() return None for .path to make it pure, and to make clear why comparisons use a hash and not the path. Existing code wouldn't break, and new code will make more sense thanks to tree.entry(_).

Also, I'm not so sure a_tree.entry(p) should work for p where p not in a_tree. If asked "How many entries are in the root tree?" I would say two. What are they? They are "1" and "2" and that the blob for hi.txt is not an entry in the root tree but rather is an entry in a subtree.

And… I love this idea too - it's not making anything up and instead 'just' exposes the tree in the simplest possible way. Yes, let's not support resolution of paths (i.e. paths with /) with .entry(p). Instead, and only if there is time, there can be another method like lookup_entry(p) which supports slashes.

Note that I would refrain from deprecating the current API for this as I hope there won't be another major release before gitoxide's python bindings replace GitPython in one fell swoop.

@castedo
Copy link
Author

castedo commented Oct 30, 2022

Note that I would refrain from deprecating the current API for this as I hope there won't be another major release before gitoxide's python bindings replace GitPython in one fell swoop.

OK. Given the impressive work you're doing on gitoxide, I can see the reason to freeze this API and let it be replaced by a new backwards incompatible replacement based on gitoxide.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants