Summary 💡

I have working code. However, I would like to add tracking through symlinks. Git models a symlink as a blob containing the target path, with a mode bit marking the entry as a symlink. This means that if commit 1 creates symlink A pointing to file B, and commit 2 then modifies B, the diff only shows a modification of B, not of A. But I want to treat that as a modification of A for the purpose of deciding whether the user's regex matches the commit.

Further complicating things, branches point to the tip of history, so the natural way to iterate is backwards, but to incrementally build up symlink state from scratch we would need to start at the beginning of the repo's history. So my plan instead is:
The vast majority of commits do not modify symlinks, so the number of distinct hash-table states should be small enough to fit in memory. At the end, every commit would have one of these tables associated with it, so that when I compute diffs I can look up whether any of the changed files have a regex-matching symlink pointing at them.

Questions:
Motivation 🔦

The code base I'm running this on switched, in the middle of its history, from files that are edited in place to a crazy rat's nest of symlinks :(
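For illustration, the table-interning part of the plan above can be sketched in plain Python. All names here are hypothetical and no git library is involved; real code would feed `symlink_changes` from the actual commit walk and drive `logical_changes` from the diff machinery:

```python
def intern(table, seen):
    """Return a canonical shared instance of this symlink table,
    so commits with identical symlink state share one object."""
    key = frozenset(table.items())
    if key not in seen:
        seen[key] = dict(table)
    return seen[key]

def assign_tables(commits):
    """commits: oldest-first list of dicts; each may carry
    'symlink_changes': {symlink_path: target_or_None} (None = deleted).
    Returns {commit_id: shared symlink table}."""
    seen, current, out = {}, {}, {}
    for c in commits:
        changes = c.get("symlink_changes", {})
        if changes:
            current = dict(current)  # copy only when symlinks actually change
            for path, target in changes.items():
                if target is None:
                    current.pop(path, None)
                else:
                    current[path] = target
        out[c["id"]] = intern(current, seen)
    return out

def logical_changes(changed_paths, table):
    """Expand a diff's changed paths with symlinks pointing at them."""
    reverse = {}
    for link, target in table.items():
        reverse.setdefault(target, set()).add(link)
    result = set(changed_paths)
    for p in changed_paths:
        result |= reverse.get(p, set())
    return result
```

With this, the A/B scenario above behaves as desired: after commit 1 creates symlink A -> B, a later commit touching only B expands to a logical change of both A and B, so the regex can be matched against A as well.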
(converted issue to Q&A as I don't think it's actionable as an issue)
Gaining a 10x speedup per core seems to be an indication that the general strategy isn't the worst! That said, an object cache might be useful to avoid having to decode the same object multiple times if you are not using one already.
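To illustrate the effect of an object cache (this is a stdlib sketch of the idea, not gix's cache API): wrap the decode step in an LRU cache keyed by object id, so a hot object is decoded at most once while it stays in the cache.

```python
from functools import lru_cache

def make_cached_decoder(decode, maxsize=4096):
    """Wrap `decode(oid) -> decoded object` in an LRU cache keyed by oid,
    so repeated lookups of the same object skip the decode step."""
    return lru_cache(maxsize=maxsize)(decode)
```

The real win shows up when diffing many commits that share most of their trees, since the same tree objects are looked up over and over.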
As the iterator inherently flattens the commit graph, there doesn't seem to be an easy way to get the control you need without at least dropping down to a lower level. Maybe in doing so you would discover some sort of pattern that could be provided by the base implementation as well, allowing an actual enhancement to the library.
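One way such a lower-level walk could recover the forward, oldest-first order the incremental scheme needs is to collect the commits reachable from the tip and topologically sort them so parents precede children. A toy sketch with the stdlib's `graphlib`, not gix's API:

```python
from graphlib import TopologicalSorter

def forward_order(tip, parents_of):
    """parents_of: {commit: [parent, ...]}. Returns an oldest-first order
    of everything reachable from `tip` (parents before children)."""
    # Walk backwards from the tip, collecting the reachable subgraph.
    seen, stack, graph = set(), [tip], {}
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        graph[c] = parents_of.get(c, [])
        stack.extend(graph[c])
    # TopologicalSorter emits each node after its predecessors (parents here).
    return list(TopologicalSorter(graph).static_order())
```

Memory use is proportional to the number of reachable commits, which is the price of turning a tip-first walk into a forward pass.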