Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build perf is negatively impacted #114

Closed
3 of 6 tasks
AArnott opened this issue Mar 14, 2017 · 36 comments
Closed
3 of 6 tasks

Build perf is negatively impacted #114

AArnott opened this issue Mar 14, 2017 · 36 comments
Assignees

Comments

@AArnott
Copy link
Collaborator

AArnott commented Mar 14, 2017

Running PerfView during the build of the PInvoke project, I see that almost a full second goes to the GetBuildVersion task, which seems to be doing some repeat work perhaps within the same task execution, and certainly across projects. Can we cache the results across projects?

Name                                                                                                                                                            	Inc %	   Inc
 nerdbank.gitversioning.tasks!GetBuildVersion.Execute                                                                                                           	  2.2	   939
+ nerdbank.gitversioning!VersionOracle.Create                                                                                                                   	  2.2	   935
|+ nerdbank.gitversioning!Nerdbank.GitVersioning.VersionOracle..ctor(class System.String,class LibGit2Sharp.Repository,class Nerdbank.GitVersioning.ICloudBuild)	  2.2	   919
||+ nerdbank.gitversioning!GitExtensions.GetVersionHeight                                                                                                       	  0.9	   381
||+ nerdbank.gitversioning!GitExtensions.GetIdAsVersion                                                                                                         	  0.8	   340
||+ nerdbank.gitversioning!VersionFile.GetVersion                                                                                                               	  0.4	   181

Opportunities:

  • Add an incremental build aspect to the GetBuildVersion target where the results are saved to a file with the cache key being the tuple: version.json timestamp (and path?) and commit id. (@japj's idea)
  • Cache the height for one or more commits so even if the commit advances, we quickly find a commit whose height was predetermined in the cache file and cut off the potentially long git history traversal (@djluck's idea)
  • Invoke GetBuildVersion MSBuild task only once per project (or per repo) #508 Cut down on redundant GetBuildVersion executions per-project that @djluck identified, possibly by using the MSBuild task call to invoke another target but with a reduced set of global properties to improve target skipping around P2P and TargetFramework invocations.
  • Invoke GetBuildVersion MSBuild task only once per project (or per repo) #508 Build on the shared MSBuild target invocation within a single project to share it across projects based on a project property that specifies the directory path for which version.json will be searched for and used as the 'cache key' for reuse. This key can be a global property in the MSBuild task invocation to automatically cache at the right level, and this same property can serve as the input for the version.json search.
  • Perf: Cache JSON parse results while computing height #507 Never parse the same version.json content twice
  • Tell libgit2sharp to diff without considering file renames per @djluck's comment.
@AArnott
Copy link
Collaborator Author

AArnott commented Mar 17, 2017

Consider caching the results across projects using this technique.

AArnott added a commit that referenced this issue Jun 19, 2017
Partial performance improvement for #114
@KirillOsenkov
Copy link
Member

Yes, we're seeing perf issues as well:

image

Average invocation of the Nerdbank.GitVersioning.Tasks.AssemblyVersionInfo task is ~550ms for us.

@japj
Copy link
Contributor

japj commented Feb 3, 2019

What is the current state of this? (Haven’t done any performance analysis on our build yet but I am interested in “current” statistics + insights in if anything still needs to be done here.

@AArnott
Copy link
Collaborator Author

AArnott commented Feb 3, 2019

I've spent some time looking. The problem with caching across tasks is that each expensive result is necessarily keyed by the path to the project directory. Since each project has a different project directory, the result could be different and thus needs to be recalculated.

That leaves us with two options to improve perf:

  1. Allow large repos to "opt in" to assuming that all projects will share the same result and thus benefit from caching. That would likely drop the time to effectively near 0.
  2. Find inefficiencies in the git height walk or related code and improve the perf.

@japj
Copy link
Contributor

japj commented Feb 9, 2019

It looks like there are actually 2 performance areas: one related to GetBuildVersion and one with AssemblyVersionInfo.

If I understand correctly, the assembly versioninfofile is generated each time and only copied when it is different.
Perhaps the actual generation process is also too expensive? And hashing the settings and comparing them for differences before actual generation is useful?

@japj
Copy link
Contributor

japj commented Feb 9, 2019

Note that would be useful only to incremental builds of course.
Doing the code generation in a different way might result in another oppertunity for wins

@japj
Copy link
Contributor

japj commented Feb 19, 2019

I have a branch where I have added 2 additional AssemblyInfoTests: CSharpGenerator and CSharpCodeDomGenerator
https://github.com/japj/Nerdbank.GitVersioning/tree/AssemblyVersionInfoPerformanceTest

So far it seems (probably expected) the CodeDomGenerator codepath in AssemblyInfo is "very" expensive.
I also noticed that running on netcoreapp2.0 also gives a speed boost compared to net461.
(with the limited testing I have been able to do so far)

CSharpCodeDomGenerator (net461) ~160ms
CSharpGenerator (net461) ~50ms
CSharpGenerator (netcoreapp2.0) ~19ms

I have not looked into it too much, but is the CodeDomGenerator code path still needed feature wise?
(is there a historical reason this is still in there?)

@AArnott
Copy link
Collaborator Author

AArnott commented Feb 20, 2019

It was historically there because it meant NB.GV could support any .NET language for which a CodeDomProvider exists. Then we added support for a language or two that didn't have a CodeDomProvider, oh and .NET Core didn't offer them, so we had backup code paths for when they weren't available.
At this point, I don't mind if we simply remove the feature (or switch it to the fallback path instead of the preferred path) for better perf. I had no idea what perf hit it was incurring.

Would you care to send a PR, @japj ?

@AArnott AArnott self-assigned this Feb 28, 2019
AArnott added a commit that referenced this issue Feb 28, 2019
CodeDOM is *very* slow, it turns out.

Addresses on eof the perf issues reported in #114
@AArnott
Copy link
Collaborator Author

AArnott commented Feb 28, 2019

Please check out v2.3.125 which has the code DOM provider replaced with our homemade code generator to address the bulk of the perf problem. Let me know how much of a problem remains. :)

@japj
Copy link
Contributor

japj commented Feb 28, 2019

@AArnott Thanks for the great work, I got stuck in other stuff at work so did not yet have any time to work on it myself. Will definitely try out soon to verify impact

@japj
Copy link
Contributor

japj commented Mar 4, 2019

@AArnott I don't see AssemblyVersionInfo in the top 10 most expensive tasks anymore 👍
I still see GetBuildVersion in there though, perhaps the work in #223 can somehow be useful related to the caching of results.

@AArnott AArnott closed this as completed Mar 28, 2020
@azeno
Copy link
Contributor

azeno commented Jun 3, 2020

@AArnott Any chance we can get your initial suggestion to "opt-in" all projects sharing the same result? The GetBuildVersion task is still a major build performance bottleneck for (37 projects, most of them also generating a nuget):

image

GetBuildVersion is invoked 65 times taking ~400ms per invocation.

Or would there be a solution/project setup where one gets this kind of caching?

@AArnott AArnott reopened this Jun 5, 2020
@djluck
Copy link

djluck commented Aug 5, 2020

Just to add some observations to this issue, we are also seeing a large perf hit due to the Nerdbank.GitVersioning package. The excellent functionality is worth taking the perf hit for but I've spent a bit of time trying to understand how we could improve performance and found 2 key areas:

  1. Caching GetBuildVersion target output
    We have ~50 packages that we publish, with many packages taking dependencies on the other published packages. When running dotnet pack [our_solution].sln, I see that the $(BuildVersion) property generated by GetBuildVersion is not cached between project builds- this means that the version for each project is recalculated multiple times (once when the project is packaged and then once for every PackageReference that references the project).
    I'd estimate this is our biggest performance hit- ensuring that the version is only calculated once per-project would massively improve performance.

  2. Caching VersionFile.GetVersions
    Bit of background information- our git repository is very active, seeing ~1,000 commits per week and we don't often update the major/ minor version in our version.json files. In this case, VersionFile.GetVersion must be called once per-commit and per commit must:

  • Search for any version.json file(s) present
  • Read the contents from the git blob
  • Deserialize the contents

This work adds up over thousands of commits and was taking up a large proportion of runtime. I have trialed a fix in a fork where I cache the VersionOptions returned by VersionFile.GetVersion, although it needs some work before I'm ready to submit a PR.
My approach involves checking for version.json files at all possible locations in the git tree and if the SHAs have not changed, use the cached version. This supports our workflow well (SHAs won't change often as we don't update version.json much and our directory structure isn't particularly deep) but I'm not sure if it would be an acceptable general solution.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 5, 2020

Thanks for your feedback and efforts, @djluck.

MSBuild implicitly caches each target's execution, so GetBuildVersion only runs once per (project X global properties) (roughly). So if you see one project's GetBuildVersion target run more than once, you have an overbuild problem that's hurting you in more ways than just NB.GV, although NB.GV I can imagine being the most egregious time loss in your scenario. I would suggest you run your build with the /bl switch and then open the msbuild.binlog file with this tool and inspect the global properties for each invocation of a given project to see if indeed it's being invoked many times with many different global properties.

Once you get each project to only execute the GetBuildVersion target once, no caching within NB.GV for a project is necessary since MSBuild has already done it.

Sharing a cache between projects requires care because each project has a different path and thus may get different results in traversing the directory tree upward looking for version.json. Also a large set of projects (should) build with multiple processes, so even if we did share some data via a static variable such that it could avoid redundant work between projects, it would still only assist the projects that build in that particular msbuild process.
This may still be a promising avenue to explore, but I have in mind an even more impactful one, that will apply to some repos...

The change I'm planning for the largest repos, where we can assume there is just one version.json file at the root, is to run a GetBuildVersion-like step just once for the whole repo. The data from it would then be shared with all other projects in the repo so that they never have to execute their own GetBuildVersion step at all, bringing the NB.GV time in your build to a very small cost that only varies by the number of commits since the last version bump, instead of multiplying that time by the number of projects.
Would such an optimization apply to your case, @djluck?

@djluck
Copy link

djluck commented Aug 5, 2020

So if you see one project's GetBuildVersion target run more than once, you have an overbuild problem that's hurting you in more ways than just NB.GV

Thanks for the hint, I'll look into this using the tool you linked. My MsBuild knowledge is fairly incomplete but I wondered if properties like $BuildVersion were not being shared across parallelized msbuild instances.

The change I'm planning for the largest repos, where we can assume there is just one version.json file at the root, is to run a GetBuildVersion-like step just once for the whole repo.

Unfortunately this wouldn't help us (sadface)- we do have a version.json at the root of the project but we also have a version.json within each project we publish. Our repo is a monorepo where individual projects could introduce new features/ breaking changes at different times to other projects, so we need to be able to control the major + minor version at the project level.
I'll try and clean up and push my change today (even if in incomplete) along with some perfview analysis, hopefully this will help further any optimization efforts.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 6, 2020

but we also have a version.json within each project we publish

Then I think the best optimization that can apply for you would be to ensure the target only executes once per project as I mentioned.
of course if you find ways to simply make nbgv faster, that's great. And I'll be happy to look at any caching ideas you have. But I can't promise to take the PRs, particularly if it will compromise accuracy.

@japj
Copy link
Contributor

japj commented Aug 6, 2020

I just realized that in VisualStudio the GenerateAssemblyVersionInfo might actually run twice (once during design time build and once during a regular build). Maybe that is also something worthwhile to look into?

As for an incremental version of GetBuildVersion. Would it be enough to look at time stamps of version.json and current git commit (for inputs)?

@djluck
Copy link

djluck commented Aug 6, 2020

@AArnott thank you for the tip on using MSbuild structured log viewer- it makes understanding what MSBuild is doing so much easier.

So, coming back to this point:

Once you get each project to only execute the GetBuildVersion target once, no caching within NB.GV for a project is necessary since MSBuild has already done it.

I found out that GetBuildVersion was called four times per project being packed. I've included why they were being called and the global properties that differed:

  • Once per TargetFrameworks value as a child of the GenerateAssemblyVersionInfo target (we target two runtime in each project, TargetFramework was the differing global property)
  • Once per _GetProjectVersion (seems to be used to determine the version of any project references that would be exposed as packages, BuildProjectReferences=false was the differing global property)
  • Once per GenerateNuspec (determines the version of the package to pack, no additional global properties)

So I was certainly wrong in stating that GetBuildVersion was being called for every dependent project but 4 invocations per-project being dotnet pack'd does seem a waste, especially considering that the more frameworks you target the more invocations you'll see. Each call of GetBuildVersion takes about 15 seconds so it's certainly something that would be ideal to eliminate. I don't think there's anything I can do here to eliminate the additional calls via MsBuild as each differing global property seems appropriate to me.

Also more thoughts on this comment:

The change I'm planning for the largest repos, where we can assume there is just one version.json file at the root, is to run a GetBuildVersion-like step just once for the whole repo.

I think perhaps configurable behaviour for how version.json files are located would definitely help here- e.g. WalkDirectory, TopDirectoryOnly, RootAndLeaf, etc. Default could be WalkDirectory but then people would be free to pick a more optimal setting for their project.

@djluck
Copy link

djluck commented Aug 6, 2020

I'd be interested on your thoughts on this idea around caching the git height:

As the majority of time during the GetBuildVersion task is spent calculating the git height for the current commit, would it be possible to cache the git height for a given commit in a file (e.g. version-height.txt) per-project? e.g:

// commit SHA mapping to git height
ae85820d0g = 11

When GetBuildVersion runs, it can then check version-height.txt for a project- if the SHA of the head of the repository equals the SHA cached in the file, then it could use the cached value. If not, it could re-calculate the git height and overwrite the version-height.txt value.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 6, 2020

@japj said:

I just realized that in VisualStudio the GenerateAssemblyVersionInfo might actually run twice (once during design time build and once during a regular build). Maybe that is also something worthwhile to look into?

Running in design-time is important so the language service realizes that the ThisAssembly class exists. Running in the build is of course important because there's no guarantee that the design-time build ran earlier or that it's still current.

As for an incremental version of GetBuildVersion. Would it be enough to look at time stamps of version.json and current git commit (for inputs)?

That seems like a reasonable optimization. I guess we'd cache the version.json timestamp and commit in an intermediate file so we can read it later and compare with current values, right? I'd accept a PR for that if you're interested in preparing it.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 6, 2020

@djluck Good investigation on multiple executions of the GetBuildVersion target and the global properties to explain it. And good idea about caching the git height in a way that still works even after HEAD moves on. I've collected all these ideas as a checklist in the issue description. I think we have enough good ideas to move forward on some very impactful improvements.

What can I assign out to people? Are folks interested in taking any of these on?

@djluck
Copy link

djluck commented Aug 6, 2020

I'd definitely be interested in the caching of the height of commits, although I don't think I'll be able to put out a PR for at least a week or so.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 7, 2020

Great. Maybe write up a proposal for where the file would be, how many commits you would cache and how, Etc.

@filipnavara
Copy link
Member

Just to weight in a little bit. We are suffering with this pretty heavily on our incremental builds. We have around 35 projects in a solution that all use the same shared version.js file in the repository root. For each project the calculation of the version number takes something like 2 to 4 seconds (SSD, current gen AMD hardware). Even if the project builds run in parallel this adds up really quickly and makes the builds take several minutes. MSBuild log analyzer shows that the Nerdbank.GitVersioning.Tasks.GetBuildVersion task took 6:56.798. This doesn't account for the fact that this time is split over 8 cores in parallel but it's still 25 times more than the next biggest offending task.

Out of the ideas listed in the top post these ones would benefit us the most:

  • Add an incremental build aspect to the GetBuildVersion target where the results are saved to a file with the cache key being the tuple: version.json timestamp and path, and commit id.
  • Reuse the cache above when doing the height lookup. In most cases the traversal through the commit history will be linear and will hit the previously know commit id in few steps.

There are some other fundamental Git concepts that could help with speeding up the actual height calculation. Notably Git now has a concept of commit graph files and, more recently, bloom filters for quick checking of changed paths on per-commit basis. This allows really efficient history lookup for the purpose of calculating distance between two commits. In the commit graph files it is even expressed as generation number and thus the lookup can be done in constant time (with no path filter), or with very little I/O (with path filter and using the bloom filters to avoid accessing object storage). I doubt that Libgit2 supports it at the moment and it could be well outside the scope of this project but it's not difficult to implement support for reading these commit graph files in C# if it proves to be beneficial. I've written an implementation in Go before and it was pretty straightforward.

For anyone wanting to look into real-world data I am willing to privately share the MSBuild binary logs from our project.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 19, 2020

@filipnavara That's very interesting. I'd never heard of the commit graph files. I'll have to learn more about them.
6 minutes to calculate GetBuildVersion across a repo is too much for an incremental build, to be sure. I'll try to find time in the next week to apply one or more of these optimizations.

@japj
Copy link
Contributor

japj commented Aug 19, 2020

I believe the git commit graph was originally developed my people from Microsoft, but it’ll depend on having a supported client to generate the right files.

See also for more background at https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph/

@filipnavara
Copy link
Member

@japj Correct, I can only recommend reading the blog post series, it gives quite an insight about the design. Derrick moved to the GitHub team now and the bloom filter implementation was finished by someone else (although the theory in the last blog post of the series still holds for the official implementation).

The commit graph support itself is in the official GIT client for few versions now and it can be setup to automatically update the commit graph file on various operations. It may require some initial setup but the potential gain in performance is often worth it. Things like commit height, or calculating git log and git blame can get noticably faster.

It is an optimization that is orthogonal to any caching employed by GitVersioning but it's something that may be worth exploring for certain types of scenarios where simple caching is not doing the job.

@AArnott Thanks for looking into it!

@djluck
Copy link

djluck commented Aug 24, 2020

@AArnott I spent a bit of time on Friday trying to implement the caching of git height + commit id tuples. My approach was to add caching behaviour into GitWalkTracker- where on object construction any cached height can be loaded and on disposal (I made it implement IDisposable) the last recorded height would be persisted to disk. However this ran into some issues with overloads of GitExtensions.GetHeight that accept a lambda parameter to filter on if a commit should contribute to the height of a release- I wonder in this circumstances if we could ever rely on the cached height of a commit.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 24, 2020

I wonder in this circumstances if we could ever rely on the cached height of a commit.

I may not understand what you tried. But caching the height keyed on the commit ID and project directory should always be safe because the height of a given commit within a given directory can never change as git history is immutable, right?
That said, I don't think I would cache anything within the GitVersioning library itself. I think I'd put the caching in the msbuild task itself.

@djluck
Copy link

djluck commented Aug 24, 2020

That said, I don't think I would cache anything within the GitVersioning library itself. I think I'd put the caching in the msbuild task itself.

I think this was probably my mistake. I'll take a look down this avenue.

@AArnott
Copy link
Collaborator Author

AArnott commented Aug 25, 2020

OK, I'll hold off on any caching work for a while or till I hear back from you, @djluck so we don't solve the same problems or create bad merge conflicts.

@djluck
Copy link

djluck commented Aug 27, 2020

@AArnott PR is available here whenever you have the time to review: #499

@AArnott
Copy link
Collaborator Author

AArnott commented Sep 4, 2020

I'm preparing a change that should bring the number of GetBuildVersion invocations from (projects * targetframeworks) to just (projects) (which should help @djluck) and down to exactly 1 when there is just one version.json file in the repo and the repo 'opts in' to sharing version info across projects.

@AArnott
Copy link
Collaborator Author

AArnott commented Sep 4, 2020

As you can see on the checklist in the issue description, some substantial improvements have been made today. Most repos can get down to just one GetBuildVersion invocation (with very little work) and that will probably entirely solve their perf problems.

There is still value in the caching idea that @djluck has piloted in #499, but we should probably focus it on improving perf in the default scenario (not requiring any opt in or behavior changes) to maximize the benefit to the remainder of repos that don't want to take any opt-in effort to a faster approach.

@filipnavara
Copy link
Member

I tried version 3.3.22-alpha with the added GitVersionBaseDirectory property and I no longer see the GetBuildVersion task in the MSBuild performance traces at all; neither for the full rebuild nor for the incremental one. There are two invocations of GetBuildVersionCore that together consume around 126ms of time:

image

@AArnott
Copy link
Collaborator Author

AArnott commented Dec 19, 2020

We have made many fixes in this area over the last few releases, including the 3.4-beta.
We do have more ideas (listed in the issue description) to make it faster. But to focus on the most impactful work, I'm closing this issue until/unless we hear continued feedback that we need more fixes.

@AArnott AArnott closed this as completed Dec 19, 2020
AArnott pushed a commit that referenced this issue Sep 5, 2022
Bumps [Microsoft.NET.Test.Sdk](https://github.com/microsoft/vstest) from 16.10.0 to 16.11.0.
- [Release notes](https://github.com/microsoft/vstest/releases)
- [Commits](microsoft/vstest@v16.10.0...v16.11.0)

---
updated-dependencies:
- dependency-name: Microsoft.NET.Test.Sdk
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants