Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize RAR's GetReferenceItems method #5929

Merged
merged 23 commits into from Feb 6, 2021
Merged

Conversation

Forgind
Copy link
Member

@Forgind Forgind commented Dec 2, 2020

This change should not alter behavior other than changing casing on metadata (that should be case-insensitive), but it includes a series of changes (see individual commits) that simplify the code, correct typos, or prevent MSBuild from having to do unnecessary work. (Note that this means that most of the commits are expected to have no relevant impact on perf despite the title of this PR.) From building OrchardCore once with and once without this change, RAR's execution time declined by about 2.5% when there were no caches. This part of RAR does not rely on the caches, so this gain will not be wiped out by including caches.

Changes "true" to "True" and "false" to "False", but that's ok because it should be case-insensitive.
This means we don't have to repeatedly unset it.
…emoved once

nonForwardableMetadata?.Remove(ItemMetadataNames.*) is used to prevent the correct metadata value from being overwritten when metadata are re-added from nonForwardableMetadata.
CopyOnWriteDictionary<string> copiedMetadata = _metadata.Clone(); // Copy on write!
foreach (KeyValuePair<string, string> entry in destinationAsTaskItem.Metadata)
{
copiedMetadata[entry.Key] = entry.Value;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify the purpose of this change? Is this to optimistically optimize the case where you are copying metadata values that exactly match the destination (in value and count) - in this case the foreach here will not cause a "copy" of the dictionary because it will just be writing the same values? If that's the case, why not destinationAsTaskItem.Metadata?.Count == _metadata?.Count?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was previously an optimization for if destinationAsTaskItem.Metadata was null, i.e., its count was 0, but there are also cases in which its count is small compared to _metadata?.Count. This is trying to improve that case, where there are a few metadata defined for destinationAsTaskItem but not as many as for the source. In that case, it's better to modify a clone of the source rather than modifying the destination directly.

CopyOnWriteDictionary<string> copiedMetadata = _metadata.Clone(); // Copy on write!
foreach (KeyValuePair<string, string> entry in destinationAsTaskItem.Metadata)
{
copiedMetadata[entry.Key] = entry.Value;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be escaping, as on line 331? I'm not sure, but you'll want to make sure there's a test. IIRC, everything is always escaped when it enters the MSBuild engine, and unescaped on the way out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copying metadata from a TaskItem to a TaskItem, so its escaped status should be the same in both cases. Just copying it over without escaping or unescaping it makes sense to me.

Line 331 is for ITaskItems that are not ITaskItem2s, and this will always be an ITaskItem2, so if some escaping/unescaping is warranted, it should be replacing entry.Value with destinationAsTaskItem.GetMetadataValueEscaped(entry.Key), but that's less performant, and I don't understand why that would be warranted. Doesn't mean I'm right, though.

{
referenceItem.SetMetadata(ItemMetadataNames.copyLocal, "false");
}
referenceItem.SetMetadata(ItemMetadataNames.copyLocal, reference.IsCopyLocal.ToString());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert to ternary operator

Also, comparing to a null value apparently always returns false, so the check I was using was bogus. This fixes that.
{
destinationAsTaskItem.Metadata = _metadata.Clone(); // Copy on write!
CopyOnWriteDictionary<string> copiedMetadata;
if (destinationAsTaskItem.Metadata == null || destinationAsTaskItem.Metadata.Count < _metadata.Count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment, please. And I think I personally would find it clearer if there were three explicit cases:

  1. Destination has no metadata.
  2. Destination has less metadata than source.
  3. Source has more metadata than destination.

@@ -2777,7 +2765,6 @@ private ITaskItem SetItemMetadata(List<ITaskItem> relatedItems, List<ITaskItem>
// Clone metadata.
referenceItem.CopyMetadataTo(item);
// We don't have a fusion name for scatter files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these comments please.

Comment on lines 2788 to 2789
// Set the FusionName metadata properly.
referenceItem.SetMetadata(ItemMetadataNames.fusionName, fusionName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "properly" mean? In general, how do you know this is safe? Is the local variable fusionName guaranteed to be empty for all the paths where it was previously unset? How do you know?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any of reference.GetSourceItems() has a fusionName, or reference.PrimarySourceItem has one, fusionName may have been set to an invalid value at this point. We remove it before forwarding referenceItem's metadata to other items, but we still need to set it properly at the end, i.e., give it the right fusionName.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind clarifying what you meant about fusionName being guaranteed to be empty? Previously, fusionName was set to a value near the top of this method, that value was passed on to a variety of "relatedItems," "satelliteItems," etc., and then it was unset from all of those other items. Here, instead, I'm unsetting it prior to using fusionName for any of those, and only giving it its value at the bottom, so CopyMetadataTo automatically skips fusionName.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meta comment is: "properly" is basically useless in a code comment. Comments describe what you want the code to do/why it's done, and of course you want to do things "properly". Instead, describe details. Here, you might have a comment that fusionName should be set last, so that it doesn't get copied to the derived items but is still available on the primary item.

item.RemoveMetadata(ItemMetadataNames.winmdImplmentationFile);
item.RemoveMetadata(ItemMetadataNames.imageRuntime);
item.RemoveMetadata(ItemMetadataNames.winMDFile);
Dictionary<string, string> metadata = new Dictionary<string, string>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a more descriptive name for this variable? removedMetadata?

if (!String.IsNullOrEmpty(meta))
{
metadata.Add(ItemMetadataNames.targetPath, meta);
}
item.RemoveMetadata(ItemMetadataNames.targetPath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this one has to be next to the add to the new list, can you move the others to be next to their equivalents too?

Comment on lines +2686 to +2689
List<string> relatedFileExtensions = reference.GetRelatedFileExtensions();
List<string> satellites = reference.GetSatelliteFiles();
List<string> serializationAssemblyFiles = reference.GetSerializationAssemblyFiles();
string[] scatterFiles = reference.GetScatterFiles();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut feeling is that these will all be small enough that keeping them live all at once instead of in sequence won't be a memory problem. But do you have data on that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. How would I get such data? Build OrchardCore and...somehow check whether something is being paged out?

// Set the FusionName metadata properly.
referenceItem.SetMetadata(ItemMetadataNames.fusionName, fusionName);

if (nonForwardableMetadata != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment please

Comment on lines 2772 to 2773
referenceItem.SetMetadata(ItemMetadataNames.winmdImplmentationFile, Path.GetFileName(reference.ImplementationAssembly));
nonForwardableMetadata?.Remove(ItemMetadataNames.winmdImplmentationFile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These inverted remove-from-the-removal-list things definitely need a comment please.

@@ -3455,7 +3455,7 @@ public void PrimaryFXAssemblyRefIsNotCopyLocal()

Assert.Single(t.ResolvedFiles);
Assert.Equal(Path.Combine(s_myVersion20Path, "System.Data.dll"), t.ResolvedFiles[0].ItemSpec);
Assert.Equal("false", t.ResolvedFiles[0].GetMetadata("CopyLocal"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert all these please since you went back to the ternary.

// This happens when the redist root is "null". This means there
// was no IsRedistRoot flag in the Redist XML (or there was no
// redist XML at all for this item).
referenceItem.SetMetadata(ItemMetadataNames.isRedistRoot, (bool)reference.IsRedistRoot ? "true" : "false");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cast stands out. Should it be

Suggested change
referenceItem.SetMetadata(ItemMetadataNames.isRedistRoot, (bool)reference.IsRedistRoot ? "true" : "false");
referenceItem.SetMetadata(ItemMetadataNames.isRedistRoot, reference.IsRedistRoot == true ? "true" : "false");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both ways work. We did a null check just before this, so it's either true or false.

else
{
copiedMetadata = destinationAsTaskItem.Metadata.Clone();
copiedMetadata.SetItems(Metadata.Where(entry => !destinationAsTaskItem.Metadata.ContainsKey(entry.Key)));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure between using Where or setting them one-by-one in this case. I did some benchmarking, and this is what I got:
image
Note that for each test case, I added NumStrings / 10 items to an ImmutableDictionary with NumStrings items in it previously.

You can see that using Where had fewer allocations for all three sizes. (Base is creating but not combining the dictionaries, so Where actually allocated very little.) It lost on time, though, unless the dataset was large, and I don't have data as to whether that's expected sometimes, always, or never. Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you instead do copiedMetadata = Metadata.Clone().SetItems(destinationAsTaskItem.Metadata)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and honestly, there are probably a lot of cases in which that would be better than what I currently have. What I'm trying to optimize for is adding metadata from a small dictionary to a large dictionary, and in that case, it would be better to add a few items individually rather than using SetItems to set a lot of items. I can run a test for that, though, and see if it's optimized nicely, in which case I'll switch to that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like doing it that way was 50-75% slower if the adding dictionary had 10x as many items. Next, trying it with a smaller ratio...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried comparing adding the larger dictionary to the smaller dictionary for ratios of 1, 2, 5, and 8, and it looks like setting with Where was faster for all but 1x size. I currently think that's the best option.
image

else
{
copiedMetadata = destinationAsTaskItem.Metadata.Clone();
copiedMetadata.SetItems(Metadata.Where(entry => !destinationAsTaskItem.Metadata.ContainsKey(entry.Key)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original non-optimized code down in the else branch uses string.IsNullOrEmpty(value) to decide if the metadatum should be overwritten. It doesn't seem to match your new logic here - you only check if the key exists in the metadata dictionary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

Comment on lines 2941 to 2946
string meta = item.GetMetadata(ItemMetadataNames.winmdImplmentationFile);
if (!String.IsNullOrEmpty(meta))
{
removedMetadata.Add(ItemMetadataNames.winmdImplmentationFile, meta);
}
item.RemoveMetadata(ItemMetadataNames.winmdImplmentationFile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Refactor these statements into a local function to be called 4 times?

/// </summary>
public void SetItems(IEnumerable<KeyValuePair<string, V>> items)
{
_backing = _backing.SetItems(items);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetItems and Where will also need to be implemented on the .NET 3.5 version of ImmutableDictionary in C:\src\msbuild\src\MSBuildTaskHost\Immutable\ImmutableDictionary.cs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Why isn't Where implemented by LINQ as it is for most IEnumerables?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course! I stand corrected, only SetItems is missing.

Forgind and others added 2 commits December 7, 2020 08:46
@@ -2818,6 +2747,61 @@ private ITaskItem SetItemMetadata(List<ITaskItem> relatedItems, List<ITaskItem>
}
}

if (reference.IsWinMDFile)
{
// The ImplementationAssembly is only set if the implementation file exits on disk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Typo exits.

Comment on lines 310 to 311
copiedMetadata = _metadata.Clone(); // Copy on write!
copiedMetadata.SetItems(destinationAsTaskItem.Metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is doing the same thing as the else right below. I would expect the !IsNullOrEmpty check here as well.

The else below:

  1. Take all of destination item's metadata.
  2. Overwrite with this item's metadata that are missing or empty on the destination item.

This block:

  1. Take all of this item's metadata.
  2. Overwrite with all of destination item's metadata.

For a missing destination metadatum this behaves the same, but for an empty one the SetItem call in this block will erroneously overwrite it with the empty value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right. I'm wondering if we're going at this the wrong way, though, since that would mean we'd need another Where statement to filter out non-empty metadata, which would make this optimization noticeably less of an optimization. Empty metadata should be equivalent (in all cases) to missing metadata, I think. If that's true, it would probably be more efficient to maintain an invariant that TaskItems don't have empty metadata. The COW dictionary storing the metadata is private, so that isn't a problem, but it's theoretically a breaking change unless we keep a separate list for "metadata added as empty" for the unfortunately public MetadataNames (and MetadataCount). What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't go as far as to introduce a separate list for empty metadata unless it proves impactful. Can you measure how much this optimization is gaining us with and without handling empty?

Copy link
Member Author

@Forgind Forgind Dec 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured 3 scenarios (building from a clean state (I appended /restore to the command line for this one), building with caches but not processes to connect to, and building with both) for 3 versions of TaskItem (as it is in master, with this change plus a Where statement on the else if branch, and with all the empty metadata stored in a HashSet once each for building OrchardCore. Note that I kept the other optimizations in this PR for all three versions of TaskItem. I used my computer instead of a VM, mostly because my VM seems to be slow...but I also tried to avoid doing anything too intense in the background. Note that I forgot to use -m for the first several tests, so I continued to not use it for the last few. If you want new results with having that enabled for all of them, I can do that.

I have full results saved away if you want to look. My first takeaway was that caches really matter. It took more than twice as long to build in every case if I deleted them and restored them. In contrast, presumably because OrchardCore is big, and I neglected to enable -m, having processes already running only saved about 170ms on average.

As far as differences between the versions of TaskItem, here are some times for the entire build in ms:

MSBuild execution times Clean build Process not running Process running
Master 108475 49657 49704
Where statements 108253 48140 47949
Empty HashSet 105464 49105 48708

Since it occurred to me that this has impacts outside RAR, I don't think just looking at RAR time makes sense. Are there other data you'd like to see? I'm not sure why the Where statements was so much worse than keeping an empty list for the clean state build—it's possible that was an outlier, since I only ran each of these once.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I was actually expecting you to measure this particular aspect with a microbenchmark. Basically, see if the SetItems call is helping (even with the Where) given the expected metadata count. I second Ben's question below about what the realistic number of metadata is. Maybe you could add a logging statement here while building OrchardCore and see where the breaking point is relative to the distribution of metadata count seen during the build. (Assuming there is a breaking point and the new code is actually a bit slower if the number of metadata is small.)

Measuring the entire build end-to-end would have to be repeated several times to get trustworthy results. As you wrote, it could have been an outlier and, honestly, this optimization is likely going to be well within noise when running the entire thing.

Copy link
Member Author

@Forgind Forgind Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Merging with @benvillalobos's comment) I'm currently a bit confused by the results I got when I tried to test this. I put in a simple check of how many custom metadata each taskitem has when its metadata is copied over, as well as the amount of metadata on the destination. The answer seemed to be that the destination never had any custom metadata for either MSBuild.sln or OrchardCore.sln, in which case we would never hit the case I optimized, and the time would be almost identical between them unless the other changes I made (which could have a minor effect, but I didn't expect to be really relevant) had much more of a positive effect than I'd expected. The same test suggested that the source TaskItems had 20-50 pieces of metadata, of which 15 were not custom metadata, so these wouldn't be huge numbers, but this operation also got called a lot, which might mean that a small optimization could have somewhat noticeable effects. What do you make of that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good argument against pursuing this optimization further. If you really wanted to be sure, you could run the build under a profiler to see how much time we're spending in this function - which would be the upper limit of what any micro-optimization here could give us.

{
copiedMetadata = _metadata.Clone(); // Copy on write!
}
else if (destinationAsTaskItem.Metadata.Count < _metadata.Count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely are we to hit the case where destinationAsTaskItem.Metadata.Count < _metadata.Count and both metadata count are very large? I don't think I've seen items with a standout number of metadata, but MSBuild has surprised me before.

Do we happen to have any idea of a "realistic average" number of metadata per TaskItem that comes through here? Something more detailed than just assuming it's the number of well-known metadata?

@Forgind
Copy link
Member Author

Forgind commented Jan 26, 2021

I added one more commit that should make it do exactly what the previous code path had done in all cases. Rather than fully reverting that part, I left it in, although I don't expect that to be a perf optimization anymore, just a potential improvement for some customers. If you want me to fully revert it and just leave the other part, I can, but it feels wrong to undo something that could theoretically improve perf if I don't think it could worsen perf.

else
{
copiedMetadata = destinationAsTaskItem.Metadata.Clone();
copiedMetadata.SetItems(Metadata.Where(entry => !destinationAsTaskItem.Metadata.TryGetValue(entry.Key, out string val) || String.IsNullOrEmpty(val)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super-nit: Directly read this._metadata instead of calling the this.Metadata getter here for consistency with the if branches?

@Forgind Forgind added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Feb 1, 2021
A TaskItem in AppDomain A is identical to a TaskItem in AppDomain B, but the TaskItems cannot access each other's fields, only their properties. This changes accessing a possibly remote TaskItem's _metadata field into accessing its Metadata property.
@Forgind Forgind dismissed rainersigwald’s stale review February 2, 2021 17:38

won't have the opportunity to re-review

@Forgind Forgind merged commit 8fb627e into dotnet:master Feb 6, 2021
@Forgind Forgind deleted the optimize-rar branch February 6, 2021 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants