Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache most popular string objects in String.InternalSubString #73540

Closed
Tracked by #79053
EgorBo opened this issue Aug 7, 2022 · 12 comments
Closed
Tracked by #79053

Cache most popular string objects in String.InternalSubString #73540

EgorBo opened this issue Aug 7, 2022 · 12 comments

Comments

@EgorBo
Copy link
Member

EgorBo commented Aug 7, 2022

I've analyzed a couple of memory snapshots using dotMemory and it feels to me that String.InternalSubString is the main source of duplicated strings on the heap, e.g. total heap statistics for dotnet publish -c Release ... (msbuild+roslyn+nuget+illink+crossgen) over a medium-size project:

image

500K string objects, mostly from Substring

And here is a memory snapshot taken in a random place during that dotnet publish:

image

Also, checked VS2022's snapshot, AvaloniaILSpy if you have good example to look at - please share

My impression that it's worth special casing "0", "1", "true", "false" in String.InternalSubString and those paths should not hurt perf much since string compare operations are unrolled, e.g.:

image

here it polls span's length and if it's 1 or 4 it runs a single compare with a constant to check if it's a known string.

Related: "Improve Density of GC heap by String Interning (de-duping) on Gen2 GC" #9022

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 7, 2022
@EgorBo EgorBo added area-System.Runtime and removed untriaged New issue has not been triaged by the area owner labels Aug 7, 2022
@ghost
Copy link

ghost commented Aug 7, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

I've analyzed a couple of memory snapshots using dotMemory and it feels to me that String.InternalSubString is the main source of duplicated strings on the heap. E.g. here is a snapshot for dotnet publish -c Release ... (msbuild+roslyn+nuget+illink+crossgen) over a medium-size project

image

dotMemory gives hints where these duplicated strings come from and in case of "true" all are from String.InternalSubString

Also, checked VS2022's snapshot, AvaloniaILSpy if you have good example to look at - please share

My impression that it's worth special casing "0", "1", "true", "false" in String.InternalSubString and those paths should not hurt perf much since string compare operations are unrolled, e.g.:

image

here it polls span's length and if it's 1 or 4 it runs a single compare with a constant to check if it's a known string.

Related: "Improve Density of GC heap by String Interning (de-duping) on Gen2 GC" #9022

Author: EgorBo
Assignees: -
Labels:

area-System.Runtime, untriaged

Milestone: -

@EgorBo EgorBo added this to the 8.0.0 milestone Aug 7, 2022
@EgorBo
Copy link
Member Author

EgorBo commented Aug 7, 2022

Prototype: EgorBo@1c73b3e

@jkotas
Copy link
Member

jkotas commented Aug 7, 2022

String substring does not feel like the right place to do this optimization. Have you looked ať the top callers that create these strings? Would it make more sense to fix the callers instead? We do caller specific de-duplication in number of places, for example in integer formatting.

@EgorBo
Copy link
Member Author

EgorBo commented Aug 7, 2022

In case of dotnet publish and "true" "false" it's

image

I assume it's here: https://github.com/dotnet/msbuild/blob/3ade6423189769545ddff2ffeeed37010ec57f4d/src/Build/Evaluation/Conditionals/Scanner.cs#L561-L641

@EgorBo
Copy link
Member Author

EgorBo commented Aug 7, 2022

Still, I think at least these two lines of code worth adding: https://github.com/EgorBo/runtime-1/blob/c487556f0a1131833ce22dfe40cb89fca88b5dcf/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs#L1850-L1854 - cheap handling for '0'..'9', right?

UInt32ToDecStr does:

internal static unsafe string UInt32ToDecStr(uint value)
{
// Intrinsified in mono interpreter
int bufferLength = FormattingHelpers.CountDigits(value);
// For single-digit values that are very common, especially 0 and 1, just return cached strings.
if (bufferLength == 1)
{
return s_singleDigitStringCache[value];
}

@jkotas
Copy link
Member

jkotas commented Aug 7, 2022

I am not sure. Substrings of this shape are likely going to be produced by allocation heavy parsers, they are unlikely to survive gen0 gc, and so getting rid of them is unlikely to make a difference in the app high water mark that tends to be dominated by gen2.

This optimization is trying to improve certain inefficient allocation heavy parsers a tiny bit, but makes all substring calls to pay for it.

@danmoseley
Copy link
Member

We used to intern some hard coded strings like true and false in MSBuild including in that codepath ("OpportunisticIntern"). I can't look at the code at the moment but I believe it was removed a little while ago. @Forgind @rainersigwald

@danmoseley
Copy link
Member

@EgorBo EgorBo closed this as completed Aug 8, 2022
@EgorBo
Copy link
Member Author

EgorBo commented Aug 8, 2022

Thanks for the feedback, I'll check if it's worth special-casing on msbuild's side

We used to intern some hard coded strings like true and false in MSBuild including in that codepath ("OpportunisticIntern"). I can't look at the code at the moment but I believe it was removed a little while ago. @Forgind @rainersigwald

From my understanding, string interning doesn't help with String.Concat/String.Substring which are the main sources of string allocations

@jkotas
Copy link
Member

jkotas commented Aug 8, 2022

msbuild has custom string deduplication engine that is able to take spans (that covers substrings) and string builders (that covers concatenated strings), without material if the string. Check description of dotnet/msbuild#5663 for details.

Roslyn or S.R.Metadata have custom string interning optimized for the domain specific patterns as well.

@rainersigwald
Copy link
Member

In case of dotnet publish and "true" "false" it's

image

For this callsite we definitely have some work to do spanifying the parsing, but the strings themselves should be extremely short-lived because of the deduplication engine @jkotas mentioned.

@dotnet dotnet locked as resolved and limited conversation to collaborators Sep 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants