Improved Map performance #10768

Closed · wants to merge 6 commits

Conversation

krauthaufen (Contributor):

PR for fslang-suggestion/940 with a faster Map<'Key, 'Value> implementation. The original code can be found here.

Creating this PR raised two questions from my side:

  • How should we proceed with the additional functionality I implemented? (Test and expose it now, drop it, or keep it and expose it later.)
  • Shall I add the new Set implementation to this PR, or create a separate one?

Cheers

@krauthaufen krauthaufen changed the title Improved Map performance [WIP] [WIP] Improved Map performance Dec 21, 2020
@krauthaufen (Contributor, Author):

Oh, and I had to reference System.ValueTuple in FSharp.Core.fsproj since I used struct tuples in several places. Shall we just keep it that way, or work around that problem?

@KevinRansom (Member) left a comment:

There is a huge amount here; it's going to take a while to work through everything. Is there a way you can think of to have the old and the new implementations in FSharp.Core at the same time and switch between them, whether via a preview switch, #defines, or some other mechanism? It feels like a big risk to take it in one gulp without a preview.

@@ -222,6 +222,7 @@
</ItemGroup>

<ItemGroup>
<PackageReference Include="System.ValueTuple" Version="4.4.0" />
Member:

This shouldn't be necessary; .NET Standard 2.0 contains System.ValueTuple.

Member:

It still makes me a bit nervous to take a dependency on it, since we have been avoiding it for so long in FSharp.Core, but it is absolutely right to do so.

Contributor (Author):

I got compiler errors when not referencing it, so I simply added it. We could certainly use KeyValuePair or some other struct for the implementation, but I saw several other suggestions that would require struct tuples, so I didn't go through removing it. I can certainly do so if you think it's the way to go.

Member:

The compile errors are surprising... reference it if you need to. I will take a look at why we don't get it via netstandard2.0.

Contributor:

Within the IDE (VS) I don't get compile-time errors when using a struct tuple; maybe something is wrong in the build process?

Member:

I will take a look, sometime.

Contributor (Author):

Hey, this may have been a false alarm: I accidentally pulled master first instead of main, then switched and cherry-picked my map changes, so it might work without the reference after all.
I can look into that tomorrow (European tomorrow, that is 😆).

Contributor (Author):

Okay, it really was unnecessary, so I removed it again (it would only have been needed for the net45 build).

@cartermp (Contributor):

@KevinRansom There's a lot of discussion to wade through here: fsharp/fslang-suggestions#940

But in general we should consider taking this and carefully review it with @dsyme. The CPU and memory gains are amazing, and it would greatly benefit our own perf scenarios since the compiler makes such heavy use of maps and sets. I profiled our compiler last Friday and it bubbled up in the profiling session (as opposed to before, when other things simply dominated traces).

@KevinRansom (Member):

> @KevinRansom There's a lot of discussion to wade through here: fsharp/fslang-suggestions#940
>
> But in general we should consider taking this and carefully review it with @dsyme. The CPU and memory gains are amazing, and it would greatly benefit our own perf scenarios since the compiler makes such heavy use of maps and sets.

@cartermp, yeah, it looks great. I was wondering if there was a way to reduce the risk; it may well be that there is not.

@krauthaufen (Contributor, Author):

@KevinRansom please note that there's quite a lot of unused code in this PR (mostly due to new combinators), and I'll definitely clean that up when you consider merging it. I can also do that sooner if you think it makes reviewing the code easier. Nonetheless, I think combinators like tryAt and especially GetSlice / Union(With) would greatly improve the usefulness of this data structure.

@KevinRansom (Member):

@krauthaufen --- It would certainly make the code review easier if there were just the code we want to merge. Keep the extra code somewhere safe though, I'm sure it's important :-)

krauthaufen added a commit to krauthaufen/MapNew that referenced this pull request Dec 22, 2020
* removed System.ValueTuple reference
@krauthaufen (Contributor, Author):

@KevinRansom I reduced the file by ~1000 lines.
The remaining things are more or less necessary for the current Map API to work.

@KevinRansom (Member):

@krauthaufen , thank you my friend.

@KevinRansom (Member):

@krauthaufen , there is a small conflict.

# Conflicts:
#	src/fsharp/FSharp.Core/map.fs
@krauthaufen (Contributor, Author) commented Dec 23, 2020:

@KevinRansom working on it. By the way, I noticed that I didn't include proper error messages in my exceptions; I will add them after the whole Christmas thing...

@krauthaufen (Contributor, Author):

Hey, I fixed the remaining problems and the implementation is now ready for reviews. I'm of course available for questions/suggestions/etc.

@forki (Contributor) commented Jan 4, 2021:

Maybe remove the WIP then.

@krauthaufen krauthaufen changed the title [WIP] Improved Map performance Improved Map performance Jan 4, 2021
@krauthaufen (Contributor, Author):

Good point 😆

@TIHan (Member) left a comment:

A lot has changed, but these are implementation details, and I'm confident the existing tests cover everything. From my view, this looks really good. The perf benchmarks look great too.

@cartermp (Contributor) left a comment:

This is great!

Once this is in, we'll also want to update the TaggedCollections module similar to #10192. Yes, unfortunately, the compiler itself actually uses a different set of types and APIs, as per #10188 (comment).

Would you be interested in making that update in a separate PR? If not it's fine, we can take care of it.

@cartermp (Contributor) commented Jan 5, 2021:

Regarding this:

> Shall I add the new Set implementation to this PR or create a separate one?

A separate PR would be fantastic, yes

@krauthaufen (Contributor, Author):

> Would you be interested in making that update in a separate PR? If not it's fine, we can take care of it.

@cartermp I can certainly update the TaggedCollections too, but maybe we should wait until Set/Map are merged?

> A separate PR would be fantastic, yes

I'll start putting a Set PR together then, but first I'd like to validate that the Set implementation is also really faster.

@forki (Contributor) commented Jan 5, 2021:

Hey @buybackoff since you were the one who optimized this last time: could you please take a look?

@buybackoff (Contributor) commented Jan 5, 2021:

Hi @forki

> Hey @buybackoff since you were the one who optimized this last time: could you please take a look?

I can only repeat one of the top comments, from @KevinRansom:

> There is a huge amount here, it's going to take a while to work through everything.

Such huge single-file changes scare me. My changes were a one-evening (map only) free lunch that even kept the code layout with pipes. There were some optimizations left, e.g. devirtualizing comparer calls (the comparer is always the default one for the main public Map/Set).

But initial thoughts:

  • It would be nice to see the most important benchmarks for Add/TryGet/Remove/Contains for 10K and 1M values.
  • If ofSeq/ofArray are important, probably nothing will beat System.Collections.Immutable. Actually, just wrapping it may be the right solution (even if not politically correct).

These changes will take a while just to understand what's going on.

I have a question:

In the benchmarks here on a 4 GHz machine, containsKey takes 22ns. In my PR on 3.2 GHz (i7-8700, turbo boost off, using netcore3.1), the "after" number was 35ns, or 28ns if we adjust only for CPU frequency. In the benchmark for these changes, the "before" number is 55ns, which would be 68ns on my machine. And in my table, the "before" line is 64ns. The same math applies to getItem/tryFind.

So do we really measure the performance of the version 5.0 here?

Why not to use the exact same benchmark setup and extend it with the additional items?

Update: I linked to server GC results, the math for workstation GC is even closer.
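The cross-machine comparison above rests on a simple frequency-scaling estimate. A minimal sketch of that arithmetic (my own illustrative helper, not anything from the PR), which reproduces the 22ns → ~28ns and 55ns → ~68ns adjustments:

```python
# A rough first-order model for comparing latencies measured on CPUs with
# different clock speeds: scale by the ratio of frequencies. It ignores
# IPC, cache, and turbo differences, so treat the result as an estimate.

def adjust_ns(t_ns: float, measured_ghz: float, target_ghz: float) -> float:
    """Scale a latency measured at measured_ghz to an equivalent at target_ghz."""
    return t_ns * measured_ghz / target_ghz

# 22 ns at 4 GHz corresponds to ~27.5 ns at 3.2 GHz (rounded to 28 above);
# 55 ns at 4 GHz corresponds to ~68.75 ns at 3.2 GHz (rounded to 68 above).
```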

@buybackoff (Contributor):

Well... I have just downloaded https://www.nuget.org/packages/FSharp.Core/5.0.0 and opened it with dotPeek 🤔🤦‍♂️

Actually my changes didn't make it to 5.0.

[screenshot: FSharp.Core 5.0.0 opened in dotPeek]

@krauthaufen (Contributor, Author):

> Actually my changes didn't make it to 5.0.

Oh my, so we actually know nothing about that, do we? 🤦‍♂️

I think most of my code will perform more or less equal to yours, but I think ofArray etc. will still be significantly faster (due to its completely different approach).
Of course we would need to test that.
Sorry for the confusion; I will do some benchmarks tomorrow.

Cheers

@cartermp (Contributor) left a comment:

Putting up a block for now until we figure out the FSharp.Core changes. Unfortunately, yes, it's a complicated process, and we slip up sometimes.

@buybackoff (Contributor) commented Jan 5, 2021:

Numbers vs current main branch. Using code from https://github.com/krauthaufen/MapNew

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.200-preview.20601.7
  [Host] : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG
  Main   : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG

Job=Main  MaxRelativeError=0.01  BuildConfiguration=LocalBuild  
IterationCount=10  WarmupCount=1  
Method Size Mean Error StdDev Rank Code Size Gen 0 Gen 1 Gen 2 Allocated
getItem 100 35.381 ns 0.8438 ns 0.5581 ns 1 124 B - - - -
getItem 10000 81.619 ns 1.2875 ns 0.8516 ns 2 124 B - - - -
getItem 1000000 109.375 ns 1.1262 ns 0.6702 ns 3 124 B - - - -
getItemNew 100 29.845 ns 0.3784 ns 0.2503 ns 1 282 B - - - -
getItemNew 10000 61.976 ns 0.3408 ns 0.2028 ns 2 282 B - - - -
getItemNew 1000000 93.273 ns 1.3122 ns 0.8679 ns 3 282 B - - - -
containsKey 100 30.116 ns 0.5946 ns 0.3538 ns 1 175 B - - - -
containsKey 10000 56.167 ns 1.1283 ns 0.7463 ns 2 175 B - - - -
containsKey 1000000 94.005 ns 2.1741 ns 1.4381 ns 3 175 B - - - -
containsKeyNew 100 31.732 ns 0.2470 ns 0.1470 ns 1 223 B - - - -
containsKeyNew 10000 61.257 ns 0.3315 ns 0.2193 ns 2 223 B - - - -
containsKeyNew 1000000 92.901 ns 1.1783 ns 0.7794 ns 3 223 B - - - -
itemCount 100 188.675 ns 2.8392 ns 1.8779 ns 1 96 B - - - -
itemCount 10000 34,833.115 ns 1,329.9992 ns 879.7121 ns 2 96 B - - - -
itemCount 1000000 8,603,079.688 ns 214,615.4686 ns 141,954.8443 ns 3 96 B - - - -
itemCountNew 100 1.745 ns 0.0415 ns 0.0274 ns 1 28 B - - - -
itemCountNew 10000 1.783 ns 0.0677 ns 0.0448 ns 1 28 B - - - -
itemCountNew 1000000 2.201 ns 0.0634 ns 0.0419 ns 2 28 B - - - -
iterForeach 100 3,177.495 ns 74.5034 ns 49.2794 ns 1 283 B 0.9727 - - 6120 B
iterForeach 10000 324,238.965 ns 4,494.5489 ns 2,674.6339 ns 2 283 B 95.2148 - - 600120 B
iterForeach 1000000 44,529,380.000 ns 1,319,685.6302 ns 872,890.3346 ns 3 283 B 9500.0000 - - 60000231 B
iterForeachNew 100 2,720.537 ns 39.5423 ns 26.1548 ns 1 1192 B 0.6409 - - 4040 B
iterForeachNew 10000 282,560.364 ns 4,493.2936 ns 2,350.0794 ns 2 1192 B 63.4766 - - 400040 B
iterForeachNew 1000000 37,056,267.857 ns 885,648.7498 ns 585,801.8121 ns 3 1192 B 6357.1429 - - 40000129 B
addItem 100 164.311 ns 3.2224 ns 2.1314 ns 1 603 B 0.0586 0.0003 - 369 B
addItem 10000 37,396.420 ns 749.3590 ns 445.9315 ns 2 603 B 11.0156 3.5156 - 69324 B
addItem 1000000 6,349,347.900 ns 89,281.6470 ns 59,054.2815 ns 3 603 B 1600.0000 90.0000 - 10096697 B
addItemNew 100 217.753 ns 3.9022 ns 2.5811 ns 1 178 B 0.0586 0.0003 - 369 B
addItemNew 10000 47,099.414 ns 820.5227 ns 542.7249 ns 2 178 B 11.0156 3.3594 - 69324 B
addItemNew 1000000 7,763,597.333 ns 119,464.6711 ns 71,091.5095 ns 3 220 B 1600.0000 90.0000 - 10096697 B
removeItem 100 12.204 ns 0.3569 ns 0.2361 ns 1 626 B 0.0064 - - 40 B
removeItem 10000 1,188.044 ns 13.2493 ns 8.7636 ns 2 626 B 0.6372 - - 4000 B
removeItem 1000000 121,771.113 ns 2,547.2179 ns 1,684.8269 ns 3 626 B 63.7500 - - 400001 B
removeItemNew 100 11.795 ns 0.3364 ns 0.2225 ns 1 145 B 0.0064 - - 40 B
removeItemNew 10000 1,146.194 ns 18.3236 ns 12.1199 ns 2 145 B 0.6372 - - 4000 B
removeItemNew 1000000 156,384.700 ns 46,225.3597 ns 30,575.2133 ns 3 145 B 60.0000 - - 400000 B

@krauthaufen (Contributor, Author) commented Jan 5, 2021:

Hey, cool that you did some benchmarks; I will add mine as soon as I have time.
As far as I can see, the current implementation seems to be a little faster in some cases; nonetheless I think the ofArray performance will be significantly better.

So if that turns out to be true I'll just submit that as a PR.

@yatli (Contributor) commented Jan 6, 2021:

@buybackoff your benchmark job is running in DEBUG:

[Host] : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG
Main : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG

Job=Main MaxRelativeError=0.01 BuildConfiguration=LocalBuild

I'm not sure why that happens -- the job should've been run in a child process in release mode (no DEBUG at the end of the line).
Try dotnet run -c Release?

@buybackoff (Contributor):

@yatli

It's this dotnet/BenchmarkDotNet#1493.

Using the workaround doesn't change the results by more than noise.

@krauthaufen (Contributor, Author):

Hey, I finally got to run the benchmarks, and the results (for count=100) seem consistent with what we've seen so far.
My implementation (New in the benchmarks) is a tiny bit slower for some operations (add, remove, etc.) and a little faster for some others (tryFind, etc.). The operations that are drastically faster are toArray (due to the known count) and ofArray (due to the merge-sort implementation).

After all I see two options here:

  1. use the new Map implementation (and tweak it a little s.t. add/remove are on par with current)
  2. use the current implementation and improve at least ofArray/ofList/etc. (maybe also include the count in inner nodes?)

What do you think?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.200-preview.20601.7
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT DEBUG
  Job-MJAPUR : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Server=False  IterationTime=100.0000 ms  MaxIterationCount=20
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Map_add 100 14,666.15 ns 97.249 ns 86.209 ns 1.00 0.00 6.1047 - - 38992 B
MapNew_add 100 16,628.96 ns 135.702 ns 120.296 ns 1.13 0.01 5.9211 - - 37360 B
Map_remove 100 14,483.32 ns 151.238 ns 134.069 ns 1.00 0.00 5.3571 - - 34368 B
MapNew_remove 100 15,675.43 ns 208.992 ns 174.518 ns 1.08 0.02 4.6875 - - 29608 B
Map_ofArray 100 11,584.76 ns 150.759 ns 125.890 ns 1.00 0.00 4.2098 - - 28360 B
MapNew_ofArray 100 5,507.70 ns 64.988 ns 54.268 ns 0.48 0.01 0.8741 - - 5600 B
Map_ofList 100 11,040.13 ns 219.175 ns 234.515 ns 1.00 0.00 4.3478 0.1087 - 27304 B
MapNew_ofList 100 6,011.46 ns 72.356 ns 67.682 ns 0.54 0.01 1.0664 - - 6792 B
Map_ofSeq 100 12,125.45 ns 115.617 ns 102.491 ns 1.00 0.00 4.5166 0.1221 - 28552 B
MapNew_ofSeq 100 6,193.12 ns 102.929 ns 96.280 ns 0.51 0.01 1.0386 - - 6792 B
Map_toArray 100 1,547.31 ns 23.038 ns 20.423 ns 1.00 0.00 1.0110 0.0153 - 6424 B
MapNew_toArray 100 776.14 ns 5.881 ns 4.911 ns 0.50 0.01 0.5076 0.0077 - 3224 B
Map_toList 100 1,074.89 ns 13.688 ns 12.804 ns 1.00 0.00 0.8908 0.0220 - 5600 B
MapNew_toList 100 1,127.79 ns 12.803 ns 10.691 ns 1.05 0.02 0.8852 0.0224 - 5600 B
Map_enumerate 100 3,105.46 ns 48.375 ns 45.250 ns 1.00 0.00 1.1766 - - 7800 B
MapNew_enumerate 100 2,975.47 ns 22.260 ns 20.822 ns 0.96 0.01 0.6194 - - 4000 B
Map_toSeq_enum 100 4,895.17 ns 41.850 ns 37.098 ns 1.00 0.00 1.6563 - - 10472 B
MapNew_toSeq_enum 100 4,999.20 ns 55.518 ns 49.215 ns 1.02 0.01 1.0408 - - 6600 B
Map_containsKey_all 100 2,417.70 ns 17.477 ns 15.493 ns 1.00 0.00 - - - -
MapNew_containsKey_all 100 2,358.98 ns 40.268 ns 41.353 ns 0.98 0.02 - - - -
Map_containsKey_nonexisting 100 27.02 ns 0.216 ns 0.180 ns 1.00 0.00 - - - -
MapNew_containsKey_nonexisting 100 22.03 ns 0.168 ns 0.157 ns 0.82 0.01 - - - -
Map_tryFind 100 31.46 ns 0.288 ns 0.255 ns 1.00 0.00 0.0035 - - 24 B
MapNew_tryFind 100 23.94 ns 0.166 ns 0.147 ns 0.76 0.01 0.0038 - - 24 B
Map_tryFind_nonexisting 100 24.36 ns 0.379 ns 0.336 ns 1.00 0.00 - - - -
MapNew_tryFind_nonexisting 100 22.16 ns 0.220 ns 0.195 ns 0.91 0.02 - - - -
Map_remove_all 100 10,491.85 ns 49.821 ns 38.897 ns 1.00 0.00 4.4044 - - 28216 B
MapNew_remove_all 100 13,698.76 ns 66.820 ns 55.797 ns 1.31 0.01 3.8631 - - 24712 B
Map_exists 100 308.65 ns 1.629 ns 1.444 ns 1.00 0.00 0.0031 - - 24 B
MapNew_exists 100 351.49 ns 2.939 ns 2.749 ns 1.14 0.01 0.0035 - - 24 B
Map_fold 100 258.62 ns 3.153 ns 2.633 ns 1.00 0.00 0.0026 - - 24 B
MapNew_fold 100 328.32 ns 2.504 ns 2.342 ns 1.27 0.01 0.0033 - - 24 B
Map_foldBack 100 264.86 ns 2.028 ns 1.797 ns 1.00 0.00 0.0027 - - 24 B
MapNew_foldBack 100 333.76 ns 2.223 ns 1.970 ns 1.26 0.01 0.0034 - - 24 B

@buybackoff (Contributor):

> What do you think?

  • Virtual methods would be the first thing I would try to remove if I wanted to optimize a hot path. It's actually surprising to me that they don't hurt performance more; maybe devirtualization kicks in. Also, there are two isinst checks on hot paths, instead of one check for a sealed class plus a null check.
  • It's more OOP than FP. If no one cares about that, then the best optimization would really be just to wrap System.Collections.Immutable: minus a couple of thousand lines of code in total, and faster. But a concise functional recursive AVL tree implementation is nice on its own.
  • In my view, get/contains/add/remove with structural sharing (i.e. immutability, not thread safety) is the most important usage of immutable maps/sets. If immutability is used only for thread safety, then trees are the wrong tool: ConcurrentDictionary is 10x faster. That said, bulk operations such as AddRange may be important. Serialization is a strong point. If one cares about performance, built-in serialization is unlikely ever to be of any use. For latency, I would prefer to deserialize in a streaming manner, i.e. from IEnumerable. For throughput with a big number of values, creating a copy of an array and sorting it just takes too much memory (but less GC in the new implementation, see the next point). With immutable collections, event sourcing and serializing deltas works best, so that one can have a snapshot of a map at multiple points.
  • System.Collections.Immutable has a clever way of optimizing bulk create operations: they have a field with IsFrozen state, and they mutate the nodes during bulk inserts. For the current case, we could treat all nodes as mutable in the ofArray/ofSeq methods until those methods return. That should reduce GC a lot.
  • Size matters, both the line count (912 current vs 1705 new) and compiled code size. E.g. there are 3 different implementations of Add. In most cases the one from a node is used, then at the last step the one from a leaf. I'm not sure what BDN reports as the code size; maybe just the size of the entry point, which wouldn't count all the code touched during execution.
  • Set and compiler collections are already done with the current version.
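The IsFrozen bulk-create trick mentioned in the list above can be illustrated with a toy example. This is a hedged sketch in Python on an unbalanced binary search tree, not the actual System.Collections.Immutable or FSharp.Core code; the names (`Node`, `insert`, `freeze`, `of_seq`) are invented for illustration.

```python
# Idea: during bulk construction, nodes we just allocated are still
# private, so we may mutate them in place instead of copying. Once the
# bulk operation returns, every node is frozen and later inserts fall
# back to the usual copy-on-write path with structural sharing.

class Node:
    __slots__ = ("key", "value", "left", "right", "frozen")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right
        self.frozen = False            # mutable until freeze() is called

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key == node.key:
        child_l, child_r, new_val = node.left, node.right, value
    elif key < node.key:
        child_l, child_r, new_val = insert(node.left, key, value), node.right, node.value
    else:
        child_l, child_r, new_val = node.left, insert(node.right, key, value), node.value
    if node.frozen:                    # immutable path: allocate a copy
        return Node(node.key, new_val, child_l, child_r)
    node.left, node.right, node.value = child_l, child_r, new_val  # bulk path: mutate
    return node

def freeze(node):
    if node is not None and not node.frozen:
        node.frozen = True
        freeze(node.left)
        freeze(node.right)

def of_seq(pairs):
    root = None
    for k, v in pairs:                 # no copy-on-write churn during build
        root = insert(root, k, v)
    freeze(root)                       # from here on: structural sharing
    return root

def to_list(node):                     # in-order traversal, for inspection
    if node is None:
        return []
    return to_list(node.left) + [(node.key, node.value)] + to_list(node.right)
```

The tree stays unbalanced here to keep the sketch short; the point is only the frozen/mutable distinction that trades GC pressure for a one-time freeze pass.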

Huge improvements here are only for bulk creation, at the cost of array creation/copying and pre-sorting the data to keep the tree balanced during addition. That could be done using the Array.Sort(keys[], values[], ..., comparer) method, without special cases for small counts, in a new method.

So, with the arguments above, I'm for option 2, and not only because it's mine and has never seen the light of day 😄

@krauthaufen (Contributor, Author):

  1. Devirtualization can't kick in here since the node types are unknown, and I don't think there's any type testing involved in the virtual calls (basically a pointer fetch, an addition, and an indirect call). I have several old benchmarks showing that virtual calls are way faster than type tests (which also require some kind of null check) for large branching factors (obviously not for three, though).
  2. ImmutableDictionary etc. weren't that awesome when I did my HashMap benchmarks, but maybe that has changed since then.
  3. The mutable in-place add is actually implemented in my map and used in ofArray for small counts.

Nonetheless, I totally get your point and I'm absolutely fine with keeping the current implementation. I actually implemented mine to get additional functions and then stumbled upon the performance gains.

Cheers

@buybackoff (Contributor):

> Devirtualization can't kick in here since the node types are unknown, and I don't think there's any type testing involved in the virtual calls

The virtual calls in many cases come after an isinst check, so maybe that works. Unfortunately all the new rules are scattered across the dotnet/runtime repo; I couldn't find a single place listing all the cases after the multiple improvements by @AndyAyersMS.

> ImmutableDictionary etc. weren't that awesome when I did my HashMap benchmarks, but maybe that has changed since then.

From a very old discussion, the benchmarks were faster: #5360 (comment)

> The mutable in-place add is actually implemented in my map and used in ofArray for small counts.

I mean methods like map.AddRange(pairs: IEnumerable<K,V>) -> map, i.e. optimizing multiple operations with less GC while keeping immutability.

@krauthaufen (Contributor, Author):

Here's my (also very old) HashMap comparison: https://github.com/krauthaufen/ImmutableHashCollections. It also included Map (which is of course a different data structure) for reference, and showed that add wasn't too good in ImmutableDictionary; the lookup, however, was terribly fast.

@AndyAyersMS (Member):

> find a single place with a list of all cases

Unfortunately there is no simple way to describe what can be devirtualized; it depends a lot on the generated IL, what other optimizations the jit can do, etc.

I can look if you like, but it may be a day or two before I have time. If you know how to use a checked JIT, you can enable JIT dumps and see what the JIT itself has to say.

Also, we are always looking for good F# benchmarks, please consider contributing some to https://github.com/dotnet/performance.

@buybackoff (Contributor):

@AndyAyersMS
Thanks for confirming that I didn't miss anything.

> I can look if you like, but it may be a day or two before I have time. If you know how to use a checked JIT, you can enable JIT dumps and see what the JIT itself has to say.

I haven't even profiled the new implementation, so I can't say whether it was devirtualized or not. It's very easy to see in dotTrace.

For this discussion, static readonly devirtualization is very relevant to your work. It's impossible to create a static readonly field and access it directly in F#. I think if that were possible, the default comparer for primitives could be inlined. Right now we do an interface call even for the int32 comparer; dotTrace shows 14% of time in Compare for tryGetValue.

@buybackoff (Contributor):

I improved the performance of the current version at a small memory cost here: #10845

@cartermp (Contributor) commented Jan 7, 2021:

@krauthaufen in light of the discussion here, how possible is it to extract the improvements you've made (e.g. to ToArray) independently of the rest?

@krauthaufen (Contributor, Author):

Hey, after doing yet another map implementation (Yam) with the insights gained here, I managed to make most operations a bit faster (note especially the ofArray performance, which now doesn't copy anything and therefore runs with O(1) scratch memory). @cartermp this basically treats the tree as mutable during build (in ofArray, ofList and ofSeq) and reuses as many nodes as possible.
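A common way to get fast bulk construction like the ofArray described above is to sort the input once and then build the balanced tree directly from the sorted span in O(n). The following is a hedged Python sketch of that general technique, not the PR's actual F# algorithm; `of_array`, the last-write-wins dedup, and the tuple node encoding are my own illustrative choices.

```python
# Sketch: ofArray via one sort plus a bottom-up balanced build.
# Duplicate keys keep the last value, matching the overwrite semantics
# of repeated adds. Nodes are plain tuples (left, key, value, right).

def of_array(pairs):
    pairs = sorted(pairs, key=lambda kv: kv[0])   # single stable sort
    dedup = []
    for k, v in pairs:
        if dedup and dedup[-1][0] == k:
            dedup[-1] = (k, v)                    # later entry wins
        else:
            dedup.append((k, v))

    def build(lo, hi):
        # build a perfectly balanced subtree for dedup[lo:hi]
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        k, v = dedup[mid]
        return (build(lo, mid), k, v, build(mid + 1, hi))

    return build(0, len(dedup))

def to_list(node):                                # in-order traversal
    if node is None:
        return []
    left, k, v, right = node
    return to_list(left) + [(k, v)] + to_list(right)
```

Compared with inserting n items one by one (O(n log n) with rebalancing and copy-on-write allocation at every step), this allocates exactly one node per surviving key.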

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.746 (2004/?/20H1)
Intel Core i7-4930K CPU 3.40GHz (Haswell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  Job-WQVKON : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Server=False
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Yam_add 100 15,401.56 ns 243.887 ns 216.199 ns 0.80 0.02 5.8289 - - 36608 B
Map_add 100 19,254.27 ns 378.406 ns 566.381 ns 1.00 0.00 5.7373 - - 36136 B
Yam_remove 100 15,461.79 ns 307.595 ns 554.657 ns 0.79 0.04 4.9744 - - 31232 B
Map_remove 100 19,426.96 ns 381.377 ns 546.959 ns 1.00 0.00 5.4321 - - 34144 B
Yam_ofArray 100 9,095.73 ns 181.949 ns 367.546 ns 0.60 0.02 0.7019 0.0153 - 4416 B
Map_ofArray 100 15,010.79 ns 194.285 ns 162.237 ns 1.00 0.00 4.4556 0.1373 - 28000 B
Yam_toArray 100 1,379.79 ns 27.016 ns 42.850 ns 0.65 0.03 0.5131 0.0114 - 3224 B
Map_toArray 100 2,122.62 ns 41.669 ns 66.091 ns 1.00 0.00 1.0223 0.0191 - 6424 B
Yam_containsKey_all 100 3,586.66 ns 55.260 ns 51.690 ns 1.00 0.08 - - - -
Map_containsKey_all 100 3,622.26 ns 71.927 ns 176.438 ns 1.00 0.00 - - - -
Yam_containsKey_nonexisting 100 28.18 ns 0.570 ns 0.610 ns 0.85 0.02 - - - -
Map_containsKey_nonexisting 100 33.33 ns 0.581 ns 0.515 ns 1.00 0.00 - - - -
Yam_exists 100 380.36 ns 5.128 ns 4.546 ns 0.92 0.01 0.0038 - - 24 B
Map_exists 100 413.72 ns 3.673 ns 3.436 ns 1.00 0.00 0.0038 - - 24 B

When maintaining a count per inner node (allowing for O(1) count and O(log N) positional queries), the results are still relatively acceptable:
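The per-node count idea can be sketched like this. It is an illustrative, unbalanced toy in Python, not the PR's F# code; `SNode`, `add`, and `try_at` are invented names (though a `tryAt` combinator was floated earlier in the thread).

```python
# Each node caches the size of its subtree. Count becomes an O(1) field
# read, and "give me the i-th smallest entry" becomes an O(height) walk
# guided by the cached left-subtree sizes. A real implementation would
# rebalance (AVL); this toy omits that to stay short.

class SNode:
    __slots__ = ("key", "value", "left", "right", "size")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right
        self.size = 1 + size(left) + size(right)   # cached subtree size

def size(node):
    return 0 if node is None else node.size

def add(node, key, value):
    """Copy-on-write insert; sizes are recomputed on the rebuilt path."""
    if node is None:
        return SNode(key, value)
    if key < node.key:
        return SNode(node.key, node.value, add(node.left, key, value), node.right)
    if key > node.key:
        return SNode(node.key, node.value, node.left, add(node.right, key, value))
    return SNode(key, value, node.left, node.right)

def try_at(node, i):
    """Return the i-th smallest (key, value) pair, or None if out of range."""
    if node is None or i < 0 or i >= node.size:
        return None
    l = size(node.left)
    if i < l:
        return try_at(node.left, i)
    if i == l:
        return (node.key, node.value)
    return try_at(node.right, i - l - 1)
```

The cost is one extra int per node and slightly more work in add/remove, which matches the small slowdowns visible in the second benchmark table below.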

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.746 (2004/?/20H1)
Intel Core i7-4930K CPU 3.40GHz (Haswell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  Job-NTZQCI : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Server=False
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Yam_add 100 17,137.13 ns 338.754 ns 527.398 ns 0.88 0.03 5.7983 - - 36424 B
Map_add 100 19,499.56 ns 309.162 ns 289.190 ns 1.00 0.00 5.8594 - - 36832 B
Yam_remove 100 17,596.50 ns 344.689 ns 423.309 ns 0.98 0.03 5.8289 - - 36672 B
Map_remove 100 17,981.42 ns 359.338 ns 413.814 ns 1.00 0.00 5.4321 - - 34112 B
Yam_ofArray 100 11,144.64 ns 221.712 ns 217.751 ns 0.67 0.02 0.7782 0.0153 - 4912 B
Map_ofArray 100 16,669.08 ns 331.241 ns 622.151 ns 1.00 0.00 4.5776 0.1221 - 28744 B
Yam_toArray 100 1,025.46 ns 18.328 ns 30.114 ns 0.48 0.03 0.5131 0.0114 - 3224 B
Map_toArray 100 2,137.10 ns 40.995 ns 66.199 ns 1.00 0.00 1.0223 0.0191 - 6424 B
Yam_containsKey_all 100 3,575.39 ns 70.659 ns 96.719 ns 1.05 0.03 - - - -
Map_containsKey_all 100 3,428.17 ns 67.072 ns 74.550 ns 1.00 0.00 - - - -
Yam_containsKey_nonexisting 100 33.52 ns 0.708 ns 1.038 ns 1.08 0.03 - - - -
Map_containsKey_nonexisting 100 31.19 ns 0.516 ns 0.403 ns 1.00 0.00 - - - -
Yam_exists 100 366.18 ns 7.328 ns 9.268 ns 0.95 0.03 0.0038 - - 24 B
Map_exists 100 385.18 ns 5.914 ns 6.328 ns 1.00 0.00 0.0038 - - 24 B

However, note that the baseline Map numbers differ between the two runs, so I'll investigate.

If you're still interested in the (now smaller) improvements, I can create a new PR (or adapt this one). Note that I have something in mind for keeping the overall count (not per inner node, but globally per map) that should be quite efficient.
Please let me know what you think. The implementation is here.

@cartermp (Contributor):

Interesting. I think it'd be worth looking at a fresh PR with your findings, with measurements against the latest main.

cartermp pushed a commit that referenced this pull request Feb 26, 2021
The improvements to Map have only made it to 5.0.1 nuget package, not 5.0.0
see #10768 (comment)
cc @buybackoff  @krauthaufen
@KevinRansom (Member):

@krauthaufen,

Big changes like this scare us to death; there are a lot of conflicts, and there hasn't been much traffic for 6 months. Could you possibly find a way to parcel the changes up into smaller PRs that are easier to digest and less scary?

Closing for now.

Thanks,

Kevin
