Improved Map performance #10768

Closed · wants to merge 6 commits

Conversation

krauthaufen (Contributor):

PR for fslang-suggestion/940 with a faster Map<'Key, 'Value> implementation. The original code can be found here.

Creating this PR raised two questions from my side:

  • How should we proceed with the additional functionality I implemented? (Test and expose it now, drop it, or keep it and expose it later.)
  • Shall I add the new Set implementation to this PR, or create a separate one?

Cheers

@krauthaufen krauthaufen changed the title Improved Map performance [WIP] [WIP] Improved Map performance Dec 21, 2020
@krauthaufen (Contributor, Author):

Oh, and I had to reference System.ValueTuple in FSharp.Core.fsproj since I used struct tuples in several places. Shall we just keep it that way, or work around that problem?

@KevinRansom (Member) left a comment:

There is a huge amount here; it's going to take a while to work through everything. Is there a way you can think of to have the old and the new implementations in FSharp.Core at the same time and switch between them, whether via a preview switch, #defines, or some other mechanism? It feels like a big risk to take it in one gulp without a preview.

@@ -222,6 +222,7 @@
</ItemGroup>

<ItemGroup>
<PackageReference Include="System.ValueTuple" Version="4.4.0" />
Member:

This shouldn't be necessary; .NET Standard 2.0 contains System.ValueTuple.

Member:

It still makes me a bit nervous to take a dependency on it, since we have been avoiding it for so long in FSharp.Core, but it is absolutely right to do so.

Contributor (Author):

I got compiler errors when not referencing it, so I simply added it. We could certainly use KeyValuePair or some other struct for the implementation, but I saw several other suggestions that would require struct tuples, so I didn't go through removing it. I can certainly do so if you think it's the way to go.

Member:

The compile errors are surprising... reference it if you need to. I will take a look at why we don't get it via netstandard2.0.

Contributor:

Within the IDE (VS) I don't get compile-time errors when using a struct tuple; maybe something is wrong in the build process?

Member:

I will take a look, sometime.

Contributor (Author):

Hey, this may have been a false alarm: I accidentally pulled master first instead of main, then switched and cherry-picked my map changes, so it might work without the reference after all.
I can look into that tomorrow (European tomorrow, that is 😆).

Contributor (Author):

Okay, it really was unnecessary, so I removed it again (it would only have been needed for the net45 build).

@cartermp (Contributor):

@KevinRansom There's a lot of discussion to wade through here: fsharp/fslang-suggestions#940

But in general we should consider taking this and carefully review it with @dsyme. The CPU and memory gains are amazing, and it would greatly benefit our own perf scenarios since the compiler makes such heavy use of maps and sets. I profiled our compiler last Friday and it bubbled up in the profiling session (as opposed to before, when other things simply dominated traces).

@KevinRansom (Member):

> @KevinRansom There's a lot of discussion to wade through here: fsharp/fslang-suggestions#940
>
> But in general we should consider taking this and carefully review it with @dsyme. The CPU and memory gains are amazing, and it would greatly benefit our own perf scenarios since the compiler makes such heavy use of maps and sets.

@cartermp, yeah, it looks great. I was wondering if there was a way to reduce the risk; it may well be that there is not.

@krauthaufen (Contributor, Author):

@KevinRansom please note that there's quite a lot of unused code in this PR (mostly due to new combinators), and I'll definitely clean that up when you consider merging it. I can also do that sooner if you think it makes reviewing the code easier. Nonetheless, I think combinators like tryAt and especially GetSlice / Union(With) would greatly improve the usefulness of this data structure.

@KevinRansom (Member):

@krauthaufen --- It would certainly make the code review easier if there were just the code we want to merge. Keep the extra code somewhere safe though, I'm sure it's important :-)

krauthaufen added a commit to krauthaufen/MapNew that referenced this pull request Dec 22, 2020
* removed System.ValueTuple reference
@krauthaufen (Contributor, Author):

@KevinRansom I reduced the file by ~1000 lines.
The remaining things are more or less necessary for the current Map API to work.

@KevinRansom (Member):

@krauthaufen , thank you my friend.

@KevinRansom (Member):

@krauthaufen , there is a small conflict.

# Conflicts:
#	src/fsharp/FSharp.Core/map.fs
@krauthaufen (Contributor, Author) commented Dec 23, 2020:

@KevinRansom working on it. By the way, I noticed that I didn't include proper error messages in my exceptions; I will add them after the whole Christmas thing...

@krauthaufen (Contributor, Author):

Hey, I fixed the remaining problems and the implementation is now ready for reviews. I'm of course available for questions/suggestions/etc.

@forki (Contributor) commented Jan 4, 2021:

Maybe remove the WIP then.

@krauthaufen krauthaufen changed the title [WIP] Improved Map performance Improved Map performance Jan 4, 2021
@krauthaufen (Contributor, Author):

Good point 😆

@TIHan (Member) left a comment:

A lot has changed, but these are implementation details, and I'm confident the existing tests cover everything. From my view, this looks really good. The perf benchmarks look great too.

@cartermp (Contributor) left a comment:

This is great!

Once this is in, we'll also want to update the TaggedCollections module similar to #10192. Yes, unfortunately, the compiler itself actually uses a different set of types and APIs, as per #10188 (comment).

Would you be interested in making that update in a separate PR? If not it's fine, we can take care of it.

@cartermp (Contributor) commented Jan 5, 2021:

Regarding this:

> Shall I add the new Set implementation to this PR or create a separate one?

A separate PR would be fantastic, yes

@krauthaufen (Contributor, Author):

> Would you be interested in making that update in a separate PR? If not it's fine, we can take care of it.

@cartermp I can certainly update the TaggedCollections too, but maybe we should wait until Set/Map are merged?

> A separate PR would be fantastic, yes

I'll start putting a Set PR together then, but first I'd like to validate that the Set implementation is also really faster.

@forki (Contributor) commented Jan 5, 2021:

Hey @buybackoff since you were the one who optimized this last time: could you please take a look?

@buybackoff (Contributor) commented Jan 5, 2021:

Hi @forki

> Hey @buybackoff since you were the one who optimized this last time: could you please take a look?

I can only repeat one of the top comments, from @KevinRansom:

> There is a huge amount here, it's going to take a while to work through everything.

Such huge single-file changes scare me. My changes were a one-evening (map only) free lunch that even kept the code layout with pipes. There were some optimizations left, e.g. devirtualizing comparer calls (the comparer is always the default one for the main public Map/Set).

But initial thoughts:

  • It would be nice to see the most important benchmarks for Add/TryGet/Remove/Contains for 10K and 1M values.
  • If ofSeq/ofArray are important, probably nothing will beat System.Collections.Immutable. Actually, just wrapping it may be the right solution (even if not politically correct).

These changes will take a while just to understand what's going on.

I have a question:

In the benchmarks here on a 4 GHz machine, containsKey takes 22ns. In my PR on 3.2 GHz (i7-8700, turbo boost off, using netcore3.1), the "after" number was 35ns, or 28ns if we adjust only for CPU frequency. In the benchmark for these changes, the "before" number is 55ns, which would be 68ns on my machine. And in my table, the "before" line is 64ns. The same math applies to getItem/tryFind.

So do we really measure the performance of the version 5.0 here?

Why not to use the exact same benchmark setup and extend it with the additional items?

Update: I linked to server GC results, the math for workstation GC is even closer.
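The cross-machine comparison above rests on a simple frequency-scaling estimate. A minimal sketch of that arithmetic (my own illustrative helper, not anything from the PR), which reproduces the 22ns → ~28ns and 55ns → ~68ns adjustments:

```python
# A rough first-order model for comparing latencies measured on CPUs with
# different clock speeds: scale by the ratio of frequencies. It ignores
# IPC, cache, and turbo differences, so treat the result as an estimate.

def adjust_ns(t_ns: float, measured_ghz: float, target_ghz: float) -> float:
    """Scale a latency measured at measured_ghz to an equivalent at target_ghz."""
    return t_ns * measured_ghz / target_ghz

# 22 ns at 4 GHz corresponds to ~27.5 ns at 3.2 GHz (rounded to 28 above);
# 55 ns at 4 GHz corresponds to ~68.75 ns at 3.2 GHz (rounded to 68 above).
```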

@buybackoff (Contributor):

Well... I have just downloaded https://www.nuget.org/packages/FSharp.Core/5.0.0 and opened it with dotPeek 🤔🤦‍♂️

Actually my changes didn't make it to 5.0.

[screenshot: FSharp.Core 5.0.0 opened in dotPeek]

@krauthaufen (Contributor, Author):

> Actually my changes didn't make it to 5.0.

Oh my, so we actually know nothing about that, do we? 🤦‍♂️

I think most of my code will perform more or less equal to yours, but I think ofArray etc. will still be significantly faster (due to its completely different approach).
Of course we would need to test that.
Sorry for the confusion; I will do some benchmarks tomorrow.

Cheers

@cartermp (Contributor) left a comment:

Putting up a block for now until we figure out the FSharp.Core changes. Unfortunately, yes, it's a complicated process, and we slip up sometimes.

@buybackoff (Contributor) commented Jan 5, 2021:

Numbers vs current main branch. Using code from https://github.com/krauthaufen/MapNew

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.200-preview.20601.7
  [Host] : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG
  Main   : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG

Job=Main  MaxRelativeError=0.01  BuildConfiguration=LocalBuild  
IterationCount=10  WarmupCount=1  
Method Size Mean Error StdDev Rank Code Size Gen 0 Gen 1 Gen 2 Allocated
getItem 100 35.381 ns 0.8438 ns 0.5581 ns 1 124 B - - - -
getItem 10000 81.619 ns 1.2875 ns 0.8516 ns 2 124 B - - - -
getItem 1000000 109.375 ns 1.1262 ns 0.6702 ns 3 124 B - - - -
getItemNew 100 29.845 ns 0.3784 ns 0.2503 ns 1 282 B - - - -
getItemNew 10000 61.976 ns 0.3408 ns 0.2028 ns 2 282 B - - - -
getItemNew 1000000 93.273 ns 1.3122 ns 0.8679 ns 3 282 B - - - -
containsKey 100 30.116 ns 0.5946 ns 0.3538 ns 1 175 B - - - -
containsKey 10000 56.167 ns 1.1283 ns 0.7463 ns 2 175 B - - - -
containsKey 1000000 94.005 ns 2.1741 ns 1.4381 ns 3 175 B - - - -
containsKeyNew 100 31.732 ns 0.2470 ns 0.1470 ns 1 223 B - - - -
containsKeyNew 10000 61.257 ns 0.3315 ns 0.2193 ns 2 223 B - - - -
containsKeyNew 1000000 92.901 ns 1.1783 ns 0.7794 ns 3 223 B - - - -
itemCount 100 188.675 ns 2.8392 ns 1.8779 ns 1 96 B - - - -
itemCount 10000 34,833.115 ns 1,329.9992 ns 879.7121 ns 2 96 B - - - -
itemCount 1000000 8,603,079.688 ns 214,615.4686 ns 141,954.8443 ns 3 96 B - - - -
itemCountNew 100 1.745 ns 0.0415 ns 0.0274 ns 1 28 B - - - -
itemCountNew 10000 1.783 ns 0.0677 ns 0.0448 ns 1 28 B - - - -
itemCountNew 1000000 2.201 ns 0.0634 ns 0.0419 ns 2 28 B - - - -
iterForeach 100 3,177.495 ns 74.5034 ns 49.2794 ns 1 283 B 0.9727 - - 6120 B
iterForeach 10000 324,238.965 ns 4,494.5489 ns 2,674.6339 ns 2 283 B 95.2148 - - 600120 B
iterForeach 1000000 44,529,380.000 ns 1,319,685.6302 ns 872,890.3346 ns 3 283 B 9500.0000 - - 60000231 B
iterForeachNew 100 2,720.537 ns 39.5423 ns 26.1548 ns 1 1192 B 0.6409 - - 4040 B
iterForeachNew 10000 282,560.364 ns 4,493.2936 ns 2,350.0794 ns 2 1192 B 63.4766 - - 400040 B
iterForeachNew 1000000 37,056,267.857 ns 885,648.7498 ns 585,801.8121 ns 3 1192 B 6357.1429 - - 40000129 B
addItem 100 164.311 ns 3.2224 ns 2.1314 ns 1 603 B 0.0586 0.0003 - 369 B
addItem 10000 37,396.420 ns 749.3590 ns 445.9315 ns 2 603 B 11.0156 3.5156 - 69324 B
addItem 1000000 6,349,347.900 ns 89,281.6470 ns 59,054.2815 ns 3 603 B 1600.0000 90.0000 - 10096697 B
addItemNew 100 217.753 ns 3.9022 ns 2.5811 ns 1 178 B 0.0586 0.0003 - 369 B
addItemNew 10000 47,099.414 ns 820.5227 ns 542.7249 ns 2 178 B 11.0156 3.3594 - 69324 B
addItemNew 1000000 7,763,597.333 ns 119,464.6711 ns 71,091.5095 ns 3 220 B 1600.0000 90.0000 - 10096697 B
removeItem 100 12.204 ns 0.3569 ns 0.2361 ns 1 626 B 0.0064 - - 40 B
removeItem 10000 1,188.044 ns 13.2493 ns 8.7636 ns 2 626 B 0.6372 - - 4000 B
removeItem 1000000 121,771.113 ns 2,547.2179 ns 1,684.8269 ns 3 626 B 63.7500 - - 400001 B
removeItemNew 100 11.795 ns 0.3364 ns 0.2225 ns 1 145 B 0.0064 - - 40 B
removeItemNew 10000 1,146.194 ns 18.3236 ns 12.1199 ns 2 145 B 0.6372 - - 4000 B
removeItemNew 1000000 156,384.700 ns 46,225.3597 ns 30,575.2133 ns 3 145 B 60.0000 - - 400000 B

@krauthaufen (Contributor, Author) commented Jan 5, 2021:

Hey, cool that you did some benchmarks; I will add mine as soon as I have time.
As far as I can see, the current implementation seems to be a little faster in some cases; nonetheless I think the ofArray performance will be significantly better.

So if that turns out to be true I'll just submit that as a PR.

@yatli (Contributor) commented Jan 6, 2021:

@buybackoff your benchmark job is running in DEBUG:

[Host] : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG
Main : .NET Core 5.0.1 (CoreCLR 5.0.120.57516, CoreFX 5.0.120.57516), X64 RyuJIT DEBUG

Job=Main MaxRelativeError=0.01 BuildConfiguration=LocalBuild

I'm not sure why that happens -- the job should've been run in a child process in release mode (no DEBUG at the end of the line).
Try dotnet run -c Release?

@buybackoff (Contributor):

@yatli

It's this dotnet/BenchmarkDotNet#1493.

Using the workaround doesn't change the results by more than noise.

@krauthaufen (Contributor, Author):

Hey, I finally got to run the benchmarks, and the results (for count=100) seem consistent with what we've seen so far.
My implementation (New in the benchmarks) is a tiny bit slower for some operations (add, remove, etc.) and a little faster for some others (tryFind, etc.). The operations that are drastically faster are toArray (due to the known count) and ofArray (due to the merge-sort implementation).

After all I see two options here:

  1. use the new Map implementation (and tweak it a little s.t. add/remove are on par with current)
  2. use the current implementation and improve at least ofArray/ofList/etc. (maybe also include the count in inner nodes?)

What do you think?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.200-preview.20601.7
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT DEBUG
  Job-MJAPUR : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Server=False  IterationTime=100.0000 ms  MaxIterationCount=20
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Map_add 100 14,666.15 ns 97.249 ns 86.209 ns 1.00 0.00 6.1047 - - 38992 B
MapNew_add 100 16,628.96 ns 135.702 ns 120.296 ns 1.13 0.01 5.9211 - - 37360 B
Map_remove 100 14,483.32 ns 151.238 ns 134.069 ns 1.00 0.00 5.3571 - - 34368 B
MapNew_remove 100 15,675.43 ns 208.992 ns 174.518 ns 1.08 0.02 4.6875 - - 29608 B
Map_ofArray 100 11,584.76 ns 150.759 ns 125.890 ns 1.00 0.00 4.2098 - - 28360 B
MapNew_ofArray 100 5,507.70 ns 64.988 ns 54.268 ns 0.48 0.01 0.8741 - - 5600 B
Map_ofList 100 11,040.13 ns 219.175 ns 234.515 ns 1.00 0.00 4.3478 0.1087 - 27304 B
MapNew_ofList 100 6,011.46 ns 72.356 ns 67.682 ns 0.54 0.01 1.0664 - - 6792 B
Map_ofSeq 100 12,125.45 ns 115.617 ns 102.491 ns 1.00 0.00 4.5166 0.1221 - 28552 B
MapNew_ofSeq 100 6,193.12 ns 102.929 ns 96.280 ns 0.51 0.01 1.0386 - - 6792 B
Map_toArray 100 1,547.31 ns 23.038 ns 20.423 ns 1.00 0.00 1.0110 0.0153 - 6424 B
MapNew_toArray 100 776.14 ns 5.881 ns 4.911 ns 0.50 0.01 0.5076 0.0077 - 3224 B
Map_toList 100 1,074.89 ns 13.688 ns 12.804 ns 1.00 0.00 0.8908 0.0220 - 5600 B
MapNew_toList 100 1,127.79 ns 12.803 ns 10.691 ns 1.05 0.02 0.8852 0.0224 - 5600 B
Map_enumerate 100 3,105.46 ns 48.375 ns 45.250 ns 1.00 0.00 1.1766 - - 7800 B
MapNew_enumerate 100 2,975.47 ns 22.260 ns 20.822 ns 0.96 0.01 0.6194 - - 4000 B
Map_toSeq_enum 100 4,895.17 ns 41.850 ns 37.098 ns 1.00 0.00 1.6563 - - 10472 B
MapNew_toSeq_enum 100 4,999.20 ns 55.518 ns 49.215 ns 1.02 0.01 1.0408 - - 6600 B
Map_containsKey_all 100 2,417.70 ns 17.477 ns 15.493 ns 1.00 0.00 - - - -
MapNew_containsKey_all 100 2,358.98 ns 40.268 ns 41.353 ns 0.98 0.02 - - - -
Map_containsKey_nonexisting 100 27.02 ns 0.216 ns 0.180 ns 1.00 0.00 - - - -
MapNew_containsKey_nonexisting 100 22.03 ns 0.168 ns 0.157 ns 0.82 0.01 - - - -
Map_tryFind 100 31.46 ns 0.288 ns 0.255 ns 1.00 0.00 0.0035 - - 24 B
MapNew_tryFind 100 23.94 ns 0.166 ns 0.147 ns 0.76 0.01 0.0038 - - 24 B
Map_tryFind_nonexisting 100 24.36 ns 0.379 ns 0.336 ns 1.00 0.00 - - - -
MapNew_tryFind_nonexisting 100 22.16 ns 0.220 ns 0.195 ns 0.91 0.02 - - - -
Map_remove_all 100 10,491.85 ns 49.821 ns 38.897 ns 1.00 0.00 4.4044 - - 28216 B
MapNew_remove_all 100 13,698.76 ns 66.820 ns 55.797 ns 1.31 0.01 3.8631 - - 24712 B
Map_exists 100 308.65 ns 1.629 ns 1.444 ns 1.00 0.00 0.0031 - - 24 B
MapNew_exists 100 351.49 ns 2.939 ns 2.749 ns 1.14 0.01 0.0035 - - 24 B
Map_fold 100 258.62 ns 3.153 ns 2.633 ns 1.00 0.00 0.0026 - - 24 B
MapNew_fold 100 328.32 ns 2.504 ns 2.342 ns 1.27 0.01 0.0033 - - 24 B
Map_foldBack 100 264.86 ns 2.028 ns 1.797 ns 1.00 0.00 0.0027 - - 24 B
MapNew_foldBack 100 333.76 ns 2.223 ns 1.970 ns 1.26 0.01 0.0034 - - 24 B

@buybackoff (Contributor):

> What do you think?

  • Virtual methods would be the first thing I would try to remove if I wanted to optimize a hot path. It's actually surprising to me that they don't hurt performance more; maybe devirtualization kicks in. Also, there are two isinst checks on hot paths, instead of one check for a sealed class plus a null check.
  • It's more OOP than FP. If no one cares about that, then the best optimization would really be just to wrap System.Collections.Immutable: minus a couple of thousand lines of code in total, and faster. But a concise functional recursive AVL tree implementation is nice on its own.
  • In my view, get/contains/add/remove with structural sharing (i.e. immutability, not thread safety) is the most important usage of immutable maps/sets. If immutability is used only for thread safety, then trees are the wrong tool: ConcurrentDictionary is 10x faster. That said, bulk operations such as AddRange may be important. Serialization is a strong point. If one cares about performance, built-in serialization is unlikely ever to be of any use. For latency, I would prefer to deserialize in a streaming manner, i.e. from IEnumerable. For throughput with a big number of values, creating a copy of an array and sorting it just takes too much memory (but less GC in the new implementation, see the next point). With immutable collections, event sourcing and serializing deltas works best, so that one can have a snapshot of a map at multiple points.
  • System.Collections.Immutable has a clever way of optimizing bulk create operations: they have a field with IsFrozen state, and they mutate the nodes during bulk inserts. For the current case, we could treat all nodes as mutable in the ofArray/ofSeq methods until those methods return. That should reduce GC a lot.
  • Size matters, both the line count (912 current vs 1705 new) and compiled code size. E.g. there are 3 different implementations of Add. In most cases the one from a node is used, then at the last step the one from a leaf. I'm not sure what BDN reports as the code size; maybe just the size of the entry point, which wouldn't count all the code touched during execution.
  • Set and compiler collections are already done with the current version.
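The IsFrozen bulk-create trick mentioned in the list above can be illustrated with a toy example. This is a hedged sketch in Python on an unbalanced binary search tree, not the actual System.Collections.Immutable or FSharp.Core code; the names (`Node`, `insert`, `freeze`, `of_seq`) are invented for illustration.

```python
# Idea: during bulk construction, nodes we just allocated are still
# private, so we may mutate them in place instead of copying. Once the
# bulk operation returns, every node is frozen and later inserts fall
# back to the usual copy-on-write path with structural sharing.

class Node:
    __slots__ = ("key", "value", "left", "right", "frozen")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right
        self.frozen = False            # mutable until freeze() is called

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key == node.key:
        child_l, child_r, new_val = node.left, node.right, value
    elif key < node.key:
        child_l, child_r, new_val = insert(node.left, key, value), node.right, node.value
    else:
        child_l, child_r, new_val = node.left, insert(node.right, key, value), node.value
    if node.frozen:                    # immutable path: allocate a copy
        return Node(node.key, new_val, child_l, child_r)
    node.left, node.right, node.value = child_l, child_r, new_val  # bulk path: mutate
    return node

def freeze(node):
    if node is not None and not node.frozen:
        node.frozen = True
        freeze(node.left)
        freeze(node.right)

def of_seq(pairs):
    root = None
    for k, v in pairs:                 # no copy-on-write churn during build
        root = insert(root, k, v)
    freeze(root)                       # from here on: structural sharing
    return root

def to_list(node):                     # in-order traversal, for inspection
    if node is None:
        return []
    return to_list(node.left) + [(node.key, node.value)] + to_list(node.right)
```

The tree stays unbalanced here to keep the sketch short; the point is only the frozen/mutable distinction that trades GC pressure for a one-time freeze pass.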

Huge improvements here are only for bulk creation, at the cost of array creation/copying and pre-sorting the data to keep the tree balanced during addition. That could be done using the Array.Sort(keys[], values[], ..., comparer) method, without special cases for small counts, in a new method.

So, with the arguments above, I'm for option 2, and not only because it's mine and has never seen the light of day 😄

@krauthaufen (Contributor, Author):

  1. Devirtualization can't kick in here since the node types are unknown, and I don't think there's any type testing involved in the virtual calls (basically a pointer fetch, an addition, and an indirect call). I have several old benchmarks showing that virtual calls are way faster than type tests (which also require some kind of null check) for large branching factors (obviously not for three, though).
  2. ImmutableDictionary etc. weren't that awesome when I did my HashMap benchmarks, but maybe that has changed since then.
  3. The mutable in-place add is actually implemented in my map and used in ofArray for small counts.

Nonetheless, I totally get your point and I'm absolutely fine with keeping the current implementation. I actually implemented mine to get additional functions and then stumbled upon the performance gains.

Cheers

@buybackoff (Contributor):

> Devirtualization can't kick in here since the node types are unknown, and I don't think there's any type testing involved in the virtual calls

The virtual calls in many cases come after an isinst check, so maybe that works. Unfortunately all the new rules are scattered across the dotnet/runtime repo; I couldn't find a single place listing all the cases after the multiple improvements by @AndyAyersMS.

> ImmutableDictionary etc. weren't that awesome when I did my HashMap benchmarks, but maybe that has changed since then.

From a very old discussion, the benchmarks were faster: #5360 (comment)

> The mutable in-place add is actually implemented in my map and used in ofArray for small counts.

I mean methods like map.AddRange(pairs: IEnumerable<K,V>) -> map, i.e. optimizing multiple operations with less GC while keeping immutability.

@krauthaufen (Contributor, Author):

Here's my (also very old) HashMap comparison: https://github.com/krauthaufen/ImmutableHashCollections. It also included Map (which is of course a different data structure) for reference, and showed that add wasn't too good in ImmutableDictionary; the lookup, however, was terribly fast.

@AndyAyersMS (Member):

> find a single place with a list of all cases

Unfortunately there is no simple way to describe what can be devirtualized; it depends a lot on the generated IL, what other optimizations the jit can do, etc.

I can look if you like, but it may be a day or two before I have time. If you know how to use a checked JIT, you can enable JIT dumps and see what the JIT itself has to say.

Also, we are always looking for good F# benchmarks, please consider contributing some to https://github.com/dotnet/performance.

@buybackoff (Contributor):

@AndyAyersMS
Thanks for confirming that I didn't miss anything.

> I can look if you like, but it may be a day or two before I have time. If you know how to use a checked JIT, you can enable JIT dumps and see what the JIT itself has to say.

I haven't even profiled the new implementation, so I can't say whether it was devirtualized or not. It's very easy to see in dotTrace.

For this discussion, static readonly devirtualization is very relevant to your work. It's impossible to create a static readonly field and access it directly in F#. I think if that were possible, the default comparer for primitives could be inlined. Right now we do an interface call even for the int32 comparer; dotTrace shows 14% of time in Compare for tryGetValue.

@buybackoff (Contributor):

I improved the performance of the current version at a small memory cost here: #10845

@cartermp (Contributor) commented Jan 7, 2021:

@krauthaufen in light of the discussion here, how possible is it to extract the improvements you've made (e.g. to ToArray) independently of the rest?

@krauthaufen (Contributor, Author):

Hey, after doing yet another map implementation (Yam) with the insights gained here, I managed to make most operations a bit faster (note especially the ofArray performance, which now doesn't copy anything and therefore runs with O(1) scratch memory). @cartermp this basically treats the tree as mutable during build (in ofArray, ofList and ofSeq) and reuses as many nodes as possible.
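A common way to get fast bulk construction like the ofArray described above is to sort the input once and then build the balanced tree directly from the sorted span in O(n). The following is a hedged Python sketch of that general technique, not the PR's actual F# algorithm; `of_array`, the last-write-wins dedup, and the tuple node encoding are my own illustrative choices.

```python
# Sketch: ofArray via one sort plus a bottom-up balanced build.
# Duplicate keys keep the last value, matching the overwrite semantics
# of repeated adds. Nodes are plain tuples (left, key, value, right).

def of_array(pairs):
    pairs = sorted(pairs, key=lambda kv: kv[0])   # single stable sort
    dedup = []
    for k, v in pairs:
        if dedup and dedup[-1][0] == k:
            dedup[-1] = (k, v)                    # later entry wins
        else:
            dedup.append((k, v))

    def build(lo, hi):
        # build a perfectly balanced subtree for dedup[lo:hi]
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        k, v = dedup[mid]
        return (build(lo, mid), k, v, build(mid + 1, hi))

    return build(0, len(dedup))

def to_list(node):                                # in-order traversal
    if node is None:
        return []
    left, k, v, right = node
    return to_list(left) + [(k, v)] + to_list(right)
```

Compared with inserting n items one by one (O(n log n) with rebalancing and copy-on-write allocation at every step), this allocates exactly one node per surviving key.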

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.746 (2004/?/20H1)
Intel Core i7-4930K CPU 3.40GHz (Haswell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  Job-WQVKON : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Server=False
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Yam_add 100 15,401.56 ns 243.887 ns 216.199 ns 0.80 0.02 5.8289 - - 36608 B
Map_add 100 19,254.27 ns 378.406 ns 566.381 ns 1.00 0.00 5.7373 - - 36136 B
Yam_remove 100 15,461.79 ns 307.595 ns 554.657 ns 0.79 0.04 4.9744 - - 31232 B
Map_remove 100 19,426.96 ns 381.377 ns 546.959 ns 1.00 0.00 5.4321 - - 34144 B
Yam_ofArray 100 9,095.73 ns 181.949 ns 367.546 ns 0.60 0.02 0.7019 0.0153 - 4416 B
Map_ofArray 100 15,010.79 ns 194.285 ns 162.237 ns 1.00 0.00 4.4556 0.1373 - 28000 B
Yam_toArray 100 1,379.79 ns 27.016 ns 42.850 ns 0.65 0.03 0.5131 0.0114 - 3224 B
Map_toArray 100 2,122.62 ns 41.669 ns 66.091 ns 1.00 0.00 1.0223 0.0191 - 6424 B
Yam_containsKey_all 100 3,586.66 ns 55.260 ns 51.690 ns 1.00 0.08 - - - -
Map_containsKey_all 100 3,622.26 ns 71.927 ns 176.438 ns 1.00 0.00 - - - -
Yam_containsKey_nonexisting 100 28.18 ns 0.570 ns 0.610 ns 0.85 0.02 - - - -
Map_containsKey_nonexisting 100 33.33 ns 0.581 ns 0.515 ns 1.00 0.00 - - - -
Yam_exists 100 380.36 ns 5.128 ns 4.546 ns 0.92 0.01 0.0038 - - 24 B
Map_exists 100 413.72 ns 3.673 ns 3.436 ns 1.00 0.00 0.0038 - - 24 B

When maintaining a count per inner node (allowing for O(1) count and O(log N) positional queries), the results are still relatively acceptable:
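The per-node count idea can be sketched like this. It is an illustrative, unbalanced toy in Python, not the PR's F# code; `SNode`, `add`, and `try_at` are invented names (though a `tryAt` combinator was floated earlier in the thread).

```python
# Each node caches the size of its subtree. Count becomes an O(1) field
# read, and "give me the i-th smallest entry" becomes an O(height) walk
# guided by the cached left-subtree sizes. A real implementation would
# rebalance (AVL); this toy omits that to stay short.

class SNode:
    __slots__ = ("key", "value", "left", "right", "size")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right
        self.size = 1 + size(left) + size(right)   # cached subtree size

def size(node):
    return 0 if node is None else node.size

def add(node, key, value):
    """Copy-on-write insert; sizes are recomputed on the rebuilt path."""
    if node is None:
        return SNode(key, value)
    if key < node.key:
        return SNode(node.key, node.value, add(node.left, key, value), node.right)
    if key > node.key:
        return SNode(node.key, node.value, node.left, add(node.right, key, value))
    return SNode(key, value, node.left, node.right)

def try_at(node, i):
    """Return the i-th smallest (key, value) pair, or None if out of range."""
    if node is None or i < 0 or i >= node.size:
        return None
    l = size(node.left)
    if i < l:
        return try_at(node.left, i)
    if i == l:
        return (node.key, node.value)
    return try_at(node.right, i - l - 1)
```

The cost is one extra int per node and slightly more work in add/remove, which matches the small slowdowns visible in the second benchmark table below.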

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.746 (2004/?/20H1)
Intel Core i7-4930K CPU 3.40GHz (Haswell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  Job-NTZQCI : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Server=False
Method Count Mean Error StdDev Ratio RatioSD Gen 0 Gen 1 Gen 2 Allocated
Yam_add 100 17,137.13 ns 338.754 ns 527.398 ns 0.88 0.03 5.7983 - - 36424 B
Map_add 100 19,499.56 ns 309.162 ns 289.190 ns 1.00 0.00 5.8594 - - 36832 B
Yam_remove 100 17,596.50 ns 344.689 ns 423.309 ns 0.98 0.03 5.8289 - - 36672 B
Map_remove 100 17,981.42 ns 359.338 ns 413.814 ns 1.00 0.00 5.4321 - - 34112 B
Yam_ofArray 100 11,144.64 ns 221.712 ns 217.751 ns 0.67 0.02 0.7782 0.0153 - 4912 B
Map_ofArray 100 16,669.08 ns 331.241 ns 622.151 ns 1.00 0.00 4.5776 0.1221 - 28744 B
Yam_toArray 100 1,025.46 ns 18.328 ns 30.114 ns 0.48 0.03 0.5131 0.0114 - 3224 B
Map_toArray 100 2,137.10 ns 40.995 ns 66.199 ns 1.00 0.00 1.0223 0.0191 - 6424 B
Yam_containsKey_all 100 3,575.39 ns 70.659 ns 96.719 ns 1.05 0.03 - - - -
Map_containsKey_all 100 3,428.17 ns 67.072 ns 74.550 ns 1.00 0.00 - - - -
Yam_containsKey_nonexisting 100 33.52 ns 0.708 ns 1.038 ns 1.08 0.03 - - - -
Map_containsKey_nonexisting 100 31.19 ns 0.516 ns 0.403 ns 1.00 0.00 - - - -
Yam_exists 100 366.18 ns 7.328 ns 9.268 ns 0.95 0.03 0.0038 - - 24 B
Map_exists 100 385.18 ns 5.914 ns 6.328 ns 1.00 0.00 0.0038 - - 24 B

However, note that the baseline Map numbers differ between the two runs, so I'll investigate.

If you're still interested in the (now smaller) improvements, I can create a new PR (or adapt this one). Note that I have something in mind for keeping the overall count (not per inner node, but globally per map) that should be quite efficient.
Please let me know what you think. The implementation is here.

@cartermp (Contributor):

Interesting. I think it'd be worth looking at a fresh PR with your findings, with measurements against the latest main.

cartermp pushed a commit that referenced this pull request Feb 26, 2021
The improvements to Map have only made it to 5.0.1 nuget package, not 5.0.0
see #10768 (comment)
cc @buybackoff  @krauthaufen
@KevinRansom (Member):

@krauthaufen,

Big changes like this scare us to death; there are a lot of conflicts, and there hasn't been much traffic for 6 months. Could you possibly find a way to parcel the changes up into smaller PRs that are easier to digest and less scary?

Closing for now.

Thanks,

Kevin
