hill climbing #191

Merged: 10 commits into main, Aug 22, 2022
Conversation

bitfaster (Owner) commented Aug 21, 2022

Use hill climbing to optimize hit rate by adapting the size of the window and main segments based on changes in the hit rate at run time.
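The adaptation loop described above can be sketched as follows. This is a minimal illustration with hypothetical names and an assumed initial step size, written in Java rather than the library's C#; the real implementation samples the hit rate over a window of requests and applies the adjustment to the window/main partition.

```java
// Minimal hill-climbing sketch (hypothetical, not the PR's actual code):
// after each sample period, keep moving the window/main split in the same
// direction while the hit rate improves, and reverse when it drops.
final class HillClimberSketch {
    double stepSize = 0.0625;     // assumed initial step, as a fraction of capacity
    double previousHitRate = 0.0;

    /** Returns the signed adjustment to apply to the window size. */
    double climb(double hitRate) {
        double change = hitRate - previousHitRate;
        previousHitRate = hitRate;
        if (change < 0) {
            stepSize = -stepSize; // hit rate got worse: reverse direction
        }
        return stepSize;
    }

    public static void main(String[] args) {
        HillClimberSketch c = new HillClimberSketch();
        System.out.println(c.climb(0.50) > 0); // improving: keep direction -> true
        System.out.println(c.climb(0.40) < 0); // got worse: reverse        -> true
    }
}
```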

This significantly improves the hit rate for the ARC OLTP trace; see below.

ARC OLTP (hit rate, %)

| CacheSize | ClassicLruHitRate | ConcurrentLruHitRate | ConcurrentLfuHitRate |
|----------:|------------------:|---------------------:|---------------------:|
| 250 | 16.47 | 15.63 | 24.73 |
| 500 | 23.45 | 27.13 | 33.70 |
| 750 | 28.28 | 32.29 | 37.11 |
| 1000 | 32.83 | 36.01 | 39.95 |
| 1250 | 36.21 | 38.68 | 42.33 |
| 1500 | 38.70 | 40.69 | 43.71 |
| 1750 | 40.79 | 42.48 | 45.01 |
| 2000 | 42.47 | 44.17 | 46.07 |

[image: ARC OLTP hit rate chart]

Previous result without hill climbing:

[image: previous ARC OLTP hit rate chart, without hill climbing]

ARC Database (hit rate, %)

| CacheSize | ClassicLruHitRate | ConcurrentLruHitRate | ConcurrentLfuHitRate |
|----------:|------------------:|---------------------:|---------------------:|
| 1000000 | 3.09 | 11.95 | 14.15 |
| 2000000 | 10.74 | 23.24 | 28.40 |
| 3000000 | 18.59 | 36.19 | 39.72 |
| 4000000 | 20.24 | 39.61 | 45.10 |
| 5000000 | 21.03 | 45.15 | 50.87 |
| 6000000 | 33.95 | 50.37 | 57.59 |
| 7000000 | 38.90 | 56.08 | 63.89 |
| 8000000 | 43.03 | 61.00 | 70.24 |

[image: ARC Database hit rate chart]

@bitfaster bitfaster marked this pull request as ready for review August 22, 2022 03:53
@bitfaster bitfaster merged commit 20f8412 into main Aug 22, 2022
@bitfaster bitfaster deleted the users/alexpeck/climb branch August 22, 2022 03:53
```csharp
double amount = (hitRateChange >= 0) ? stepSize : -stepSize;

double nextStepSize = (Math.Abs(hitRateChange) >= HillClimberRestartThreshold)
    ? HillClimberStepPercent * (amount >= 0 ? 1 : -1)
    // (the other branch of the conditional is truncated in the diff excerpt)
```
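A self-contained sketch of the step-size update shown above (Java for illustration; the constants are assumptions, following the adaptive-sizing scheme in Caffeine that this PR ports): a large swing in hit rate restarts the step at its initial size in the current direction, otherwise the step decays toward zero.

```java
// Hedged sketch of the step-size restart/decay logic (assumed constants,
// hypothetical names; not the PR's actual code).
final class StepSizeSketch {
    static final double RestartThreshold = 0.05;  // assumed
    static final double StepPercent = 0.0625;     // assumed
    static final double StepDecayRate = 0.98;     // assumed

    static double nextStepSize(double hitRateChange, double stepSize) {
        // Step in the direction of the last change in hit rate.
        double amount = (hitRateChange >= 0) ? stepSize : -stepSize;
        return (Math.abs(hitRateChange) >= RestartThreshold)
            ? StepPercent * (amount >= 0 ? 1 : -1)  // big swing: restart the step
            : StepDecayRate * amount;                // small swing: decay the step
    }

    public static void main(String[] args) {
        System.out.println(nextStepSize(0.10, 0.02)); // restart: 0.0625
        System.out.println(nextStepSize(0.01, 0.02)); // decay: ~0.0196
    }
}
```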

Reviewer: is this supposed to be multiplied by the maximum?

bitfaster (Owner, Author):

Without the max, amount is computed as a percent change to the ratio of window to main. Since it's a percentage, I just add it directly to mainRatio (also a percentage) and pass that into the ComputeQueueCapacity() function I already had. I should probably clean this up; I was excited to see it working.

In your code, I think amount is computed as the actual number of slots, and setAdjustment() adds/subtracts this number from the main and window queue capacities (I couldn't find where this is implemented when searching your code, but it seems like the unit of amount must be number of slots).

Do you have any clamping to prevent drift to an invalid state?

Reviewer:

The increase/decrease methods check that it doesn’t go beyond a bound. Sounds like it should have the same result?

Congrats on getting it working! Have you tried the stress test scenario yet (corda & loop)?
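The bounded adjustment being discussed might look like this in minimal form. This is a sketch with assumed bounds and hypothetical names, not the actual increase/decrease methods: each step is applied to the window ratio and clamped, so repeated steps cannot drift the partition into an invalid state.

```java
// Sketch of clamped partition adjustment (hypothetical names and bounds).
final class PartitionSketch {
    static final double MinWindowRatio = 0.01;  // assumed lower bound
    static final double MaxWindowRatio = 0.80;  // assumed upper bound
    double windowRatio = 0.01;

    /** Applies a signed adjustment, clamped to [min, max]. */
    void adjust(double amount) {
        windowRatio = Math.max(MinWindowRatio,
                      Math.min(MaxWindowRatio, windowRatio + amount));
    }

    public static void main(String[] args) {
        PartitionSketch p = new PartitionSketch();
        p.adjust(5.0);                      // overshoots: clamped at upper bound
        System.out.println(p.windowRatio);  // 0.8
        p.adjust(-10.0);                    // undershoots: clamped at lower bound
        System.out.println(p.windowRatio);  // 0.01
    }
}
```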

Reviewer:

Oh in mine it’s number of units where an entry may take multiple (e.g. memory bound). Not sure if that adds a wrinkle or works fine in your approach.

bitfaster (Owner, Author):

I haven't tested exhaustively yet, but I think it is working well.

Uniform weight for sure makes it much simpler. I was puzzled for a while about all the different cases handled in your evictFromMain method; then I realized that weighting results in more complicated queue configurations that are unreachable for me.

I will try combining corda and loop. I made a unit test called WhenHitRateFluctuatesWindowIsAdapted that does a sanity check by directly manipulating the hit rate, and it works as expected. Since I copy-pasted all your carefully tuned parameters and the core logic, I think on the same traces it will produce a very similar, if not identical, result.

It is really an ingenious addition: probably 5 or 6 lines of code that increase the hit rate by more than 5%.

Reviewer:

wow, that's crazy. I wonder what AMD could be doing???

bitfaster (Owner, Author):

It's weird. I read in Agner Fog's microarchitecture manual that Zen rapidly adjusts the clock speed, which can make performance hard to measure. It will be under constant load during the benchmark, so it seems unlikely it would clock down, and anyway it always affects the same data size; likely it's related to the cache somehow. It's quite a big fluctuation.

I tried running with processor affinity to stick to a single core, and I did a hack to make sure the array is always at a 4-byte-aligned address. Neither made any difference.

Reviewer:

> The Zen 1 can do two memory read operations or one read and one write operation, but not two write operations, in the same clock cycle. The Zen 2 can do two memory read and one write operation per clock cycle.

Maybe it is not realizing that the writes are to the same cache line and is creating a store-buffer data dependency? Ideally it would coalesce the writes into one memory operation, but if not, then I suppose it would be slower. How was the performance of the frequency benchmark?

bitfaster (Owner, Author):

That's a good point: frequency is better with block on AMD and matches expectations. The strange part is that it varies run to run, hence forcing alignment etc. to try to reduce variables.

I pulled all the data for the eviction test, and block is equal or better in all cases:

[image: eviction test results, block vs. flat]

I wondered if the difference you saw in your CI results was due to the difference between AMD and Intel architectures, but my result is not quite the same, since you had both frequency and increment with similar degradation.

Reviewer:

How does the read throughput compare? An increment is the common case, but strangely that was faster in my CI runs too, even though it claimed the sketch was slower (sequenced steps on the same machine).

Very confusing, but I'm glad the end-result benchmarks all show our ideas turned out well. Still... 🤷‍♂️

bitfaster (Owner, Author):

I haven't tested pure reads. For read+write, I tested size=500 and it's similar to eviction. From memory, I think it was about 9.5 million ops/sec for flat, ~10.5 million for flat AVX and block, and about 11.5 million for block AVX. Block is consistently better, and the AVX version gives a larger benefit. I will do a proper comparison with all sizes when I get a chance.

I can't argue with the results, but I would like to understand the cause of the fluctuations. I will run the tests on the current Azure offerings; I think the AMD SKUs are based on Zen 2.

2 participants